How to write AAX plug-ins for Avid's TI DSP-based platforms.
Overview of TI DSP Algorithms in AAX
Avid's hardware-accelerated audio systems allow AAX plug-ins to offload their real-time processing tasks to a dedicated processor, guaranteeing reliable performance at ultra-low latency. Avid's TI DSP-based products utilize Texas Instruments DSP chips to host plug-ins in a managed shell environment.
The AAX host handles all system-level communications and resources on the DSP and provides a consistent API to manage communication between the plug-in's real-time algorithm and its other components. This design allows AAX plug-ins to use the same communication methods whether they are running natively, on a TI-based accelerated system, or in some other distributed environment.
Each AAX plug-in contains a real-time algorithm callback. For TI DSP-based platforms, this callback is compiled into a relocatable ELF DLL. This library is loaded onto the appropriate DSP by the host, and may share the DSP with other plug-ins if the host determines that the required system resources are available. A real-time execution environment called the TI Shell is also loaded onto each DSP. The TI Shell manages the DSP's memory and interrupts and guarantees reliable real-time performance even at single sample operation.
Getting Started with HDX DSP
This section provides a quick overview of what you will need for creating an AAX DSP plug-in to run on Avid's TI-based HDX DSP platforms.
-
Plug-in structure
The algorithm component for an AAX DSP plug-in is compiled into a binary that runs on the TI DSP. Because this algorithm callback runs on a different device from the rest of the plug-in, the algorithm must be separated from the plug-in's other components and no pointers may be shared between the two. All memory used by the algorithm must be set up via fields in the algorithm's context structure, and the AAX packet system must be used for transmitting coefficients from the plug-in data model to the algorithm.
For more information about the structure of an AAX plug-in algorithm and features for communicating with the algorithm, see Real-time algorithm callback.
-
Development Environment
To compile your plug-in's AAX DSP binary you will need to run TI's free Code Composer Studio (CCS) IDE on a Windows system or VM.
The latest Code Composer Studio versions no longer support compilation for C6727 DSPs, so you must use CCS version 7 or earlier.
To get Code Composer Studio version 7, visit https://www.ti.com/tool/ccstudio and navigate to Previous Versions > Legacy versions > Code Composer Studio Version 7 Downloads
For additional steps to set up Code Composer Studio see TI Development Tools
-
Language support
The C6727 compiler for AAX DSP plug-ins supports C and C++ up to C++98. In particular, note that no C++11 or later language support is available. For more specific details see C++ standard support.
The HDX DSP Platform
HDX DSP is Avid's core mixer and plug-in accelerator platform. Avid's HDX and Pro Tools | Carbon systems both use the HDX DSP platform, with multiple TI C6727 DSPs each clocked at 350 MHz. These DSPs utilize a 32-bit floating-point architecture, with the option to perform 64-bit double-precision operations at some performance cost. Each HDX card includes 18 DSPs and is connected to the host system over a high bandwidth PCIe connection, while each Pro Tools | Carbon system includes 8 DSPs and is connected to the host system over a Gigabit Ethernet connection.
DSP characteristics: instruction processing
The C6727 DSP utilizes a VLIW architecture and contains dual data paths. Each data path includes four independent functional units, so the DSP can accommodate up to 8 parallel instructions per cycle. To take advantage of this architecture, the TI compiler relies heavily on instruction pipelining for optimization.
DSP characteristics: audio buffers
In order to realize the maximum possible performance benefit from this architecture, the algorithm routine for a single HDX DSP plug-in is always called with the same buffer size. By guaranteeing that each algorithm will be called with a consistent buffer size, the TI compiler is able to properly account for any possible iterative instruction pipelining, resulting in large performance gains.
HDX DSP uses a four-sample processing quantum by default for plug-in instances. Plug-ins that require additional processing time per callback, e.g. to mitigate the overhead cost of the chip's DMA facilities, may optionally request a 16, 32, or 64-sample quantum. Note that at higher block sizes, the number of potential I/O channels available to plug-ins on a chip will be reduced.
DSP characteristics: memory
Each DSP on the HDX DSP platform includes 16 MB of external RAM and 256 kB of internal RAM. The DSP has the ability to execute code from either internal or external RAM, though the real-time performance cost of external RAM accesses is significant. The chip's internal RAM is addressable at the core clock rate.
Each DSP also has a program cache of 32 kB. Plug-in code is loaded into this cache from internal memory, so for best performance your plug-in should not use more than 32 kB for its program code. You can look at the CCS-generated .map file to find your plug-in's program code size.
SDRAM performance
Asynchronous access to data in the C6727's SDRAM is very slow, requiring 50 cycles/word to read and 15 cycles/word to write. This is primarily due to clock domain bridging, lack of data caching, and the fact that data from the core is given a low priority in order to avoid stalling real-time DMA transfers.
Executing program code from external memory
The TI C6727 supports executing program code from external memory. When executing from uncached external memory, expect cycle counts to increase by a factor of 4 to 5 compared with the equivalent internal-memory code. Assuming that no cache thrashing occurs, subsequent calls will execute from the program cache, so the cycle counts will be similar whether the program is located in external or internal memory.
- Note
- The CCSv4 Profiler contains a bug that produces incorrect cycle counts for cached external-memory program code. Therefore, when gathering cycle count data for a plug-in that stores its program data in external memory, an RTI-based timing method should be used.
System characteristics: DSP/host data transfers
Plug-ins loaded onto the HDX DSP platform may transfer arbitrarily large data blocks between the DSP and the host, within the limits of available DSP memory and system bandwidth.
DSP/host bandwidth
Neither AAX nor the HDX DSP platform includes any explicit constraint that limits plug-in bandwidth. If a plug-in's data transfer requests reach the physical bandwidth limit of the system, the blocking data transfer request on the host will be delayed, because the transfer is held off in favor of higher-priority operations on the DSP. Automation data may also be delayed in reaching other plug-ins on the affected DSPs in the same group.
The recommended upper limit for DSP/host data transfer requests in an individual plug-in when running on an HDX PCIe card is 10 MB/s, divided by the maximum number of plug-in instances that will run on a single chip. On the HDX card, DSPs are wired to the FPGA crossbar in groups of three, with a data bandwidth of approximately 67 MB/s for each group. The overall system bandwidth for each DSP is therefore approximately 20 MB/s. This bandwidth is shared by all data reads and writes, including custom data transfer requests as well as plug-in and mixer automation and metering data.
This limit is significantly lower on Pro Tools | Carbon. Carbon uses a single group for all eight DSPs, so the overall system bandwidth for each DSP is approximately 8 MB/s. In addition, data transfers between Carbon and the host system must be executed over a Gigabit Ethernet connection with up to 75% of its bandwidth already reserved for AVB audio data. This leaves 250 Mb/s for all other command traffic. If your plug-in utilizes frequent or large DSP/host data transfers then be sure to test it on Pro Tools | Carbon to verify whether it is compatible.
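As a rough worked example of the bandwidth figures above, the following C++ sketch divides a group's bandwidth between the DSPs that share it and derives a per-instance transfer budget. The function names are illustrative and are not part of the AAX SDK, and the even split between DSPs is a simplifying assumption.

```cpp
#include <cassert>

// Approximate bandwidth available to one DSP when a group's bandwidth is
// shared evenly between the DSPs in that group.
// (Even sharing is an assumption made for illustration.)
double PerDspBandwidthMBps(double groupBandwidthMBps, int dspsPerGroup)
{
    assert(dspsPerGroup > 0);
    return groupBandwidthMBps / dspsPerGroup;
}

// Recommended upper limit for one plug-in instance on an HDX PCIe card:
// 10 MB/s divided by the maximum number of instances on a single chip.
double PerInstanceBudgetMBps(int maxInstancesPerChip)
{
    assert(maxInstancesPerChip > 0);
    const double kRecommendedChipLimitMBps = 10.0;
    return kRecommendedChipLimitMBps / maxInstancesPerChip;
}
```

With these helpers, PerDspBandwidthMBps(67.0, 3) gives roughly 22 MB/s for an HDX card (in line with the approximate 20 MB/s figure), PerDspBandwidthMBps(67.0, 8) gives roughly 8 MB/s for Carbon, and a plug-in that runs eight instances per chip would have a budget of 1.25 MB/s per instance on HDX.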
DSP/host data transfer characteristics
The minimum data transfer size for all host-to-DSP communications for HDX PCIe cards is 128 bytes. This limit applies to all host-to-DSP data transfers, including data sent to buffered ports, unbuffered ports, and private data blocks (via the AAX Direct Data interface).
Since each transfer has a minimum size of 128 bytes, the use of many small packets does not increase transfer efficiency or save system bandwidth. Quite the opposite: updating a single 64-byte packet would require less bandwidth than updating two 4-byte packets in an HDX PCIe system, since the former would require only one 128-byte transfer while the latter will require two.
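The comparison above can be checked with a short C++ sketch. The helper name is hypothetical, and the model assumes that every packet is sent as its own transfer and that packets above the minimum are counted at their own size; the text only states the 128-byte minimum.

```cpp
#include <cstddef>
#include <vector>

// Minimum host-to-DSP transfer size on HDX PCIe cards, from the text above.
const std::size_t kMinTransferBytes = 128;

// Estimated bytes moved when each packet is sent as its own transfer.
// Assumption: a packet smaller than the minimum costs one full 128-byte
// transfer; larger packets are counted at their own size.
std::size_t EstimatedTransferBytes(const std::vector<std::size_t>& packetSizes)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < packetSizes.size(); ++i)
    {
        const std::size_t size = packetSizes[i];
        total += (size < kMinTransferBytes) ? kMinTransferBytes : size;
    }
    return total;
}
```

Under these assumptions, one 64-byte packet costs 128 bytes of transfer while two 4-byte packets cost 256 bytes.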
On Pro Tools | Carbon there is no minimum data transfer size. That said, for best performance on Carbon it is still recommended to minimize the number of data packets that are sent.
TI Shell characteristics: Memory allocation
Memory resource availability
The TI Shell code that is loaded onto each DSP uses approximately 56 kB of internal memory, leaving 200 kB of internal memory per DSP. This memory is shared between the plug-ins on the chip and holds the plug-ins' code and data, per-instance blocks declared in Describe(), and instance overhead.
As a general guideline, plug-in instances should not use more than 200 / n kB of internal memory, where n is the number of instances of your plug-in that will run on a single chip based on its cycle count requirements. If each plug-in instance on the chip requires more internal memory than this then the plug-in may need to declare an explicit number of instances that can run per chip based on this memory usage rather than declaring its cycle count utilization.
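The guideline above is simple arithmetic; a minimal C++ sketch follows. The function names are illustrative, and the second helper ignores shared program code and static data, so treat its result as an upper bound rather than a value to report directly.

```cpp
#include <cassert>

// Internal memory figures from the text: 256 kB per DSP, of which the
// TI Shell uses approximately 56 kB.
const int kInternalMemoryKB = 256;
const int kShellMemoryKB = 56;

// Guideline budget per instance: 200 / n kB for n instances per chip.
double InstanceMemoryBudgetKB(int instancesPerChip)
{
    assert(instancesPerChip > 0);
    return static_cast<double>(kInternalMemoryKB - kShellMemoryKB) / instancesPerChip;
}

// Rough upper bound on instances per chip when memory, not cycle count,
// is the limiting resource. Ignores memory shared between instances.
int MaxInstancesForMemory(double instanceMemoryKB)
{
    assert(instanceMemoryKB > 0.0);
    return static_cast<int>((kInternalMemoryKB - kShellMemoryKB) / instanceMemoryKB);
}
```

For example, a plug-in that fits eight instances per chip by cycle count has a budget of 25 kB per instance, while a plug-in that needs 30 kB per instance could fit at most six instances by this estimate.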
Shared and per-instance memory allocation
When a plug-in instance is created on a DSP, its program code is loaded onto that DSP. This copy of the program code is then re-used for all subsequent instances of the effect that are loaded onto the DSP. Static and global data are also shared between all instances of an effect on the DSP. Other allocations, such as coefficient and private data blocks, are per-instance.
- Host Compatibility Notes:
- Beginning in Pro Tools 11, AAX DSP algorithms also support optional temporary data spaces that can be described in the Describe module and are shared among all instances on a DSP. This is an alternative to declaring large data blocks on the stack for better memory management and to prevent stack overflows. Please refer to AAX_IComponentDescriptor::AddTemporaryData() for usage instructions.
Placing data into external memory
An AAX plug-in may optionally request that its private data or program code be placed into external memory. Because standard access calls to the DSP's SDRAM are very slow, it is strongly recommended that all of a plug-in's real-time data be placed in internal RAM, and the TI Shell will load a plug-in's program code and all private plug-in data blocks into internal memory by default.
If the internal memory requested for plug-in data plus the memory required by the TI Shell exceeds 256 kB, the result is undefined behavior, so it is important to explicitly request external memory for plug-in data when appropriate.
To load program code, static data, or global variables into external memory, use the TI SECTION pragmas. For example, #pragma CODE_SECTION(".extmem") can be used before the definitions of functions that contain either initialization code or infrequently used background code. For static variables, use #pragma DATA_SECTION(".extmemdata") before each variable definition.
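A minimal sketch of this placement follows, using the C++ form of TI's CODE_SECTION and DATA_SECTION pragmas, which apply to the next declaration. The table and function names are hypothetical. The pragmas are guarded with the TI compiler's predefined __TI_COMPILER_VERSION__ macro so that the same source also compiles with a host compiler, where they have no effect.

```cpp
#include <cassert>

const int kTableSize = 1024;

// Place a large table that is only touched outside the render callback
// into the external-memory data section named in the text above.
#if defined(__TI_COMPILER_VERSION__)
#pragma DATA_SECTION(".extmemdata")
#endif
static float sInitTable[kTableSize];

// Initialization code is a good candidate for external memory because it
// runs rarely and is not part of the real-time path.
#if defined(__TI_COMPILER_VERSION__)
#pragma CODE_SECTION(".extmem")
#endif
void MyEffect_InitTable()
{
    for (int i = 0; i < kTableSize; ++i)
        sInitTable[i] = static_cast<float>(i) / kTableSize;
}

// Reads one table entry; intended for non-real-time use only, since
// external memory reads are slow.
float MyEffect_TableValue(int index)
{
    assert(index >= 0 && index < kTableSize);
    return sInitTable[index];
}
```

Data that the render callback reads on every call should stay in internal memory; this sketch deliberately moves only initialization-time code and data.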
DMA support
Because of the slower access time of external RAM, you should consider using a DMA transfer for recurring transfers, and possibly even for larger one-time transfers. This is of particular relevance for data reads, which must traverse the various clock domains and priority switches twice (address send, and then data return).
The TI Shell supports three DMA modes: Scatter (for transfers from internal to external memory), Gather (for transfers from external to internal memory), and Burst (contiguous block copies). The Scatter mode can accomplish transfer speeds of up to 2.1 DSP cycles/byte transferred, while the Gather mode can accomplish 2.7 cycles/byte transferred.
The Scatter and Gather DMA facilities use a linear buffer for internal memory and a FIFO for external memory. It is possible to transfer to or from multiple offsets within the external memory FIFO using an offset table, which can contain up to 65,536 (2^16) entries. The offset (burst) length may be 4, 8, 16, 32, or 64 bytes long.
The Burst DMA mode implements linear data reads or writes.
For more information on DMA support and for example code, see \ExamplePlugIns\DemoGain_DMA in the SDK.
TI Shell characteristics: Data packet services
In addition to supporting direct transfers of arbitrary data via DMA, the TI Shell also supports a packetized data delivery mechanism for host-to-DSP data transfers. Packet delivery ports may be either unbuffered or buffered, and are described using the AAX_EDataInPortType parameter in AAX_IComponentDescriptor::AddDataInPort().
Unbuffered ports
Unbuffered ports use a straightforward implementation that delivers posted packets to the algorithm as soon as possible. In an unbuffered port, newer packets will always override older packets. Therefore, an algorithm may not receive every packet that was posted to an unbuffered port, but it will always receive the most up-to-date information possible.
Unbuffered ports deliver their data without blocking or synchronizing with the algorithm's execution. Although bus arbitration guarantees that a read from the algorithm callback will not occur in the middle of a write from the host, it is important to note that the data in an unbuffered port may change during algorithm execution.
Buffered ports
Buffered data ports store incoming packets in a host-managed queue. This queue acts as a buffer and provides the host with more flexibility in how it delivers packets. A key feature of buffered data ports is that new data will never be delivered to these ports during algorithm execution.
The behavior of buffered data ports varies depending on the host platform. In HDX DSP plug-ins, buffered data ports use a FIFO to queue data packets as they are posted. New packets are dequeued and delivered to the algorithm individually, with the next packet arriving before each algorithm render callback.
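The difference between the two port types can be illustrated with a toy C++ model. These classes are not part of the AAX SDK and do not reflect the host's implementation; they only mimic the delivery rules described above, using an int in place of a packet. Keeping the last packet when the queue is empty is an assumption of the model.

```cpp
#include <deque>

// Toy model of an unbuffered port: a newer packet overwrites an older one,
// so the reader only ever sees the most recent value.
struct UnbufferedPortModel
{
    int mLatest;
    UnbufferedPortModel() : mLatest(0) {}
    void Post(int packet) { mLatest = packet; }
    int Read() const { return mLatest; }
};

// Toy model of a buffered port on HDX: posted packets are queued in FIFO
// order and one packet is delivered before each render callback.
struct BufferedPortModel
{
    std::deque<int> mQueue;
    int mCurrent;
    BufferedPortModel() : mCurrent(0) {}
    void Post(int packet) { mQueue.push_back(packet); }

    // Call once before each render callback. If nothing is queued, the
    // previously delivered packet stays in place (model assumption).
    int DeliverNext()
    {
        if (!mQueue.empty())
        {
            mCurrent = mQueue.front();
            mQueue.pop_front();
        }
        return mCurrent;
    }
};
```

Posting the values 1, 2, 3 between two callbacks leaves the unbuffered model holding only 3, whereas the buffered model hands 1, 2, and 3 to three successive callbacks.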
Data port overhead and restrictions
Each HDX DSP supports a maximum of 164 buffered data ports, which matches the maximum I/O limit for each DSP. System overhead costs associated with using the on-chip packet services are as follows:
Memory Overhead
-
The memory overhead for an unbuffered data port is simply the size of the data packet.
-
The DSP memory overhead for a buffered data port is two times the size of the data packet. A large (>100-element) packet queue is also allocated on the host.
CPU overhead
Unbuffered ports do not incur any additional CPU overhead.
Individual buffered ports incur non-trivial CPU overhead. For example, in Pro Tools 10.2 each buffered port requires 5 cycles of overhead per render callback. This overhead can quickly add up in "small" plug-ins that contain many buffered data ports. Therefore, we strongly recommend that plug-ins use consolidated coefficient packets when possible in order to minimize this overhead. This optimization can result in large performance gains for callbacks that require 1000 or fewer cycles to operate.
The trade-off of this optimization is that more work ends up being done on the host and more data must be transmitted to the algorithm, since the entire coefficient packet must be re-calculated and re-sent every time any of its input parameters change. This is usually a beneficial trade-off to make, especially given the 128-byte per-transfer minimum for HDX PCIe cards discussed above. However, care must be taken in extreme cases, such as when packet delivery threatens to reach the maximum recommended bandwidth for host/DSP data transfers, especially on Pro Tools | Carbon.
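To make the consolidation concrete, the sketch below groups several coefficients into one packet structure and estimates the per-callback port overhead. The structure, its field names, and the helper are hypothetical; the per-port cycle cost is passed in as a parameter because it depends on the host version.

```cpp
// Hypothetical consolidated coefficient packet: every coefficient that the
// algorithm needs travels together through a single buffered port.
struct SMyEffect_Coefs
{
    float mGain;
    float mB0;
    float mB1;
    float mB2;
    float mA1;
    float mA2;
};

// Per-callback overhead from buffered ports, given a per-port cost.
int BufferedPortOverheadCycles(int numBufferedPorts, int cyclesPerPort)
{
    return numBufferedPorts * cyclesPerPort;
}
```

Using the Pro Tools 10.2 figure quoted above, six separate buffered ports would cost BufferedPortOverheadCycles(6, 5) cycles per callback, against BufferedPortOverheadCycles(1, 5) for the single consolidated port. The consolidated packet is also still well under the 128-byte minimum transfer size.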
TI Shell characteristics: Instance allocation
Multi-shell packing
With a few exceptions, AAX DSP plug-ins will share DSPs with other plug-ins. This occurs transparently to the plug-in due to the fact that all system resource management is handled by the TI Shell.
When a new plug-in instance is created, the TI Shell and AAX host will attempt to intelligently allocate it to a DSP based on both memory and CPU resource requirements. If one plug-in on the chip requires a large amount of memory and very few processing cycles, it may be packed with another plug-in that does not require much memory but that is very CPU intensive.
The exceptions to this model are plug-ins that use DMA, register for a background processing callback, register a maximum number of instances per chip, or use a processor affinity constraint when reporting CPU requirements. With the exception of plug-ins that use processor affinity, these plug-ins will receive dedicated DSPs to which only additional instances of the same plug-in type will be added.
- Host Compatibility Notes:
- Beginning with Pro Tools 10.2, the TI shell supports a "processor affinity" property, which indicates that a DSP ProcessProc should be preferentially loaded onto the same DSP as other instances from the same DLL binary. This is a requirement for some designs that must share global data between different processing configurations.
Note that this property should only be used when absolutely required, as it will constrain the DSP manager and reduce overall DSP plug-in instance counts on the system.
DSP Shuffles
A DSP shuffle will occur in Pro Tools when the engine must re-allocate DSP resources in order to make more processing power available. A shuffle will force the re-instantiation of the plug-in's DSP algorithm component, potentially on a new chip, while leaving the plug-in's host objects intact. During a shuffle, the engine will perform the following steps:
-
Disconnect audio from an effect
-
Call instance initialization with the 'removing instance' flag on the old location
-
Repeat for all instances of all DSP Effects in the system
-
Load the effect in the new location
-
Re-send the last packets to all data-in ports
-
Call private data init for any private data
-
Call instance init with the 'adding instance' flag, in the new location
-
Begin audio processing
-
Reconnect audio
-
Repeat the instantiation and connection process for all instances of all DSP Effects in the system
Note that the system may perform some audio processing with each new instance before all of the Effect instances in the system have been re-instantiated.
Additional TI Shell services
Background processing
AAX plug-ins may request idle time from the main TI Shell thread. This results in a true idle context callback which can be used for non-critical background processing tasks on the DSP. This facility restricts the DSP to only allocate plug-in instances of the same type.
A plug-in's background processing callback is not provided with a reference to the plug-in's data structures and must therefore access plug-in data via global variables. The background process will be interrupted by system events and the audio render callback. For more information and an example of how to create a plug-in that relies on background processing, see \ExamplePlugins\DemoGain_Background in the SDK.
Requirements for HDX DSP Plug-Ins
Plug-in description
To support HDX DSP platforms, a plug-in must add a TI ProcessProc (real-time processing entrypoint) for each of its algorithms. This is done via a call to AAX_IComponentDescriptor::AddProcessProc_TI(), which is parametrized with the names of both the algorithm's TI DLL and its exported entrypoint.
At minimum, the TI ProcessProc requires the following AAX Properties:
Performance measurement and reporting
In order to determine each algorithm's resource requirements, the host collects cycle count information from the plug-in via the plug-in's Describe callback. Each plug-in Effect is responsible for correctly reporting its algorithms' cycle counts for each accelerated platform that it supports. For plug-ins that use DMA or background threads, a maximum per-chip instance count is also required.
- Note
- All reported values must represent the algorithm's worst case performance.
Each of these values is reported as a property of a given algorithm ProcessProc and is provided by the plug-in via AAX_IComponentDescriptor::AddProcessProc_TI(). If an effect does not report its cycle count usage then it will be limited to a single instance per TI chip. This can be useful during development, but is not a supported mode for general use; all shipped plug-ins must correctly report their cycle requirements.
The DigiShell utility can be used to accurately measure plug-in cycle count requirements. For more information about DigiShell, see DSH Guide.
Shared vs. per-instance cycles
Because a single call into a plug-in is used to process multiple instances of that effect on that chip, two cycle count properties must be reported for each TI algorithm:
-
AAX_eProperty_TI_SharedCycleCount
This property describes the algorithm's one-time processing overhead that doesn't change as instances are added to a chip.
-
AAX_eProperty_TI_InstanceCycleCount
This property describes the additional cycle counts that each instance adds to the base shared overhead.
Many plug-ins exhibit different performance characteristics for both of these metrics depending on the plug-in's state. When reporting a plug-in's shared and per-instance cycle count requirements it is important to ensure that the reported values are the maximum possible requirements of the algorithm.
Often a plug-in will experience its worst-case per-instance processing load in one configuration and its worst-case shared processing load in another configuration. In this situation, the plug-in's reported cycle count requirements should reflect the state in which the sum of the two metrics is highest.
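One way to apply this rule is to measure both metrics in each plug-in state of interest and keep the pair with the highest sum. The sketch below assumes hypothetical measurement values and names; it is not an SDK facility.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One measurement of an algorithm in a particular plug-in state.
struct SCycleMeasurement
{
    int mSharedCycles;
    int mInstanceCycles;
};

// Returns the measurement whose shared + per-instance sum is highest.
// If several states tie, the first one measured is kept.
SCycleMeasurement WorstCaseBySum(const std::vector<SCycleMeasurement>& states)
{
    assert(!states.empty());
    std::size_t worst = 0;
    for (std::size_t i = 1; i < states.size(); ++i)
    {
        const int sum = states[i].mSharedCycles + states[i].mInstanceCycles;
        const int worstSum = states[worst].mSharedCycles + states[worst].mInstanceCycles;
        if (sum > worstSum)
            worst = i;
    }
    return states[worst];
}
```

For example, given states measured at (100, 40), (60, 90), and (120, 20) shared and per-instance cycles, the second state has the highest sum (150) and is the pair that would be reported.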
It is common practice to leave AAX_eProperty_TI_InstanceCycleCount and AAX_eProperty_TI_SharedCycleCount undescribed while a DSP plug-in is being developed and debugged. This is acceptable, although in this case a single instance of the plug-in will require a whole chip. The AAX SDK example plug-ins implement this using the AAX_TI_BINARY_IN_DEVELOPMENT macro. When this macro is defined, the cycle count properties are turned off for the plug-in.
Measuring shared cycles
Measuring shared cycle counts requires instantiating multiple instances of an effect and observing how the processing time changes as instances are added. The shared and instance cycle counts are then calculated by performing a linear regression of the measured uncached cycle counts against the number of plug-in instances on the chip: the intercept gives the shared cycle count and the slope gives the per-instance cycle count.
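The regression described here can be sketched as an ordinary least-squares line fit, where the total cycle count is modeled as shared + instance * n. This is only an illustration of the calculation, with hypothetical names and input data; it does not describe how DigiShell performs the measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of fitting totalCycles = shared + instance * instanceCount.
struct SCycleFit
{
    double mSharedCycles;    // intercept of the fitted line
    double mInstanceCycles;  // slope of the fitted line
};

// Ordinary least-squares fit over (instance count, total cycles) pairs.
SCycleFit FitCycleCounts(const std::vector<double>& instanceCounts,
                         const std::vector<double>& totalCycles)
{
    assert(instanceCounts.size() == totalCycles.size());
    assert(instanceCounts.size() >= 2);

    const double n = static_cast<double>(instanceCounts.size());
    double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
    for (std::size_t i = 0; i < instanceCounts.size(); ++i)
    {
        sumX += instanceCounts[i];
        sumY += totalCycles[i];
        sumXX += instanceCounts[i] * instanceCounts[i];
        sumXY += instanceCounts[i] * totalCycles[i];
    }

    // The denominator is zero only if every instance count is identical.
    const double denom = n * sumXX - sumX * sumX;
    assert(denom != 0.0);

    SCycleFit fit;
    fit.mInstanceCycles = (n * sumXY - sumX * sumY) / denom;
    fit.mSharedCycles = (sumY - fit.mInstanceCycles * sumX) / n;
    return fit;
}
```

With made-up measurements of 150, 250, 350, and 450 cycles for 1 to 4 instances, the fit yields 50 shared cycles and 100 cycles per instance. Remember that the reported values must still represent the worst case, so measurements should be taken in the plug-in's most expensive state.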
Note that these values will differ between debug and release builds of an algorithm, so a plug-in's describe function should report the correct cycle count values based on the relevant build configuration.
DigiShell includes the ability to measure shared cycle counts using the DAE.cyclesshared command. For more information about performance profiling using DigiShell, see Cycle count performance test.
- Note
- HDX DSP requires reporting of an algorithm's worst-case cycle counts.
-
Because HDX PCIe and Pro Tools | Carbon use the same HDX DSP platform, either product may be used to take plug-in cycle count measurements.
DMA and background thread performance reporting
For algorithms that use DMA or background thread facilities, the maximum number of algorithm instances that will fit on a chip is difficult to predict from cycle counts alone. Due to the asynchronous behavior and limited capacity of the DMA system, the DMA system may begin to miss its deadlines before the CPU is fully loaded. In addition, due to differences in background processing requirements between algorithms, an effect's background process may begin to miss its deadlines and be starved before the interrupt-time audio processing is at capacity. Plug-ins that use these facilities must therefore report the maximum number of instances that will run reliably at a given sample rate, in addition to reporting their shared and per-instance cycle counts as above.
Some other plug-ins may also wish to report the maximum number of instances that will run reliably at each sample rate. For example, plug-ins that use a lot of host/DSP data bandwidth may need to limit the number of instances per DSP chip in order to run successfully on Pro Tools | Carbon.
Maximum reliable instance counts are reported using an additional property, AAX_eProperty_TI_MaxInstancesPerChip. A plug-in should register separate components for the following three sample rate ranges in order to register distinct values for this property:
-
Sample rates from 42kHz to 50kHz
-
Sample rates from 84kHz to 100kHz
-
Sample rates from 168kHz to 200kHz
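The three ranges listed above can be expressed as a small classification helper, for example to choose which measured instance limit applies to a given session rate. The enum and function names are hypothetical and are not SDK identifiers.

```cpp
// The three sample-rate ranges for which separate components are registered.
enum ESampleRateRange
{
    eSampleRateRange_1x,   // 42 kHz to 50 kHz
    eSampleRateRange_2x,   // 84 kHz to 100 kHz
    eSampleRateRange_4x,   // 168 kHz to 200 kHz
    eSampleRateRange_None  // outside all three ranges
};

// Maps a sample rate in Hz to the range that contains it.
ESampleRateRange SampleRateRangeFor(double sampleRateHz)
{
    if (sampleRateHz >= 42000.0 && sampleRateHz <= 50000.0)
        return eSampleRateRange_1x;
    if (sampleRateHz >= 84000.0 && sampleRateHz <= 100000.0)
        return eSampleRateRange_2x;
    if (sampleRateHz >= 168000.0 && sampleRateHz <= 200000.0)
        return eSampleRateRange_4x;
    return eSampleRateRange_None;
}
```

Both 44.1 kHz and a pulled-up 50 kHz session fall into the first range, so a single instance limit measured at the top of that range covers both.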
Notes regarding DMA and background thread performance reporting:
-
Because the number of instances will decrease as sample rate increases, the plug-in must be tested at the highest available pulled-up sample rate (i.e. 50kHz instead of 48kHz) in each of these three ranges.
-
On the HDX platform, effects that use DMA or background threads will not be mixed with effects of other types on a given chip.
-
The maximum number of instances per DSP cannot be measured via DSH in these cases, so careful listening tests must be manually performed in order to determine whether a certain number of instances of a DMA or background-enabled plug-in actually operate correctly on a DSP.
Dynamic resource usage
All resources used by an AAX DSP plug-in algorithm are considered static. Plug-ins may not dynamically change the amount of memory or DSP cycles that are allocated to them after these metrics are provided in Describe.
The ability to dynamically change DSP cycle count requirements at run time is provided in the AAX SDK but is not currently supported by any host.
Plug-in compilation and packaging
Exported symbols
Each HDX DSP algorithm (ELF DLL) may contain multiple entrypoints. A single DLL may be used for all of your plug-in's entrypoints and program code, or you may divide your plug-in's entrypoints and program code between multiple DLLs.
Your plug-in must export one "C"-style callback for each algorithm ProcessProc that your plug-in registers. This entrypoint must conform to the standard AAX real-time algorithm callback prototype:
#include "elf_linkage_aax_ccsv5.h"

extern "C"
TI_EXPORT
void
MyEffect_AlgorithmProcessFunction(
    SMyEffect_Alg_Context * const inInstancesBegin [],
    const void * inInstancesEnd);
Listing 1.1: The standard AAX real-time algorithm callback prototype
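For orientation, a minimal sketch of a callback body that matches this prototype is shown below. The context structure and its fields are hypothetical stand-ins for a real plug-in's context, and TI_EXPORT is defined as empty when the SDK's ELF linkage header is not included, so that the sketch also compiles on a host compiler.

```cpp
// On the DSP, TI_EXPORT comes from the SDK's ELF linkage header. Define it
// as empty here so that this sketch builds without the SDK.
#ifndef TI_EXPORT
#define TI_EXPORT
#endif

// Hypothetical context structure; field names are illustrative only.
struct SMyEffect_Alg_Context
{
    const float*  mGainP;        // coefficient delivered through a data port
    float* const* mInputPP;      // array of input channel buffers
    float* const* mOutputPP;     // array of output channel buffers
    const int*    mBufferSizeP;  // number of samples per callback
};

extern "C"
TI_EXPORT
void MyEffect_AlgorithmProcessFunction(
    SMyEffect_Alg_Context* const inInstancesBegin[],
    const void* inInstancesEnd)
{
    // A single call processes every instance of the effect on the chip.
    for (SMyEffect_Alg_Context* const* walk = inInstancesBegin;
         walk < inInstancesEnd; ++walk)
    {
        SMyEffect_Alg_Context* const instance = *walk;

        // Copy the coefficient once so that it stays fixed for the whole
        // buffer even if the port's data changes during execution.
        const float gain = *instance->mGainP;
        const int numSamples = *instance->mBufferSizeP;
        const float* const in = instance->mInputPP[0];
        float* const out = instance->mOutputPP[0];

        for (int i = 0; i < numSamples; ++i)
            out[i] = gain * in[i];
    }
}
```

The host passes the begin and end of an array of context pointers, which is why per-call setup cost is shared between instances while the inner loop cost is paid once per instance.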
Packaging
The ELF DLLs for an AAX DSP plug-in must be placed in the ./Content/Resources directory within the plug-in bundle.
TI Development Tools
Development for TI algorithms is primarily performed in TI's Code Composer Studio. Code Composer Studio (CCS) is a full-featured, Eclipse-based IDE providing JTAG hardware debugger support, a hardware simulator, and a suite of profiling tools. Most importantly, CCS includes an excellent C compiler that is capable of providing highly optimized DSP instructions without too much tuning.
- Note
- As of this writing, Code Composer Studio for Mac does not support the C6000 series processor. CCS for Windows is required for AAX DSP plug-in development. See MacOS Host Support CCSv7 on the Texas Instruments wiki for current compatibility information.
Code Composer Studio
The AAX SDK supports Code Composer Studio versions 4 ("CCSv4") and higher ("CCSv5", etc.), with hardware debugging support beginning in version 4.2. As of the writing of this documentation, CCS versions 4, 5, and 7 have been tested by Avid.
- Note
- This documentation was originally written for CCSv4 and was later updated with instructions for updating from CCSv4 to CCSv5. Versions 5 and higher use a different project file format from version 4; when this documentation describes changes required for version 5 then these changes will also be required by other later versions which use this new project format.
Installation
-
Download and install the latest Code Composer Studio from TI's website.
- Note
- Windows 10 requires Code Composer Studio version 6.1.3 or higher
-
As of Code Composer Studio version 7, TI does not charge for licenses. You can simply download the tool and start using it. The end user license agreement has also changed to a simple TSPA-compatible license. For more information see the TI web site.
-
The default installation will work fine, but a custom install will be smaller. You only need support for the C6000 chipset and the Spectrum Digital JTAG drivers, so you can deselect all the other chipsets and JTAG drivers.
-
Go to TI's Code Generation Tools page. You will need to log in.
-
Download and install the C6000 Code Generation Tools v7.0.x or later, using the typical installation settings. For AAX DSP development you will only need support for the C6000 chipset and, if you will be using a hardware debugger, for the Spectrum Digital JTAG drivers, so you may deselect all the other chipsets and JTAG drivers.
-
Launch CCS and go to Help > Install New Software...
-
In the opened dialog select "Code Generation Tools Updates" in the "Work with:" drop-down list.
-
Select "TI Compiler Updates" > "C6000 Compiler Tools [version]".
-
Press Next and continue installation using the "typical" installation settings.
As of the publishing of this version of the AAX SDK Avid is internally using v7.4.6. Avid has tested 7.4.4 and 7.4.6, but we assume that later revisions will work as well. The latest CGTools version available as of this writing is v7.4.21.
For more information about configuring your CCS workspace with CGTools v7.4.x, see Workspace setup
Workspace setup
The idea of a CCS workspace is similar to a Visual Studio solution file. Note that workspaces tend to store absolute paths and developer-specific info, so you may wish to avoid checking them in to your source control server.
Setting up workspace-global macros
To set up workspace global macros:
-
When you open CCS for the first time, select a directory for your "workspace". As mentioned above, we recommend that this be outside of your source tree.
- Note
Note that you cannot reuse your Code Composer Studio workspace after updating to a later version. In particular, we have found that CCSv4 workspaces are incompatible with CCSv5. After updating your system to a later Code Composer Studio version you must create a new workspace and import your existing projects into this new workspace.
-
Go to File > Import... and select Code Composer Studio > Build Variables (CCS > Managed Build Macros in CCSv4.) Click Next.
-
Browse to TI/Common/macros.ini in your AAX SDK directory and click Finish.
-
This will define an "SDK_SOURCE_ROOT" Linked Resource path variable and Managed Build macro, which associates the CCS workspace with a single AAX SDK installation.
- Note
- A side effect of this is that you cannot use projects from multiple distinct AAX SDK installations in the same CCS workspace.
-
To verify that the correct path has been set, go to Window > Preferences... and look in General > Workspace > Linked Resources, and C/C++ > Build > Build Variables (C/C++ > Managed Build > Macros for CCSv4.)
Importing projects into your workspace
To import projects into your workspace:
-
In the IDE, go to Project > Import Existing CCS/CCE Eclipse Project
-
In Select search-directory, select the root of your AAX SDK installation
-
The projects in the resulting Projects list will automatically be selected
-
Click Finish, and then wait while the projects are imported.
In order to import CCSv4 projects into later versions of Code Composer Studio it is necessary to add a .cdtproject file to the project. If you don't have this file in your project, then you can copy it from any other existing project which was created using CCSv5 or later. Otherwise you will most likely see something similar to this error:
"Error: Import failed for project 'xxxx' because its meta-data cannot be interpreted."
If you try to build this newly imported CCSv4 project in a later version of Code Composer Studio then you will get the warning:
"This project was created using a version of compiler that is not currently installed: 7.0.5 [C6000]. Another version of the compiler will be used during build: 7.4.6. Please install the compiler of the required version, or migrate the project to one of the available compiler versions by adjusting project properties."
This warning may be cleared by changing Properties > General > Compiler Version from TI v7.0.x to the current version (e.g. TI v7.4.x). After that, the "Output format" field, which sits next to the "Compiler version" field and is typically grayed out, will become active. Choose "eabi (ELF)" there; otherwise the build will fail with errors:
-
"--dynamic=lib not supported when producing TI-COFF output files"
-
"--export=_auto_init_elf not supported when producing TI-COFF output"
- Note
- After the project has been converted and builds successfully, re-measure its cycle count, because the count may change. It will most likely decrease compared to the version built with CCSv4, but that is not guaranteed. The size of the DLL may also increase, in which case you may need to reduce code size in order to instantiate the plug-in properly.
Creating new projects
New project setup
Use the following settings in the "New Project..." wizard. Defaults are in italics.
-
Project Type:
C6000
-
Output type:
Executable
-
Device Variant:
Generic C67x+ Device
-
Device Endianness:
little
-
Code Generation Tools:
7.4.6
or later (7.0.5
for CCSv4)
-
Output format: eabi (ELF) (in CCSv4 this field will be grayed out.)
-
Linker Command File:
CommonPlugIn_LinkerCmd.cmd
(see note below)
-
Runtime Support Library: <automatic>
- Note
- You can edit the Linker Command File setting to use the
SDK_SOURCE_ROOT
macro by manually editing the project's .project XML file or by adding the file to your project using a relative path. See the SDK sample plug-in projects for an example.
Recommended settings for AAX plug-in projects
Tool Settings
C6000 Compiler
Include Options
-include_path "${SDK_SOURCE_ROOT}/Interfaces"
-include_path "${SDK_SOURCE_ROOT}/[Plug-in directory]"
The SDK_SOURCE_ROOT
macro is defined via the macros.ini file, located in the SDK's /TI/CCSv4 directory. If you encounter errors using this macro, import the file using File > Import... > CCS > Managed Build Macros.
Tool Settings
C6000 Compiler
Command Files
-cmd_file "${SDK_SOURCE_ROOT}\\TI\\CCSv4\\CommonPlugIn_CompilerCmd.cmd"
This file contains additional compiler commands that should be common to all AAX plug-in projects
Tool Settings
C6000 Linker
Basic Options
-o "${ConfigDir}/${PackageName}/Contents/Resources/${ProjName}.dll"
This path will ensure that your compiled TI DLL is placed in the appropriate location inside your AAX plug-in bundle.
Tool Settings
C6000 Linker
Runtime Environment
(No "Initialization model" options set)
Build Settings
Artifact name
${ConfigDir}/${PackageName}/Contents/Resources/${ProjName}
This path will ensure that your compiled TI DLL is placed in the appropriate location inside your AAX plug-in bundle.
Build Settings
Artifact extension
dll
AAX TI libraries should use the .dll extension
AAX TI libraries should use the Elf binary parser only
Macros
Project
User Macros
ConfigDir = ${OutDir}/${ConfigName}
IntDir = ${ConfigDir}/int/${PackageName}/TI/${ProjName}
OutDir = ${ProjDirPath}/../../WinBuild
PackageName = [Plug-in name]
These macros are used by the other settings here to ensure proper path set-up and artifact naming. Don't worry that ConfigName
shows up as undefined - it will be defined as Debug/Release at compilation.
Recommended Release configuration settings
Tool Settings
C6000 Compiler
Basic Options
-symdebug:none
-O3
Tool Settings
C6000 Compiler
Predefined Symbols
-define=NDEBUG
Tool Settings
C6000 Compiler
Optimizations
-os
-on2
-op3
Tool Settings
C6000 Compiler
Assembler Options
-keep_asm
Other useful project settings
Tool Settings
C6000 Compiler
Predefined Symbols
-define _DEBUG
This option is useful for differentiating cycle count reporting for Debug vs. Release builds.
Tool Settings
C6000 Compiler
Directory Specifier
-ft "${IntDir}"
-fr "${IntDir}"
-fs "${IntDir}"
Useful for collecting intermediate files
Tool Settings
C6000 Linker
Basic Options
-m "${IntDir}/${ProjName}.map"
Useful for placing the map file alongside all other intermediates
Tool Settings
C6000 Linker
File Search Path
-l (nothing)
You can exclude libc.a, which is included by default, from this option unless you require C library features.
Adding files and folders
In CCS, dragging files into the project, using "Add Files to Project...", or using "Link Files to Project..." will either copy the file into the project directory or create an absolute path to the file. This is usually not the desired behavior. Use the following steps to add a file using a relative path:
-
Right click on the project you'd like to add files to, and select New > File (NOT "Source File" or "Header File").
-
Click "Advanced >>".
-
Check the box that says "Link to the file in the system". Click "Variables..."
-
Select the appropriate variable (usually either
SDK_SOURCE_ROOT
or SOURCE_ROOT
) and click "Extend..."
-
Find the file you want to add. Click OK. Click Finish.
Note that, when adding folders, everything in the folder will be built by default. You can exclude files to work around this behavior.
Settings for exported symbols
-
There is a compiler option in Code Composer Studio that will add an underscore to the exported entrypoint's name. We recommend keeping this option disabled in order to avoid ambiguity between the exported symbol name and the function name as it appears in your source code.
-
If you encounter undefined symbol errors when linking to a DSP library that uses a C-style interface, add the extern "C" linkage specifier before the library's function prototypes. This should resolve the majority of such linker errors.
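As a minimal sketch of the extern "C" advice above: the library function name ExampleLib_Scale is hypothetical and stands in for whatever C-style prototypes your DSP library exposes. The stand-in definition is included only so that the sketch links on its own.

```cpp
#include <stdint.h>

// Hypothetical C-style library interface. In a real project these prototypes
// would come from the library's header. Wrapping them in extern "C" tells the
// C++ compiler to look for unmangled C symbol names at link time.
extern "C" {
    int32_t ExampleLib_Scale(int32_t inValue, int32_t inFactor);
}

// Stand-in definition so that this sketch links by itself. Normally the
// implementation lives inside the pre-built C library.
extern "C" int32_t ExampleLib_Scale(int32_t inValue, int32_t inFactor)
{
    return inValue * inFactor;
}
```

Without the linkage specifier, the C++ compiler would reference a mangled name that the C library does not export, which is the usual source of the undefined symbol errors.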
The TMS320C6000 C++ compiler
One of the primary goals of AAX is to provide a platform-agnostic development architecture in which products can easily be developed and re-used across a wide variety of platforms. However, it is still occasionally necessary to write platform-specific code. This section will document methods for producing code that is specific to the TI C6727 platform using the TMS320C6000 C++ compiler.
C++ standard support
The TMS320C6000 compiler supports C++ as defined in the ISO/IEC 14882:1998 standard. The exceptions to the standard are as follows:
-
Complete C++ standard library support is not included. C subset and basic language support is included.
-
These C++ headers for C library facilities are not included:
-
<clocale>
-
<csignal>
-
<cwchar>
-
<cwctype>
-
<ciso646>
-
These C++ headers are the only C++ standard library header files included:
-
No support for
bad_cast
or bad_typeid
is included in the typeinfo header.
-
Run-time type information (RTTI) is disabled by default. RTTI can be enabled with the -rtti compiler option.
-
The
reinterpret_cast
operator does not allow casting a pointer to member of one class to a pointer to member of another class if the classes are unrelated.
-
Two-phase name binding in templates, as described in temp.res and temp.dep of the standard, is not implemented.
-
The export keyword for templates is not implemented.
-
A typedef of a function type cannot include member function cv-qualifiers.
-
A partial specialization of a class member template cannot be added outside of the class definition.
Predefined environment symbols
The following symbols are predefined by the compiler on the TI architecture, and should be used in code concerned with cross-platform support:
-
_TMS320C6X
Identifies that the chip is a C6000 variant. This is the symbol that we commonly use to distinguish whether code is being compiled for AAX-Native (Mac/Windows) or AAX-TI.
-
_TMS320C6700_PLUS
Identifies that the chip is a C6700-plus variant
Although you should not require them for AAX development, equivalent assembly predefines are as follows:
-
.TMS320C6X
Identifies that the chip is a C6000 variant
-
.TMS320C6700_PLUS
Identifies that the chip is a C6700-plus variant
Loop controls
The TI compiler supports several pragmas that can be used to give the compiler additional information about loops.
-
#pragma MUST_ITERATE( min, max, multiple )
This pragma helps the compiler optimize loops. min is the minimum number of times the loop will execute, max is the maximum number of times the loop will execute, and multiple indicates that the number of iterations is always a multiple of the given value.
-
#pragma PROB_ITERATE( min , max )
If extreme cases prevent the use of MUST_ITERATE
, PROB_ITERATE
allows you to specify the usual number of times a loop executes. For example, PROB_ITERATE
could be applied to a loop that executes for eight iterations in the majority of cases but that sometimes may execute more or fewer than eight iterations.
-
#pragma UNROLL( n )
Helps the compiler use SIMD instructions, where n
is the unrolling factor. By specifying UNROLL(1)
you can prevent the compiler from automatically unrolling a loop. In general, we recommend using MUST_ITERATE
instead unless you have specifically identified a situation where manually unrolling a loop improves performance.
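As an illustration of the loop pragmas above, here is a minimal sketch of a gain loop annotated with MUST_ITERATE. The function name and the iteration bounds (4 to 64 samples, always a multiple of 4) are assumptions made for this example. The pragma is guarded so that host compilers never see it.

```cpp
#include <stdint.h>

// Hypothetical gain loop used only to show where the pragma goes.
inline void ExampleApplyGain(const float* inBuffer, float* outBuffer,
                             int32_t inNumSamples, float inGain)
{
#if defined(_TMS320C6X)
    // Assumption for this sketch: the buffer length is always between
    // 4 and 64 samples and is always a multiple of 4.
    #pragma MUST_ITERATE(4, 64, 4)
#endif
    for (int32_t i = 0; i < inNumSamples; ++i)
    {
        outBuffer[i] = inBuffer[i] * inGain;
    }
}
```

The information given to the compiler must hold for every call. If the loop can ever run outside the stated bounds, the generated code may behave incorrectly.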
DigiShell test tool (DSH)
DigiShell is a software tool that provides a general framework for running tests on Avid audio hardware. As a command-line application, DigiShell may be driven as part of a standard, automated test suite for maximum test coverage. DSH supports loading all types of AAX plug-ins including Native and DSP, and is especially useful when running performance and cancellation tests of AAX-TI types. DigiShell is included in Pro Tools Development Builds as dsh.exe
(Windows) or as dsh
in the CommandLineTools
directory (Mac).
More information on the DSH test tool can be found in
DSH Guide.
Hardware Debugging
Requirements
Relocatable ELF DLLs (TI algorithms) can be debugged with some help from the DIDL loader, the TI Shell Manager, and a script called DLLView_Elf_Avid.js.
These are the minimum requirements for hardware debugging for TI plug-ins:
-
Code Composer Studio version 4.2 or later
-
XDS510 hardware debugger
-
JTAG-enabled HDX card
We recommend using Spectrum Digital's XDS510 USB Plus JTAG Emulator, as it is the only one our internal developers have used and tested in-house. Both Spectrum Digital and TI have useful technical reference/installation guides, both of which can be found on the AAX Developer Forum under the 'Development Tools' discussion.
How it works
The ridl ELF loader inside DIDL stores a module and segment list containing the paths of all loaded modules and where their segments are loaded. The TI Shell Manager gets a serialized version of this table and loads it to a block of external memory on the chip at a known location. The DLLView_Elf_Avid.js script queries this memory via the debugger and extracts the paths of the modules and the ELF segment load locations, which it then passes on to the GEL_SymbolAddELFRel
scripting console command (new to CCSv4.2). You can also use that command directly at the console.
Connecting a JTAG Emulator
A JTAG-enabled HDX development card includes a "riser" PCB section extending about a centimeter above the production card PCB. This riser includes two JTAG connectors. The two connectors correspond to the two banks of 9 DSPs on the HDX card. Assuming that you are instantiating your plug-in for debugging on the first available DSP, you will want to connect your JTAG emulator to the connector that is closest to the card's user-visible ports. This connector corresponds to the first 9 DSPs on the card.
Linking to TIShell.out
Hardware debugging, as well as several other debugging facilities, requires that the DSP plug-in project is linked to TIShell.out in Code Composer Studio.
To link a plug-in project to TIShell.out, follow these steps:
-
Open the plug-in project's properties window and navigate to the C/C++ Build > Tool Settings > C6000 Linker > File Search Path properties pane.
-
Add "TIShell.out" to the "Include library file" (
-l
) property list.
-
Under "Add <dir> to library search path" (
-i
), add the file path of the Pro Tools build you will be using to test the plug-in. This directory should already include the build's TIShell.out file.
-
Repeat this process for each Configuration of the plug-in project that you will be testing.
-
Add "[path to AAX SDK root]\\TI" to the project's list of source file include directories
Adding the HDX Target Descriptor File
To add the HDX Target Descriptor File:
-
In the IDE, go to Window > Preferences, CCS > Debug. Point the "Shared target configuration directory" to /TI/Common in your AAX SDK source tree
-
In the IDE, go to Window > Show View > Target Configurations.
-
Click refresh if you don't see the configuration file
-
Right click Raven_C672x_XDS510_USB.ccxml, and click "Set as Default".
Setting up the DLLView script
Once you have successfully installed the XDS510, you will have to do a little bit of setup with CCS. Before starting this process, verify that you are running CCSv4.2 or later and the C6000 code generation tools v7.4 or later (or 7.0.5 for CCSv4). CCS should recognize the installed emulator and prompt you to download the necessary drivers. Once that is complete, set up your DLLView script.
To set up the DLLView script:
-
In the IDE, open the Scripting Console under View > Scripting Console
-
At the Scripting console, type one of the following to load the DLLView script (insert your own source tree path, and make sure to load the version that corresponds to your installed CCS version):
Code Composer Studio 4: loadJSFile "[PATH TO AAX SDK]/TI/CCSv4/dllView_Elf_Avid.js" true
Code Composer Studio 5 and later: loadJSFile "[PATH TO AAX SDK]/TI/CCSv5/dllView_Elf_Avid.js" true
You should now see a new menu item under the Scripts menu: "DLLView -Load Pro Tools Plug-In Symbols". This should load every time CCS starts.
Loading Symbols for Debugging
You will need to get your code loaded and running on the TI before you load symbols. You can do this directly through Pro Tools, or by using our DigiShell test tool. If using the DigiShell test tool, load the DAE dish and then a plug-in via the following commands:
load_dish DAE
Loads the DAE dish
run
Lists available plug-ins with their index and spec
run<index>
Instantiates the <index> plug-in
Use the DLLView script to load symbols for ELF DLLs. After setting up the DLLView script and connecting to the desired chip in the Debug pane, run the "DLLView -Load Pro Tools Plug-In Symbols" script from the Scripts menu in Code Composer Studio.
- Note
- The chip will need to be Suspended in the debugger in order to load symbols.
To load symbols for debugging:
-
In CCS, Launch the TI Debugger (Target > Launch TI Debugger)
-
Connect the debug target to the appropriate chip
-
Suspend the chip
-
Run Scripts > DLLView -Load Pro Tools Plug-In Symbols.
- Note
- This script can take a moment to load; look at the Scripting Console to view its progress if you like
-
This script may print a warning about TIShell.out not existing. This warning is benign for plug-in debugging since the TIShell symbols are not required in this case.
This will load symbols for all symbol-rich modules running on the chip(s) connected to the debugger. If you load or unload plug-ins after this, you can simply repeat the "DLLView -Load Pro Tools Plug-In Symbols" command, which will synchronize the debugger with the current configuration.
- Note
- When running a plug-in in Pro Tools, the first DSP chip is reserved for the HDX mixer. Therefore the first available DSP chip for plug-in instantiation is
C672x_1
. Under DSH, the first available DSP chip is C672x_0
.
Breaking on first entry into algorithm
To break on the first entry into the plug-in's processing routine, use the manual single-buffer processing mode in DSH:
piproctrigger manual
run<index>
Attach debugger, suspend the chip, load symbols, set breakpoint, resume
piproctrigger auto
Breaking in the on-chip algorithm initialization callback
It is not currently possible to hit a breakpoint in the optional on-chip algorithm initialization callback for a plug-in. If you need to troubleshoot this callback then you should use tracing to print debug information to a log file.
Tracing
Avid's AAX DSP platforms provide tracing functionality based on Avid's
DigiTrace tool.
To enable trace logging for TI plug-ins, use the
AAX_TRACE or
AAX_TRACE_RELEASE macros defined in
AAX_Assert.h
. A separate macro,
AAX_ASSERT, is also available for conditional tracing. These macros are cross-platform and will function whether the algorithm is running on the TI or on the host.
Tracing requirements
-
The AAX_ASSERT and AAX_TRACE macros are debug-only and will not provide tracing output from release builds of your plug-in. AAX_TRACE_RELEASE may be used for tracing in both debug and release configurations.
-
These macros require that the
DTF_AAXPLUGINS
facility is enabled in the DigiTrace configuration file. You can toggle this facility to enable or disable AAX algorithm-level tracing.
-
In order for tracing to be successful on TI platforms, your plug-in's ELF DLL must dynamically link against TIShell.out, a component that is installed alongside the Pro Tools application. This file includes the 'glue' that is required in order for the linker to resolve the DigiTrace entrypoint symbol in the DLL.
To link your plug-in project to TIShell.out in Code Composer Studio, follow the steps listed in
Linking to TIShell.out .
Tracing example
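int32_t
AAX_CALLBACK
MyExamplePlugIn_AlgorithmInit ( SExample_Alg_Context const * /* context, unused here */, AAX_EComponentInstanceInitAction inAction )
{
    // Log which initialization action triggered this callback
    AAX_TRACE_RELEASE( kAAX_Trace_Priority_Normal,
        "MyExamplePlugIn_AlgorithmInit called for action : %d",
        inAction );
    return 0;
}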
Listing 2: Adding trace code on TI
Usage notes
-
When running on the DSP, the actual handling of each tracing call occurs in a separate thread. This can lead to incorrect data reporting if volatile data, such as a pointer to an audio sample, is passed in to the tracing statement as a parameter.
-
DSP tracing is most reliable when using debug TI builds and when all TI compiler optimizations have been disabled
-
Known and resolved issues with DSP tracing are logged on the Known Issues page
Testing in Pro Tools
The System Usage window
The System Usage window in Pro Tools includes some features specifically targeted at testing DSP plug-ins, and particularly for testing shuffle events. Starting in Pro Tools 10, the System Usage window includes the following test features:
-
Shift + Drag DSP Meter - This shuffles everything on the chosen chip to another chip, which allows you to quickly test shuffle for a given chip.
-
Hover mouse over DSP - Presents a tooltip to show the running plug-ins on a chip
-
Cmd+Option+Shift Hover - Detailed debugging tooltip info
-
Cmd+Option+Shift Click - Forces a full shuffle of all chips / cards
-
Click on empty chip - Reserves a DSP to prevent allocation on that chip
DSP information tooltip
Pro Tools can display additional information for DSP plug-ins using some debug tooltips that are hidden in the plug-in window header and the System Usage window.
The tooltip in the plug-in window header displays information about the particular plug-in instance that is currently shown in the window. To display this tooltip, hold Command-Option-Shift (Mac) or Control-Alt-Shift (Windows) and hover the mouse cursor over the DSP > Native button in the plug-in header.
The tooltip in the System Usage window displays usage information for each DSP chip in the system. You can reveal this tooltip for a particular chip by mousing over the chip's usage meter while holding Command-Option-Shift (Mac) or Control-Alt-Shift (Windows). This tooltip shows the chip's total allocated cycles, internal memory, and external memory.
The information in these tooltips is generally targeted at systems-level debugging, but can prove useful for some plug-in troubleshooting as well.
Figure 1: DSP tooltip in the Pro Tools plug-in window header.
Figure 2: DSP tooltip in the Pro Tools System Usage window.
Common Issues with TI Development
Data structure compatibility
AAX DSP plug-ins use a set of custom data structures to exchange information with the host. In order to preserve a consistent binary interface between the plug-in's host and algorithm, the layout of these structures must be identical on both platforms. Each structure must have the same size when compiled by both the host platform compiler and the TI DSP compiler, and any members that are referenced by both the host code and the DSP code must reside at the same offset within the struct on both platforms.
In order to satisfy this requirement, it is essential that an AAX plug-in's algorithm context structure and any other data structures that are passed between the host and the DSP use appropriate alignment. Data structures are usually aligned to 32-bit boundaries, and both Intel and TI compilers use identical struct alignment and packing for most cases. However, this behavior is not explicitly defined in the C standard.
Furthermore, different compilers may use different sizes for some built-in data types. It is therefore very important to use explicitly-sized types such as int32_t
and float
rather than ambiguous types such as bool
or int
. One particularly tricky data type is pointers, which may be compiled as 64-bit values on a 64-bit Intel system but as 32-bit values on the TI DSP.
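One way to catch layout mismatches early is to add compile-time checks next to each shared structure. The sketch below is an illustration only: the struct SExampleCoefs, its members, and the EXAMPLE_COMPILE_ASSERT macro are hypothetical and are not part of the AAX SDK. The macro uses a negative array size to force a compile error, which works without C++11 support.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical compile-time assertion: the typedef has an illegal negative
// array size, and therefore fails to compile, when the condition is false.
#define EXAMPLE_COMPILE_ASSERT(cond, name) typedef char name[(cond) ? 1 : -1]

// Hypothetical coefficient block shared between host and DSP code. Only
// explicitly-sized types are used.
struct SExampleCoefs
{
    int32_t mBypass;     // 0 or 1, used instead of bool
    float   mGain;
    int32_t mNumStages;  // used instead of int or long
};

// If either compiler lays the struct out differently, its build fails here
// instead of producing a plug-in that misbehaves at run time.
EXAMPLE_COMPILE_ASSERT(sizeof(SExampleCoefs) == 12, ExampleCheck_Size);
EXAMPLE_COMPILE_ASSERT(offsetof(SExampleCoefs, mGain) == 4, ExampleCheck_GainOffset);
EXAMPLE_COMPILE_ASSERT(offsetof(SExampleCoefs, mNumStages) == 8, ExampleCheck_StagesOffset);
```

Because the same header is compiled by both toolchains, the same expected values are verified on both sides of the interface.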
Here are some specific scenarios when an unexpected difference in alignment or data type size may occur and cause an ABI incompatibility between a plug-in's host and DSP components:
Nested structures
It can be particularly difficult to debug alignment issues in nested data structures. One reason is that nested structs do not necessarily have the same alignment as the parent struct. A nested structure will have the alignment that is set preceding its declaration, not the alignment of the structure in which it is contained.
Aside from avoiding nested structs entirely, one way to avoid potential issues is to make sure that nested structs always contain a double. This will guarantee that the structure is double-word aligned. We have also found that placing nested structs near the beginning of the parent struct results in more consistent alignment between Intel and TI compilers, even in cases where the actual alignment of each member is strictly ambiguous according to the standard.
Another important rule of thumb with nested structs is to define them inline in the enclosing structure. We have found that including one data structure as a member in another data structure will only be reliably aligned between Visual Studio and the TI compiler tools if the member structure's type is defined in-line. This does not appear to be an issue between clang and the TI compiler - the data structure alignment for the nested structure is consistent between those two compilers regardless of the location of the internal structure's definition.
#include AAX_ALIGN_FILE_ALG
struct SomeStruct
{
float a;
float b;
};
#include AAX_ALIGN_FILE_RESET
#include AAX_ALIGN_FILE_ALG
class SomeClass
{
public:
SomeStruct s;
};
#include AAX_ALIGN_FILE_RESET
Listing 3: Problematic code: nested struct not defined in-line
#include AAX_ALIGN_FILE_ALG
class SomeClass
{
public:
struct SomeStruct
{
float a;
float b;
} s;
};
#include AAX_ALIGN_FILE_RESET
Listing 4: Fixed code: nested struct defined in-line
Usage of pragma pack
If you use pragmas to align your structs, be aware that in most cases a pack pragma can only reduce a struct's alignment below the compiler's natural alignment; it cannot increase it. That means that if you have
#pragma pack(8)
struct x
{
char a;
float b;
};
Listing 5: Example of usage of #pragma pack where it has no effect
then struct x most likely won't be aligned to an 8-byte boundary. The pack pragma is therefore not really useful for addressing alignment issues. Instead of using pack, one way to guarantee that a structure is double-word aligned is to include at least one double member.
#pragma pack(8)
struct x
{
float a;
double b;
};
Listing 6: Example of usage of #pragma pack where it actually affects the alignment of the structure
In this case data will be double-word aligned.
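The difference between the two listings above can be checked on the host with a short sketch. The struct names are hypothetical, and the expected values assume a typical 64-bit host compiler. Since alignof and static_assert require C++11, this check is meant for the host build only.

```cpp
// A pack value of 8 cannot raise alignment: this struct still follows its
// widest member (float), which is 4-byte aligned on typical compilers.
#pragma pack(push, 8)
struct SExampleCharFloat
{
    char  a;
    float b;
};
#pragma pack(pop)

// No pragma is needed here: the double member makes the whole struct
// double-word (8-byte) aligned on typical 64-bit host compilers.
struct SExampleFloatDouble
{
    float  a;
    double b;
};

// Host-side checks, assuming a typical 64-bit host.
static_assert(alignof(SExampleCharFloat) == 4, "expected 4-byte alignment");
static_assert(alignof(SExampleFloatDouble) == 8, "expected 8-byte alignment");
```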
Dynamic allocation of memory in structures and algorithm
The problem with dynamic allocation is that it's difficult to enforce specific alignment of the resulting block beyond the natural alignment of the structure. Newly allocated blocks are not double-word aligned by default. This prevents double-word memory access optimizations (see
Additional data type optimizations) from working.
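// Problem: the block returned by new is not guaranteed to be double-word aligned
float* floatBlock = new float[100];
delete[] floatBlock;
// Solution: request 8-byte alignment explicitly
float* floatBlock2 = alignMalloc<float>(100, 8);
alignFree(floatBlock2);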
Listing 7: Problems which may arise when using dynamic allocation of memory in algorithm
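To make clearer what an aligned allocator has to do, here is a minimal host-side sketch. ExampleAlignedAlloc and ExampleAlignedFree are hypothetical helpers written for this illustration. They are not the SDK's alignMalloc and alignFree, whose implementation is not shown here. The sketch over-allocates, rounds the address up to the requested alignment, and stores the original pointer just before the aligned block so that it can be freed later.

```cpp
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: returns a block of inBytes bytes whose address is a
// multiple of inAlignment. inAlignment must be a power of two.
inline void* ExampleAlignedAlloc(size_t inBytes, size_t inAlignment)
{
    // Reserve extra room for the alignment adjustment and the saved pointer.
    void* raw = malloc(inBytes + inAlignment + sizeof(void*));
    if (raw == 0) { return 0; }

    // Round up to the next aligned address past the saved-pointer slot.
    uintptr_t start = reinterpret_cast<uintptr_t>(raw) + sizeof(void*);
    uintptr_t mask = static_cast<uintptr_t>(inAlignment) - 1;
    uintptr_t aligned = (start + mask) & ~mask;

    // Remember the original malloc pointer directly before the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Hypothetical helper: frees a block returned by ExampleAlignedAlloc.
inline void ExampleAlignedFree(void* inPtr)
{
    if (inPtr != 0) { free(reinterpret_cast<void**>(inPtr)[-1]); }
}
```

A block obtained this way must be released with the matching free function, because the aligned address is generally not the address that malloc returned.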
Incorrect use of pointer data
In general, you should avoid storing pointers to anything in any data structures that are passed between the host and the DSP. There are many possible problems and bugs that can be caused by this, for example:
-
Often the memory map of packets can change out from under the plug-in
-
It is easy to accidentally reference data in the wrong memory space when setting pointer values
-
Pointer data types are not explicitly sized (see below.)
One alternative to using raw data pointers is to store data offsets into a coefficient array rather than using direct pointers to other structure elements. A solution such as this that does not involve pointer data types will almost always end up being easier to implement, easier to troubleshoot, and easier to maintain than a solution that uses pointer data.
That said, if you must use pointer data types in any data structures that are passed between the AAX host and DSP components then you should be very careful to avoid the problems listed above.
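The offset-based alternative described above might look like the following sketch. The struct, member, and function names are hypothetical. The point is that an explicitly-sized index replaces a pointer into the coefficient array, so the field has the same size and meaning on the host and on the DSP.

```cpp
#include <stdint.h>

enum { kExampleNumCoefs = 8 };

// Hypothetical data block. Instead of storing "float* mActiveCoefP", which
// would only be valid in one memory space, store an index into mCoefs.
struct SExampleCoefBlock
{
    float   mCoefs[kExampleNumCoefs];
    int32_t mActiveCoefOffset;  // index into mCoefs, 4 bytes on every platform
};

// Resolves the offset to a value on whichever side is reading the block.
inline float ExampleActiveCoef(const SExampleCoefBlock& inBlock)
{
    return inBlock.mCoefs[inBlock.mActiveCoefOffset];
}
```

The offset stays valid when the block is copied between memory spaces, which a raw pointer would not.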
Pointer data size incompatibility
Problems due to pointer data size incompatibility can be particularly difficult to debug. Pointer data types are not explicitly sized in C, and, starting with the 64-bit Pro Tools 11 release, pointers will have different lengths for host and TI binaries. This can cause subtle portability problems in certain circumstances, if proper care is not taken.
Consider the following state block:
struct SMyPlugInStateBlock
{
float mInGain_Smoothed;
some_t* mPointerP;
float mOutGain_Smoothed;
};
Notice the pointer mPointerP
(the type that it points to is irrelevant for this discussion). Perhaps it is a pointer that can reference different sets of coefficients, or perhaps it points to some sort of global variable. In any case, this pointer is 64 bits long on the host and 32 bits long on the TI.
In most cases, this won't cause a problem because the host simply allocates a bit more space for the state block than the TI needs and fills the allocated memory with 0s. But consider the case where we overload
ResetFieldData() to set
mOutGain_Smoothed
to something other than 0:
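AAX_Result ResetFieldData(AAX_CFieldIndex inFieldIndex, void* inData, uint32_t inDataSize) const AAX_OVERRIDE
{
    AAX_Result result = AAX_SUCCESS;
    switch (inFieldIndex)
    {
        case (eMyAlgFieldIndex_State):
        {
            // Clear the whole block, then start the smoothed gain at its target
            memset(inData, 0, inDataSize);
            SMyPlugInStateBlock* stateP = static_cast<SMyPlugInStateBlock*>(inData);
            stateP->mOutGain_Smoothed = mOutGain_Target;
            break;
        }
        default:
        {
            break;
        }
    }
    return result;
}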
We might be doing this if mOutGain_Smoothed
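was a smoothing parameter and we want to start it at the target gain value (rather than having it smooth from 0.0 at instantiation). But the host writes this value at the member's host-side offset, which is typically 16 bytes into the block when the pointer occupies 8 bytes, while the TI reads it from offset 8. If the host and the TI can't agree on where in the state block mOutGain_Smoothed is located, then the result will be unexpected behavior that is difficult to debug.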
The most direct way to avoid this problem is to use an explicitly-sized 32-bit type for any pointers in your state block:
struct SMyPlugInStateBlock
{
float mInGain_Smoothed;
uint32_t mPointerP;
float mOutGain_Smoothed;
};
It will be necessary to use reinterpret_cast<float*>(stateP->mPointerP)
to recast the pointer to a pointer data type on the TI, but that should not result in any extra processing cycles.
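The offset mismatch can be made visible with a small host-side sketch. The two struct names below are hypothetical copies of the state block above, one with a raw pointer and one with the explicitly-sized replacement. The typedef with a conditional array size is a compile-time check that does not require C++11.

```cpp
#include <stddef.h>
#include <stdint.h>

// Layout that depends on pointer size.
struct SExampleStateWithPointer
{
    float  mInGain_Smoothed;
    float* mPointerP;          // 4 bytes on TI, 8 bytes on a 64-bit host
    float  mOutGain_Smoothed;
};

// Layout that is the same everywhere.
struct SExampleStateFixed
{
    float    mInGain_Smoothed;
    uint32_t mPointerP;        // always 4 bytes
    float    mOutGain_Smoothed;
};

// Compile-time check: fails to compile (negative array size) if the fixed
// layout ever places mOutGain_Smoothed somewhere other than offset 8.
typedef char ExampleCheck_FixedOffset[
    (offsetof(SExampleStateFixed, mOutGain_Smoothed) == 8) ? 1 : -1];
```

On a typical 64-bit host, offsetof(SExampleStateWithPointer, mOutGain_Smoothed) evaluates to 16 rather than 8, which is exactly the disagreement described above.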
Alignment Reference
These are the data type sizes and default alignments for some common compilers when compiling for 64-bit binary formats:
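Type | TI size | TI alignment | MS Visual C++ size | MS Visual C++ alignment | C++ Builder size | C++ Builder alignment | GCC size | GCC alignment |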
char | 1 byte | 1-byte aligned | 1 byte | 1-byte aligned | 1 byte | 1-byte aligned | 1 byte | 1-byte aligned |
short | 2 bytes | 2-byte aligned | 2 bytes | 2-byte aligned | 2 bytes | 2-byte aligned | 2 bytes | 2-byte aligned |
int | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned |
long | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned |
long long | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned |
bool | 1 byte | 1-byte aligned | 1 byte | 1-byte aligned | 1 byte | 1-byte aligned | 1 byte | 1-byte aligned |
float | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned | 4 bytes | 4-byte aligned |
double | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned |
long double | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 16 bytes | 16-byte aligned |
pointer | 4 bytes | 4-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned | 8 bytes | 8-byte aligned |
Also here are some useful links to web resources on the topic:
TI Optimization Guide
Optimizing AAX real-time algorithms for Avid's TI-based platforms is very similar to optimizing real-time algorithms for any architecture. When developers think about optimization, they often think "I want to make my code run faster". In reality, however, optimization is about making the processor do less. After all, the processor's clock rate is fixed, so it can only perform a limited number of instructions in a set amount of time. Therefore, our focus in this section will be on helping the compiler produce code with shorter execution paths and make full use of the TI chip's architecture.
Modern compilers have become extremely good at optimizing code, which is fortunate given the complicated architectures of today's DSP products. In this section we will not focus on instruction-level "optimizations" like the one below, which the compiler performs automatically. Little "tricks" like this do not make the code any faster; they just make it harder to read:
Listing 8: The kind of optimization that you won't be seeing in this section
Rather, we will focus on refactoring audio processing algorithms to be more efficient and on giving the TI compiler better information about the code, pointers, and data it is working with so it can perform more effective compile-time optimizations.
Finally, our optimization efforts will focus on the worst-case code path. For example, developers often try to optimize algorithms by conditionally bypassing portions of code that may be disabled by particular parameter states. This is counter-productive, because the system has to assume a plug-in's worst-case execution performance regardless of how much time the plug-in is actually using. Therefore, in the context of real-time algorithms running on AAX DSP platforms, it is best to only worry about worst-case execution time.
- Note
- The optimizations described in this section assume that you are using version 7 or higher of TI's C6000 Code Generation Tools (CGTools). We strongly recommend using v7.0.5 as earlier versions throw linking errors.
Optimization quick start
Here is a quick outline of the general optimization steps for an AAX DSP algorithm:
-
Before beginning your DSP optimizations, make sure that your Native algorithm has basic optimizations in place. In our experience, beginning the TI optimization process with a slow or needlessly precise Native algorithm will result in a long porting process. Here are some suggestions for common Native optimizations:
-
Identify unnecessary double precision
-
Identify tables with unnecessarily fine granularity
-
Make sure your compiler Release settings enable the compiler to optimize fully and give full optimization comments:
-k -s -pm -op3 -os -o3 -mo -mw --consultant --verbose -mv67p
-
Use the load/update/store design pattern to reduce memory accesses in inner loops
-
Move any processing that does not directly depend on the audio signal out of the real-time algorithm
-
Declare non-changing variables and pointers (both local and in parameter lists) as const
-
Declare non-aliased pointers (both local variables and function parameters) as AAX_RESTRICT
-
Change any long variables to int, and change double variables to float if the reduced precision does not affect signal integrity (usually defined as cancellation with the plug-in's Native algorithm).
-
Restructure inner processing loops so that they do not contain large conditional statements or other branches
-
Declare any functions that are called within the innermost processing loop as inline in order to allow the inner loops to pipeline
-
Add loop count information when known, using #pragma MUST_ITERATE(min,max,quant)
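As a rough illustration of the last step, the sketch below attaches a loop-count hint to a simple gain loop. The function name `ApplyGain` and the assumed trip count (4 to 64 samples, always a multiple of 4) are hypothetical and chosen only for this example; the pragma is emitted only when the `_TMS320C6X` macro is defined, so the same source also builds with desktop compilers.

```cpp
#include <cassert>

// Assumption for this sketch: the caller always passes between 4 and 64
// samples, in multiples of 4. Use the real limits of your own algorithm.
inline void ApplyGain(const float* const input,
                      float* const output,
                      const float gain,
                      const int nsamp)
{
    assert(nsamp >= 4 && nsamp <= 64 && (nsamp % 4) == 0);

#if defined(_TMS320C6X)
    // Tells the TI compiler the minimum, maximum, and multiple of the trip count.
#pragma MUST_ITERATE(4, 64, 4)
#endif
    for (int i = 0; i < nsamp; ++i)
    {
        output[i] = input[i] * gain;
    }
}
```

The hint is a promise to the compiler. If the loop can ever run with a count outside the stated range, the pragma should be relaxed or removed.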
Compiler and linker options
As with any complex environment, many performance gains on the TI rely on the appropriate compiler and linker options. The options documented here will allow CGTools to apply its optimization logic to your algorithm.
When tweaking compiler options on the TI, keep in mind that, as on any CPU, it is useless to optimize Debug code or to profile its performance. This is especially true on TI processors because the generated Debug and Release assembly are almost completely different, assuming that heavy optimization options were chosen for the Release configuration.
In general, all recommended compiler options should be set correctly in the AAX SDK's example plug-in projects, and these settings may be used as a guide for your own plug-in projects. See the SDK files CommonPlugIn_CompilerCmd.cmd and CommonPlugIn_LinkerCmd.cmd for the latest recommended settings.
Overview of optimization-related compiler options
-
-g
Full symbolic debug. This setting should be used in debug configurations to make stepping through code easier. It should not be defined in release configurations, as it will prevent the compiler from being able to fully optimize code.
-
-k
Keep generated .asm files. This should be turned on in release configurations so that you can use the ASM output as feedback when making optimization decisions and performance improvements.
-
-d"_DEBUG"
Defines the _DEBUG preprocessor macro that alters how certain code is generated (asserts, stdlib, etc). This should be turned on in debug configurations only. Note that TI does not require NDEBUG to be defined in release configurations.
- Note
- This will eventually be deprecated in favor of the pre-defined "_TMS320C6X" macro.
-
-mv67p
Specifies that the compiler should build code for the C67x+ chip variant we are using, which has some improvements beyond the original C67x. This option should be enabled in all build configurations that target the HDX platform.
-
-s
Specifies Opt-C/ASM interlisting. This interweaves modified C-code and ASM in the .ASM file produced by the -k option. You should use -s in release configurations so that the ASM file can be read more easily.
- Note
- Do NOT use the -ss option in release configurations. This option will negatively affect optimization.
-
-pm
Program mode compilation. Instructs the C compiler to compile all files in the same compilation unit, so that it can optimize code further using information from all files being compiled. See Program Mode optimization (-pm) for more information.
-
-op3
A modifier for the -pm option, this specifies that there are no external variable references in the project. This option is appropriate for TI algorithms, which do have an external function reference (the process entry point) but do not have external variable references. This option allows the compiler to further optimize global variables without worrying whether they will be accessed outside of the compilation unit. See Program Mode optimization (-pm) for more information.
-
-o3
File-level optimization. This flag gives the compiler full ability to optimize C-code by reordering instructions, inlining functions, and performing other optimizations. Note that the resulting ASM code will be very difficult to parse back into the original C and will make debugging very difficult, so this flag should only be used for Release code. See Optimization flags (-o) for more information.
-
-mo
Use Function Subsections. This instructs the compiler to place all functions into their own separate subsection in the linker map. This allows the linker to remove unused functions in order to reduce memory usage.
-
-mw
Generate a single iteration view of SP loops. This flag adds important information to the ASM output file that is useful when optimizing your code for pipelined loops.
-
--verbose
Output verbose status messages when compiling files. Though not very useful for humans, verbose output will produce some key information that text parsers can use, such as compiler versions and other details.
Overview of optimization-related linker options
-
--relocatable
Generate a relocatable non-executable.
-
-m"file.map"
Generate a map file. This file contains useful information about the memory footprint of your plug-in, which is useful for fixing large plug-ins that may not have fit into available program memory.
-
-w
Warn about output sections. This flag generates very useful information that tells you if there might be a problem with memory output sections you are trying to generate.
-
-x
Exhaustively read libraries. This is a useful flag if you do not want to worry about the order in which you specify required libraries.
Optimization flags (-o)
-
Register (-o0)
This option allows for some performance gains over non-optimized code by allocating variables to registers, inlining functions declared inline, etc.
-
Local (-o1)
This option enables local optimizations, with very similar results to the register-level optimizations of -o0.
-
Function (-o2)
This is the standard optimization level, and provides large gains over unoptimized code. This optimization level allows function-level optimizations such as software pipelining, loop optimization/unrolling, etc.
-
File (-o3)
This option can provide some speedup beyond function-level optimizations, but also mutilates assembly code beyond recognition. At this optimization level the compiler will remove unused functions, simplify code in the case of unused return values, auto-inline small functions, etc.
Like the corresponding Visual Studio options, -o0 and -o1 allow you to step through code line-by-line for debugging, at the cost of reduced performance. -o2 and -o3 sacrifice the ability to step through code and watch memory in favor of optimized code.
Program Mode optimization (-pm)
Program mode optimization gives the compiler further optimization information by compiling all files at once rather than individually. Thus global constants, function implementations, etc. can be made known to the entire program at compilation. This allows the compiler to inline functions more effectively and to determine loop unrolling based on constant loop iterators.
There are a few -pm options:
-
-pm -op0
Contains functions and variables that are called or modified from outside the source code provided to the compiler.
-
-pm -op1
Contains variables modified from outside the source code provided to the compiler but does not use functions called from outside the source code.
This option is not appropriate for AAX plug-in algorithms, because the algorithm component will be exported and called from outside the compiled source code.
-
-pm -op2
Contains no functions or variables that are called or modified from outside the source code provided to the compiler.
This option is not appropriate for AAX plug-in algorithms, because the algorithm component will be exported and called from outside the compiled source code.
-
-pm -op3
Contains functions that are called from outside the source code provided to the compiler but does not use variables modified from outside the source code.
This is the recommended Program Mode optimization level for TI plug-ins. This optimization level requires that no global variables are used outside of the algorithm callback. In general, any such variables should be passed in to a TI algorithm via the algorithm's context structure.
Compiler options to avoid
The following information was taken from the TMS320C6000 Programmer's Guide:
-
-g/-s/-ss
These options limit the amount of optimization across C statements, leading to larger code size and slower program execution.
-
-mu
This option disables software pipelining for debugging. If a reduction in code size is necessary, use the -ms2/-ms3 options. These options will disable software pipelining among their other code size optimizations.
-
-mz
This option is obsolete. When using 3.00+ compilers, this option will decrease performance and increase code size.
The load-update-store pattern
The load-update-store pattern is one of the cornerstones of a fast iterative algorithm. This pattern specifies that locally accessed data should be loaded from memory into local variables at the start of processing, updated locally during processing, and stored back to memory after processing has completed. By using this pattern you will move memory reads and writes outside of your plug-in's innermost processing loop, which reduces data dependencies and shortens the critical inner loop.
As an example, consider the following unoptimized filter code:
inline void
ProcessDirectFormII(float* input, float* output, float* state, float* coefs, int nsamp)
{
    for (int i = 0; i < nsamp; ++i)
    {
        output[i] = input[i]*coefs[eB0] + state[0];
        state[0] = input[i]*coefs[eB1] + state[1] - output[i]*coefs[eA0];
        state[1] = input[i]*coefs[eB2] - output[i]*coefs[eA1];
    }
}
Listing 9: Unoptimized filter algorithm
Notice that in this code there are at least 15 memory accesses per loop iteration! This algorithm will be very inefficient as the value of nsamp increases.
The compiler should be able to optimize this algorithm to some extent by pulling certain memory accesses outside of the loop. However, the compiler cannot completely optimize the loop because it must assume that the input/output/state/coefs pointers are aliased in memory. We will discuss the const and restrict keywords later, which are ways to give the compiler additional information it can use to optimize this loop. However, for now let's focus back on the basic design of this code.
Using load-update-store, we can refactor this loop to pull the memory accesses outside of the loop:
void
ProcessDirectFormII(float* input, float* output, float* state, float* coefs, int nsamp)
{
    // Load: read coefficients and state from memory once
    float coefA0 = coefs[eA0];
    float coefA1 = coefs[eA1];
    float coefB0 = coefs[eB0];
    float coefB1 = coefs[eB1];
    float coefB2 = coefs[eB2];
    float state0 = state[0];
    float state1 = state[1];
    float out; // local result; named "out" so it does not clash with the output pointer
    // Update: the loop only reads input[i] and writes output[i]
    for (int i = 0; i < nsamp; ++i)
    {
        out = input[i] * coefB0 + state0;
        state0 = input[i] * coefB1 + state1 - out * coefA0;
        state1 = input[i] * coefB2 - out * coefA1;
        output[i] = out;
    }
    // Store: write the updated state back once
    state[0] = state0;
    state[1] = state1;
}
Listing 10: Refactored filter algorithm with load-update-store pattern applied. Not fully optimized.
Though the code initially appears longer, you will notice that we have reduced the loop to only 4 memory accesses! Though we have an additional 9 memory accesses outside the loop, they will only occur once per function call, resulting in significant savings at higher values of nsamp.
- Note
- We are not finished with this loop yet, because we can make some very significant gains by using the restrict and const keywords, as discussed in the section on C keywords.
Before moving on from load-update-store, let's consider how this pattern should be applied to different categories of data that may be provided in an AAX DSP processing context:
-
Coefficients and parameters
Coefficients and parameters are read-only by definition. As such, they should be loaded into a local variable at the beginning of the algorithm callback and should not be modified further.
-
Private state
State parameters are writable and may be changed by the algorithm. Therefore, private state data should be loaded into a local variable copy, then stored back into memory after the local copy is updated.
-
Output
Output is write-only, so all calculations may be performed on a local variable and then stored into memory once per loop.
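The three categories above can be sketched together in one small callback. This is an illustrative sketch only: `SketchContext`, its field names, `ProcessSmoothedGain`, and the smoothing factor of 0.25 are all hypothetical and do not come from the AAX SDK. A real context structure is defined by the plug-in and described to the host.

```cpp
#include <cassert>

// Hypothetical context layout, for illustration only.
struct SketchContext
{
    const float* mInput;      // audio input buffer
    float*       mOutput;     // audio output buffer (write-only)
    const float* mTargetGain; // coefficient from the data model (read-only)
    float*       mGainState;  // private state: current smoothed gain
    int          mNumSamples;
};

inline void ProcessSmoothedGain(const SketchContext& ctx)
{
    assert(ctx.mNumSamples >= 0);

    // Load: read the coefficient and the private state once, before the loop.
    const float* const in  = ctx.mInput;
    float* const       out = ctx.mOutput;
    const float target     = *ctx.mTargetGain;
    const float smoothing  = 0.25f; // assumed smoothing factor
    float gain             = *ctx.mGainState;

    // Update: the loop touches only the audio buffers and local variables.
    for (int i = 0; i < ctx.mNumSamples; ++i)
    {
        gain += smoothing * (target - gain);
        const float sample = in[i] * gain;
        out[i] = sample; // one store per iteration
    }

    // Store: write the private state back once, after the loop.
    *ctx.mGainState = gain;
}
```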
Case study: IIR filter implementation on TI 672x DSPs
In this section we will examine various IIR filter implementations as a specific example of the considerations that must be made when optimizing DSP code for the 672x.
The TI 67xx family of DSPs is notably different from some other typical DSP processors, such as the 56k and the Intel FPU, in that the TI DSP does not have an implicit higher-precision multiply-accumulate. It is of course capable of double precision accumulation, but this must be coded explicitly. In some ways, this is similar to the Intel SSE processing unit, which jettisoned the 80-bit floating point stack used in the Intel FPU. The lack of higher precision accumulation in TI (and SSE) can sometimes result in unacceptable quantization noise performance for single precision filter implementations. Luckily, with the right choice of filter structure or coding for explicit double precision accumulation, excellent results can be achieved.
On fixed-point DSPs such as 56k, Direct Form I (DF1) implementation is the standard due to moderately good fixed point scaling properties, decent noise performance, and simple implementation. However, on a 672x DSP a single precision DF1 filter can have terrible noise performance (depending on the filter coefficients and the audio material being processed.) A degenerate case is a DF1 highpass filter processing low frequency material; in DF1, the feedforward coefficients subtract the previous sample from the current sample, and for low frequency material this produces very small numbers with low precision. Single precision DF2 structures also produce similarly poor results in this respect.
One option to improve upon these results is to use double precision throughout the 672x filter implementation. However, this results in a heavy cycle performance penalty due to the high cost of double operations on the TI DSP. Another, often better, option is to use single precision coefficients and state, with double precision accumulation:
float in, b0, b1, a1, state1;
double accum ;
accum = double (b0) * double (in) +
double (b1) * double (state1) +
double (a1) * double (accum);
state1 = in;
Listing 11: Mixed-precision DF1 filter implementation
The TI compiler will implement this using the mpysp2dp instruction, since it knows that the operands started out as single precision and end up as double precision. This is considerably faster than going to a full double precision implementation, but it is still relatively slow compared to straight single precision. Making the state double precision will improve noise performance further, with some increase in cycle usage.
Another option that generally gets good results is the single precision DF2 Transpose (DF2T) filter. On TI the DF2T implementation is fast and generally has good noise performance. If you are looking for a simple recommendation that should work well enough for most applications, DF2T is a good choice.
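For reference, a single-precision DF2T biquad might look like the sketch below. It is written for this guide's load-update-store style and is not taken from the TI filter library or the AAX SDK; the `BiquadCoefs` and `BiquadState` types and the `ProcessDF2T` name are hypothetical, and the coefficients are assumed to be normalized so that a0 = 1.

```cpp
#include <cassert>

// Hypothetical coefficient and state layouts for this sketch (a0 normalized to 1).
struct BiquadCoefs { float b0, b1, b2, a1, a2; };
struct BiquadState { float s0, s1; };

inline void ProcessDF2T(const float* const input, float* const output,
                        const BiquadCoefs& coefs, BiquadState& state,
                        const int nsamp)
{
    assert(nsamp >= 0);

    // Load coefficients and state into locals.
    const float b0 = coefs.b0;
    const float b1 = coefs.b1;
    const float b2 = coefs.b2;
    const float a1 = coefs.a1;
    const float a2 = coefs.a2;
    float s0 = state.s0;
    float s1 = state.s1;

    for (int i = 0; i < nsamp; ++i)
    {
        const float x = input[i];
        const float y = b0 * x + s0; // output uses the first state directly
        s0 = b1 * x - a1 * y + s1;   // state updates use the new output
        s1 = b2 * x - a2 * y;
        output[i] = y;
    }

    // Store the state back once.
    state.s0 = s0;
    state.s1 = s1;
}
```

With b0 = 1, a1 = -0.5, and all other coefficients zero, this implements y[n] = x[n] + 0.5 y[n-1], so a unit impulse produces 1, 0.5, 0.25, 0.125.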
The optimized C filter library available from TI uses the DF2 structure in its implementation. Even though DF2 has some limitations, this is a good starting point for seeing how to optimize filter code on TI; peak performance on TI is 2.25 cycles per biquad, so it's pretty amazing what can be done (to achieve that level of performance, multiple series or parallel biquads need to be put in a tight loop). We have adapted some of this filter code to DF2T, and still achieved fairly similar cycle performance.
If the single precision DF2T noise performance is not good enough for your application, then either double precision or one of the myriad other filter structures, such as State Space, Gold-Rader, Lattice or Zolzer, should do the job. In fact, there is one relatively new filter structure which we think stands out, called the Direct Wave Form (DWF) filter. Details about this filter structure can be found in Direct Wave Form Digital Filter Structure: an Easy Alternative for the Direct Form by Jean H.F. Ritzerfeld. According to the author, the noise performance is within 3 dB of optimal, and the structure is relatively efficient (5 multiplies per biquad), free of limit cycles, and has simple coefficient generation and low coefficient quantization sensitivity. It might just be the perfect filter structure, but we'll let you be the judge of that; keep in mind that all filter structures have some tradeoffs, and the recommendations made here might not be the best for your particular application.
Understanding CGTools-generated ASM files
The ability to read the ASM files that are generated by CGTools is essential when optimizing a TI algorithm. Specifically, the information in these files will allow you to determine if anything is preventing software pipelining from occurring, which is the single most effective form of optimization on the C6727.
To view your project's ASM file, turn on the -k compiler option ("Keep Generated .asm Files", found under Build Options > Compiler > Assembly in the Code Composer Studio IDE). By default, ASM files will be placed in the same directory as the corresponding source file.
- Note
- You should only examine ASM listings of Release code that has been optimized by the compiler. Debug code should not be optimized.
Each ASM file for a TI algorithm callback should contain text that marks the start of the assembly listing for the processing loop. For example:
;**********************************************************************
;* FUNCTION NAME:
;*____________________________________________________________________*
;* Regs Modified: A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14, _*
;*________________A15,B0,B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12, _____*
;* _______________B13,SP,A16,A17,A18,A19,A20,A21,A22,A23,A24,A25, ____*
;*________________A26,A27,A28,A29,A30,A31,B16,B17,B18,B19,B20,B21, ___*
;* _______________B22,B23,B24,B25,B26,B27,B28,B29,B30, B31 ___________*
;* Regs Used____: A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14, _*
;* _______________A15,B0,B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12, _____*
;* _______________B13,DP,SP,A16,A17,A18,A19,A20,A21,A22,A23,A24, _____*
;* _______________A25,A26,A27,A28,A29,A30,A31,B16,B17,B18,B19,B20, ___*
;* _______________B21,B22,B23,B24,B25,B26,B27,B28,B29,B30,B31 ________*
;* Local Frame Size: 0 Args + 148 Auto + 44 Save = 192 byte __________*
;**********************************************************************
Listing 12: CGTools-generated header for a processing loop assembly listing
Within this listing, you are looking for several things:
-
Function calls
-
Branches or control code
-
Software pipelining notes
Function calls
[!B0] CALL .S1 __divd ; |213|
|| [!B0] MVKH .S2 0x40080000 ,B5 ; |213|
|| [ B0] MV .L1X B10 ,A4 ; |213|
$C$RL9 : ; CALL OCCURS {__divd} ; |213|
Listing 13: Function call in a CGTools-generated assembly listing
Function calls, such as the call in the listing above, cannot be effectively pipelined. If you find a function call, figure out which C statement causes it. Sometimes a function call will be made implicitly, such as when casting from float to int or when doing division. All function calls should be removed from the processing loop or inlined in order for the compiler to optimize effectively.
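One common way to remove an implicit division call from a loop is to divide once outside the loop and multiply inside it. The sketch below assumes that the divisor is constant for the whole buffer; `ScaleByReciprocal` is a hypothetical name used only for this example.

```cpp
#include <cassert>

// Divides every sample by 'divisor'. The division is performed once, outside
// the loop, so the loop body contains only a multiply.
inline void ScaleByReciprocal(const float* const input, float* const output,
                              const float divisor, const int nsamp)
{
    assert(divisor != 0.0f);

    const float reciprocal = 1.0f / divisor; // single division, hoisted out of the loop
    for (int i = 0; i < nsamp; ++i)
    {
        output[i] = input[i] * reciprocal;
    }
}
```

Multiplying by a reciprocal is not always bit-identical to dividing, so check the result against the Native algorithm if exact cancellation is required.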
Branches
NOP 1
B .S1 $C$L5 ; |213|
NOP 4
MPYDP .M1X A5:A4 ,B5:B4 ,A11:A10 ; |213|
|| LDW .D2T2 *+ SP (124) ,B5 ; |218|
; BRANCH OCCURS { $C$L5 } ; |213|
Listing 14: Branch in a CGTools-generated assembly listing
Branches can also prevent loop pipelining. If you find a branch in your algorithm's assembly, determine whether it is preventing the compiler from pipelining a loop. If it is preventing pipelining, you must figure out how to rewrite the conditional in your C code so that it will not be compiled into a branch.
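One possible rewrite, sketched below, replaces a per-sample bypass conditional with a pair of gains that are computed once outside the loop. The loop body then always executes the same instructions, which also matches the worst-case focus described earlier. `ProcessWithBypass` and the `drive` multiply that stands in for the real processing are hypothetical.

```cpp
#include <cassert>

inline void ProcessWithBypass(const float* const input, float* const output,
                              const bool bypassed, const float drive,
                              const int nsamp)
{
    assert(nsamp >= 0);

    // Convert the condition to a pair of gains once, outside the loop.
    const float wetGain = bypassed ? 0.0f : 1.0f;
    const float dryGain = 1.0f - wetGain;

    for (int i = 0; i < nsamp; ++i)
    {
        const float dry = input[i];
        const float wet = dry * drive; // stand-in for the real processing
        output[i] = dry * dryGain + wet * wetGain; // no conditional in the loop
    }
}
```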
Software pipelining notes
For each loop the compiler finds and is able to pipeline, the .ASM file should contain a section similar to the one below:
;*--------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 68
;* Loop opening brace source line : 69
;* Loop closing brace source line : 124
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound (^) : 15
;* Unpartitioned Resource Bound : 20
;* Partitioned Resource Bound (*) : 20
;* Resource Partition :
;* A- side B- side
;* .L units 0 0
;* .S units 0 1
;* .D units 20* 20*
;* .M units 7 5
;* .X cross paths 5 6
;* .T address paths 20* 20*
;* Long read paths 5 1
;* Long write paths 0 0
;* Logical ops (. LS) 5 4 (.L or .S unit )
;* Addition ops (. LSD) 0 1 (.L or .S or .D unit )
;* Bound (.L .S .LS) 3 3
;* Bound (.L .S .D .LS .LSD) 9 9
;*
;* Searching for software pipeline schedule at ...
;* ii = 20 Schedule found with 3 iterations in parallel
Listing 15: Pipelined loop header in a CGTools-generated assembly listing
These are the important items to note in this listing:
-
Loop Carried Dependency Bound and Partitioned Resource Bound
The maximum of these numbers is the minimum number of clock cycles one instance of the loop will require in its current form. You can reduce these numbers by performing some of the optimizations listed in this guide.
-
Loop Unroll Multiple
This line will appear if the compiler is partially unrolling the loop to improve performance.
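As a worked example using the numbers in Listing 15: the loop-carried dependency bound is 15 and the partitioned resource bound is 20, so the iteration interval cannot be lower than 20 cycles, which matches the `ii = 20` schedule that the compiler found. Assuming the usual reading of the unroll multiple, each pass through the 2x unrolled loop covers two iterations of the source loop:

```latex
\mathrm{ii}_{\min} = \max(15,\ 20) = 20 \text{ cycles},
\qquad
\frac{20 \text{ cycles}}{2 \text{ source iterations}} = 10 \text{ cycles per source iteration}
```

The starred entries in the resource table (.D units and .T address paths, 20 each) are the ones that set the resource bound, which suggests that this particular loop is limited by its memory accesses rather than by its arithmetic.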
If a loop section instead displays Disqualified loop: then some of the conditions required to enable software pipelining have not been met:
-
-o2 or -o3 optimizations must be enabled
-
The loop cannot contain a function call. Make all called functions inline.
-
The loop cannot contain any branches or jumps, often caused by large conditional statements
-
Software pipelining will not work with nested loops; only the innermost loop will be pipelined. You should completely unroll the inner loop or refactor the algorithm so that the loop can be pipelined
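The last point can be illustrated with a small fixed-length FIR. In the nested form only the three-iteration tap loop is a pipelining candidate; after unrolling the tap loop by hand, the sample loop becomes the innermost loop. Both functions are hypothetical sketches, and the input buffer is assumed to hold nsamp + 2 samples so that the filter history is available.

```cpp
#include <cassert>

// Nested form: the tap loop is the innermost loop.
inline void Fir3Nested(const float* const x, float* const y,
                       const float* const taps, const int nsamp)
{
    for (int i = 0; i < nsamp; ++i)
    {
        float acc = 0.0f;
        for (int t = 0; t < 3; ++t)
        {
            acc += taps[t] * x[i + 2 - t];
        }
        y[i] = acc;
    }
}

// Unrolled form: the tap loop is written out, so the sample loop is innermost.
inline void Fir3Unrolled(const float* const x, float* const y,
                         const float* const taps, const int nsamp)
{
    const float t0 = taps[0];
    const float t1 = taps[1];
    const float t2 = taps[2];
    for (int i = 0; i < nsamp; ++i)
    {
        y[i] = t0 * x[i + 2] + t1 * x[i + 1] + t2 * x[i];
    }
}
```

The compiler may perform this unrolling by itself for small constant trip counts; the generated ASM file shows whether it did.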
C keywords
There are a few keywords in C that give the compiler additional information about the variables you declare and parameters you pass into functions. This allows the compiler to further optimize the code it is compiling, which can result in significant performance gains.
const
Effective use of const lets the compiler know whether pointers, scalars, or objects will remain constant in memory.
void
ProcessDirectFormII(
    const float * const input,
    float * const output,
    float * const state,
    const float * const coefs,
    int nsamp)
{
    const float coefA0 = coefs[eA0];
    const float coefA1 = coefs[eA1];
    const float coefB0 = coefs[eB0];
    const float coefB1 = coefs[eB1];
    const float coefB2 = coefs[eB2];
    float state0 = state[0];
    float state1 = state[1];
    for (int i = 0; i < nsamp; ++i)
    {
        // Declared inside the loop so each iteration gets its own constant
        const float out = input[i] * coefB0 + state0;
        state0 = input[i] * coefB1 + state1 - out * coefA0;
        state1 = input[i] * coefB2 - out * coefA1;
        output[i] = out;
    }
    state[0] = state0;
    state[1] = state1;
}
Listing 16: Refactored filter algorithm with load-update-store pattern and const keyword applied.
It is especially important to note that the local variable holding the current output sample is now declared as const float out inside the loop. Why did we do this? Because the value is constant over an iteration of the loop, but it does change between iterations. By declaring it const inside the loop body we remove the data dependency that a reused variable would carry between iterations and allow the loop to optimize more effectively.
As this change demonstrates, const is useful for manually breaking dependencies in DSP code. Variable re-use introduces unnecessary data dependencies in code, which can be avoided by using individual local const variables.
restrict
The restrict keyword tells the compiler that a specific pointer is not aliased, meaning that none of the memory locations accessed through the pointer are read or written through any other pointer or variable within its local scope. This keyword is very important when optimizing TI code that involves pointers, as all AAX algorithms do due to the nature of the algorithm context structure.
restrict was introduced with the C99 standard. AAX plug-ins use the AAX_RESTRICT keyword, which is a cross-platform macro for the C99 standard restrict.
- Note
- Now that MSVC has added C99 support to its compiler, AAX_RESTRICT will eventually be deprecated in favor of the restrict keyword.
The following example demonstrates the use of restrict in our filter code.
void
ProcessDirectFormII(
    const float * const AAX_RESTRICT input,
    float * const AAX_RESTRICT output,
    float * const AAX_RESTRICT state,
    const float * const AAX_RESTRICT coefs,
    int nsamp)
{
    const float coefA0 = coefs[eA0];
    const float coefA1 = coefs[eA1];
    const float coefB0 = coefs[eB0];
    const float coefB1 = coefs[eB1];
    const float coefB2 = coefs[eB2];
    float state0 = state[0];
    float state1 = state[1];
    for (int i = 0; i < nsamp; ++i)
    {
        // Local result; named "out" so it does not clash with the output pointer
        const float out = input[i] * coefB0 + state0;
        state0 = input[i] * coefB1 + state1 - out * coefA0;
        state1 = input[i] * coefB2 - out * coefA1;
        output[i] = out;
    }
    state[0] = state0;
    state[1] = state1;
}
Listing 17: Refactored filter algorithm with load-update-store pattern and const and restrict keywords applied.
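To experiment with restrict-qualified code outside of the SDK, a stand-in macro along the following lines can be used. This is a guess at how such a cross-platform macro could be written and is not the SDK's actual definition of AAX_RESTRICT; `SKETCH_RESTRICT` and `MixBuffers` are hypothetical names, and the TI branch assumes that the TI compiler accepts the plain restrict keyword.

```cpp
#include <cassert>

// Hypothetical stand-in macro; not the SDK's definition of AAX_RESTRICT.
#if defined(_TMS320C6X)
#define SKETCH_RESTRICT restrict      // assumed to be accepted by the TI compiler
#elif defined(_MSC_VER)
#define SKETCH_RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_RESTRICT __restrict__
#else
#define SKETCH_RESTRICT               // unknown compiler: no aliasing hint
#endif

// The caller promises that a, b, and out never overlap in memory.
inline void MixBuffers(const float* const SKETCH_RESTRICT a,
                       const float* const SKETCH_RESTRICT b,
                       float* const SKETCH_RESTRICT out,
                       const int nsamp)
{
    assert(nsamp >= 0);
    for (int i = 0; i < nsamp; ++i)
    {
        out[i] = a[i] + b[i];
    }
}
```

The qualifier is a promise made by the programmer. Passing overlapping buffers to a function like this results in undefined behavior, so only apply it where the buffers are known to be distinct.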
Keywords to avoid
There are some keywords which do more harm than good, but are still being used either due to legacy code or developer superstitions. These keywords should not be used in AAX plug-ins.
Data types
The TI C672x+ is a 32-bit floating point DSP platform, and has a few peculiarities that you should be aware of.
-
Use int instead of long
Integers of type long int are 40 bits wide on TI, and are very inefficient. Always use the int data type (or, even better, the C99-standard int32_t) instead.
-
Use float instead of double
Double-precision floating-point data types have a significant performance penalty on TI processors. Use float instead of double wherever possible, as long as this substitution does not affect signal integrity or cancellation.
-
Use unsigned values when referencing memory
In general, explicitly typed pointers should always be used to reference memory. If you do have need of a generic memory representation, use an unsigned integer to avoid implicit conversion costs.
Unintended data type conversions
When developing for the TI platform it is important to keep an eye out for unintended type conversions, and especially for implicit double-precision instructions. The following points are helpful for both program efficiency and for future maintenance of the code, since they clarify the developer's understanding of how the code should operate, e.g. by specifying that a cast is occurring, and make it obvious that steps such as data type conversions are an intentional part of the algorithm.
-
Explicitly declare constants as single-precision. For example, use 0.0f instead of 0.0. Often a compiler will be able to do this automatically at compile time, but it is better to be explicit with your intended precision.
-
If any casts are required in your code, make them explicit. For example, float output = (float)doubleVar as opposed to float output = doubleVar.
-
Use single-precision math.h functions (such as fabsf()) instead of the double-precision equivalents (fabs()).
-
Do not directly reference memory addresses using integer data types; instead, use a pointer data type. If an integer data type is required, use an unsigned 32-bit type.
To help ensure that you are not violating these principles, always be aware of any warnings generated by the compiler. In particular, do not ignore warnings related to "implicit conversion from 'double' to 'float'" or "implicit conversion from 'double' to 'int'"; these warnings may indicate that you are declaring a double when a float would be just as good.
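The effect of an unsuffixed constant can be checked at compile time. The following sketch uses standard C++11 type traits to show that multiplying a float by 0.5 yields a double expression, while multiplying by 0.5f stays in single precision. `HalfMagnitude` is a hypothetical helper written for this example.

```cpp
#include <cassert>
#include <math.h>
#include <type_traits>

inline float HalfMagnitude(const float x)
{
    // A double literal silently promotes the whole expression to double...
    static_assert(std::is_same<decltype(x * 0.5), double>::value,
                  "float * double literal is evaluated as double");
    // ...while a float literal keeps it in single precision.
    static_assert(std::is_same<decltype(x * 0.5f), float>::value,
                  "float * float literal stays float");

    return fabsf(x) * 0.5f; // single-precision library call and constant
}
```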
In the final stages of optimization, examine the generated assembly code to make sure there are no unintended double-precision instructions or memory accesses.
Additional data type optimizations
The AAX SDK includes cross-platform macros that can be used to convert two single-precision float loads to one double-precision load. The coefficient smoothing case study below includes an example use case for these macros.
const float * pTable = &SmoothCoefTable[address];
float firstCoef = AAX_LO(*pTable);
float secondCoef = AAX_HI(*pTable);
Listing 18: Example of using AAX macros for converting two float loads to one double load.
In this example the AAX_ALIGNMENT_HINT macro checks whether data is aligned on an 8-byte boundary, then the double word is loaded, and finally the AAX_LO and AAX_HI macros get the double word's first and second (float) parts.
If SmoothCoefTable consists of floats and is 8-byte aligned, then this scenario will work fine for loads when address is even. This raises the question of how to load a double word from &SmoothCoefTable[address] when address is odd. This kind of optimization is most useful for loading data from external memory, where the CPU savings of a single double word load versus two 32-bit loads is greatest, so one trick that can help is to trade memory (external memory is plentiful) for performance. Specifically, SmoothCoefTable can be organized in such a way that every member of the table, except the first and the last, appears in two consecutive entries.
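const int32_t size = 4;
// Original table
const float SmoothCoefTable[size] = {
    -0.1f, -0.2f, -0.3f, -0.4f
};
// Restructured version of the same table (replaces the definition above):
// each row is a pair of neighboring members of the original table
const float SmoothCoefTable[size*2 - 2] = {
    -0.1f, -0.2f,
    -0.2f, -0.3f,
    -0.3f, -0.4f
};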
Listing 19: Example of restructuring the table so that it can be easily used in the optimization scenario given above.
In this case the number of loads will be halved at the cost of doubling the size of the table. If the table is located in external memory then the additional memory requirement can be an excellent trade-off for the performance gained.
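The restructuring described above can be generated from the original table, for example when the table is built outside of the real-time algorithm. The sketch below is hypothetical and does not use the AAX macros: `BuildPairedTable` produces the duplicated layout, and `InterpolateFromPairs` shows that neighboring members k and k + 1 always start at an even index, which is the property that a double word load relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the duplicated layout: pair k holds table[k] and table[k + 1].
inline std::vector<float> BuildPairedTable(const std::vector<float>& table)
{
    std::vector<float> paired;
    if (table.size() < 2)
    {
        return paired;
    }
    paired.reserve(table.size() * 2 - 2);
    for (std::size_t k = 0; k + 1 < table.size(); ++k)
    {
        paired.push_back(table[k]);
        paired.push_back(table[k + 1]);
    }
    return paired;
}

// Linear interpolation between original members k and k + 1 using one pair.
inline float InterpolateFromPairs(const std::vector<float>& paired,
                                  const std::size_t k, const float frac)
{
    assert(2 * k + 1 < paired.size());
    const float* const pair = &paired[2 * k]; // always an even index
    const float lo = pair[0];
    const float hi = pair[1];
    return lo + frac * (hi - lo);
}
```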
Case study: Efficient parameter smoothing at single and double precision
Coefficient smoothing ("de-zippering") can often be one of the most difficult parts of a plug-in to optimize for real-time operation. This is especially true in cases when full double-precision smoothing filters have been used in a plug-in's Native code, with the possibility of very small coefficients. In these cases it can be difficult to optimize the smoothing code while also satisfying requirements for audio data parity between the plug-in's Native and DSP configurations.
double * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip[ch][0];
const double * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf[0];
for (int i = 0; i < eNumBiquads * eNumCoefs; ++i)
{
    double dz = deZipper[i];
    dz += zeroCoef * (coefs[i] - deZipper[i]);
    deZipper[i] = dz; // store the smoothed value back into the state
}
Listing 20: Example of double-precision smoothing.
In this section we will describe three specific approaches that may be taken to perform optimized real-time smoothing without compromising sound quality.
Method 1: Clamped single-precision smoothing
The simplest approach for optimization of a double-precision smoothing filter is to replace it with modified single-precision smoothing. Unfortunately, we have found that this approach can lead to glitches and instability at higher sample rates when adjusting controls due to transient inaccuracies in the smoothing.
float * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip[ch][0];
const float * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf[0];
for (int i = 0; i < eNumBiquads * eNumCoefs; ++i)
{
    float dz = deZipper[i];
    dz += zeroCoef * (coefs[i] - deZipper[i]);
    // Clamp: once the smoothed value stops changing, snap to the target.
    deZipper[i] = (dz == deZipper[i]) ? coefs[i] : dz;
}
Listing 21: Example of clamped single-precision smoothing.
Method 2: Mixed-precision smoothing
To resolve the stability issues at high sample rates, the state may be accumulated at double-precision. This results in mixed-precision operations that are much faster on TI DSPs than full double-precision calculations, though still slower than single-precision.
float * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip [ch ][0];
double * const AAX_RESTRICT deZipState = dzCoefsP->mDZState [ch][0];
const float * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf [0];
# pragma UNROLL ( CBiquad::eNumCoefs )
for(int i = 0; i < eNumBiquads * eNumCoefs ; i ++)
{
double dz = deZipState [i];
dz += zeroCoef * ((coefs [i]) - ( deZipper [i]));
deZipState [i] = dz;
deZipper [i] = float (dz);
}
Listing 22: Example of mixed-precision smoothing.
Method 3: Loop unrolling and double-word memory accesses
Further performance gains can be made by unrolling the loop and using double word memory accesses. This code is faster, but is still not as fast as full single-precision.
float * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip [ch][0];
double * const AAX_RESTRICT deZipState = dzCoefsP->mDZState [ch][0];
const float * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf [0];
for (int i = 0; i < eNumBiquads * eNumCoefs; i += 2)
{
double dz0 = deZipState [i];
double dz1 = deZipState [i+1];
dz0 += zeroCoef * (
AAX_LO ( coefs [i]) -
AAX_LO ( deZipper [i]));
dz1 += zeroCoef * (
AAX_HI ( coefs [i]) -
AAX_HI ( deZipper [i]));
deZipState [i] = dz0;
deZipper [i] = float (dz0);
deZipState [i+1] = dz1;
deZipper [i+1] = float (dz1);
}
Listing 23: Example of loop unrolling and double-word memory accesses for smoothing optimization.
Coefficient smoothing example summary
-
Full single-precision smoothing (method 1) is an excellent and simple solution for gain coefficients and other scalar values which are not extremely sensitive to coefficient quantization at small values. This method does not always reach the target value, so clamping should be used to ensure signal integrity.
-
Mixed-precision smoothing (method 2) uses slightly more CPU, but gives full double precision accuracy. This approach should generally be used for EQs and other sensitive coefficients.
-
Further low-level optimizations are also possible via manual loop unrolling and double-word memory access (method 3).
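The difference between these methods can be reproduced on a host machine. The sketch below is an illustration under stated assumptions rather than code from EQ3 or Dyn3: the function names are hypothetical, the state starts at zero, and a single scalar coefficient is smoothed instead of an array.

```cpp
#include <cassert>

// Method 1 without the clamp: every operation is single precision.
float SmoothSingle(float target, float zeroCoef, int numSteps)
{
    assert(numSteps >= 0);
    float dz = 0.0f;
    for (int n = 0; n < numSteps; ++n)
    {
        dz += zeroCoef * (target - dz);
    }
    return dz;
}

// Method 1 with the clamp: snap to the target once the state stops changing.
float SmoothSingleClamped(float target, float zeroCoef, int numSteps)
{
    assert(numSteps >= 0);
    float state = 0.0f;
    for (int n = 0; n < numSteps; ++n)
    {
        float dz = state;
        dz += zeroCoef * (target - state);
        state = (dz == state) ? target : dz;
    }
    return state;
}

// Method 2: double-precision accumulator, single-precision output.
float SmoothMixed(float target, double zeroCoef, int numSteps)
{
    assert(numSteps >= 0);
    double state = 0.0;
    float out = 0.0f;
    for (int n = 0; n < numSteps; ++n)
    {
        state += zeroCoef * (target - out);
        out = static_cast<float>(state);
    }
    return out;
}
```

On a typical IEEE 754 host, smoothing toward 1.0 with a zeroCoef of 0.001 leaves the unclamped single-precision state slightly below the target, because the increment eventually becomes smaller than half the spacing between adjacent float values. The clamped and mixed-precision versions both arrive at the target exactly.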
Refactoring conditionals and branches
- Note
- For more detailed information on how to reduce or eliminate the use of branches in algorithms, see section 5.2 of the Hand-Tuning Loops and Control Code on the TMS320C6000 guide provided by TI.
An important technique in refactoring algorithms to enhance loop performance is to reduce or eliminate conditionals and branches in code. The TI compiler focuses much of its optimization effort on keeping the pipeline full inside loops. However, it cannot pipeline a loop if any of the following is true:
-
The loop contains a branch
-
The loop contains a function call
-
The loop is too long
To demonstrate this, we will again begin with an unoptimized example:
for ( int i = 0; i < numSamples ; ++i)
{
if (! bypass )
{
const float filtOutput1 = input [i] * coef0 + state0 * coef1 ;
const float filtOutput2 = filtOutput1 * coef2 + state1 * coef3 ;
output [i] = filtOutput2 ;
}
else
{
output [i] = input [i];
}
}
Listing 24: Another unoptimized filter algorithm.
Though trivial, this example illustrates the problem with conditionals inside of loops. In TI assembly, conditional code usually translates into code branches, which prevent loops from pipelining effectively (see Understanding CGTools-generated ASM files). Let's refactor the loop in our example to reduce the size of its conditional branch:
for (int i = 0; i < numSamples ; ++i)
{
const float filtOutput1 = input [i] * coef0 + state0 * coef1 ;
const float filtOutput2 = filtOutput1 * coef2 + state1 * coef3 ;
output [i] = filtOutput2 ;
if ( bypass )
{
output [i] = input [i];
}
}
Listing 25: Filter algorithm with a refactored conditional branch.
At first, it may seem wasteful to perform the filter calculation if bypass will simply throw away the result. In reality, however, the opposite is true: as a real-time algorithm, this code is constrained by its maximum, worst-case cycle count. It is important to understand this point: essentially, the cycle count of the plug-in is always its worst-case performance.
By reducing the algorithm's maximum cycle count we are therefore reducing waste, even though we are increasing the plug-in's cycle count when it is bypassed. In fact, the ideal scenario for most algorithms is to use only one code path (and, consequently, a single deterministic cycle count) despite the fact that this can result in worse performance for some specific states. To state this fundamental principle in a different way:
The performance of specific states in an AAX DSP algorithm is not relevant if there is another possible state with worse performance.
Going back to our optimized example, you may also notice that the conditional still exists. Doesn't this create a branch in the assembly code as well and prevent pipelining?
In the case of very brief conditionals such as this, the answer is usually no. On TI processors, most instructions can be executed conditionally, depending on the value of a control register. Thus, the single assignment (output = input) inside this conditional will reduce to a few conditional instructions without having to execute a branch. As a result, the TI compiler will be able to efficiently pipeline this loop.
That said, it is occasionally necessary to eliminate conditionals entirely. One effective solution for these situations is to execute the branched logic algorithmically rather than conditionally. To demonstrate this approach, here is our filter example again, this time with the conditional completely eliminated from the loop:
for (int i = 0; i < numSamples ; ++i)
{
const float filtOutput1 = input [i] * coef0 + state0 * coef1 ;
const float filtOutput2 = filtOutput1 * coef2 + state1 * coef3 ;
output [i] = (! bypass ) * filtOutput2 + bypass * input [i];
}
Listing 26: Filter algorithm with branching logic executed algorithmically.
This code is shorter and completely eliminates the conditional from inside the loop body. However, there is an associated cost in readability, in that it is not initially obvious how exactly bypass affects the output. This is of course a tradeoff that you will need to consider on a case-by-case basis. In general, we encourage you to consider this technique only when you have verified in the assembly code that simply reducing the size of the conditional is not enough to achieve effective instruction pipelining.
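When refactoring a conditional in this way it is worth confirming that every form produces identical output. The following host-side sketch places the three forms side by side; FilterSample is a hypothetical stand-in for the filter body, and the arithmetic form assumes finite sample values (a zero weight multiplied by an infinity or NaN would not yield zero).

```cpp
#include <cassert>

// Stand-in for the filter body; the 0.5 gain is illustrative only.
inline float FilterSample(float in) { return in * 0.5f; }

// Form 1: large if/else inside the loop.
void ProcessBranching(const float* input, float* output, int numSamples, bool bypass)
{
    assert(numSamples >= 0);
    for (int i = 0; i < numSamples; ++i)
    {
        if (!bypass) { output[i] = FilterSample(input[i]); }
        else         { output[i] = input[i]; }
    }
}

// Form 2: always filter, then conditionally overwrite with a single assignment.
void ProcessOverwrite(const float* input, float* output, int numSamples, bool bypass)
{
    assert(numSamples >= 0);
    for (int i = 0; i < numSamples; ++i)
    {
        output[i] = FilterSample(input[i]);
        if (bypass) { output[i] = input[i]; }
    }
}

// Form 3: select arithmetically, with the weights computed outside the loop.
void ProcessArithmetic(const float* input, float* output, int numSamples, bool bypass)
{
    assert(numSamples >= 0);
    const float wet = bypass ? 0.0f : 1.0f;
    const float dry = 1.0f - wet;
    for (int i = 0; i < numSamples; ++i)
    {
        output[i] = wet * FilterSample(input[i]) + dry * input[i];
    }
}
```

Whether form 2 or form 3 actually pipelines better can only be determined from the compiler's assembly output for the real algorithm.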
Another useful technique for optimizing loops is to use pragma MUST_ITERATE and pragma PROB_ITERATE (see more about these pragmas in Loop controls), which help the compiler determine the number of iterations for the loop. This is extremely useful when you know the exact number of iterations and this number never changes during plug-in processing. For example, this applies to loops that iterate through the audio samples in the input and output buffers. The number of input samples is always constant for an AAX DSP plug-in algorithm; the buffer length must be described with the AAX_eProperty_DSP_AudioBufferLength property for each DSP component in the plug-in's description.
The following code example shows an algorithm processing function template. For convenience, this function template takes the audio buffer length as a template parameter:
template<int kAudioWindowSize>
void Example_AlgorithmProcessFunction( SExample_Alg_Context * const inInstancesBegin [], const void * inInstancesEnd)
{
for (SExample_Alg_Context * const * walk = inInstancesBegin; walk != inInstancesEnd; ++walk)
{
SExample_Alg_Context* const AAX_RESTRICT contextP = *walk;
const float * const AAX_RESTRICT inputP = contextP->mInputPP;
float * const AAX_RESTRICT outputP = contextP->mOutputPP;
#pragma MUST_ITERATE( kAudioWindowSize, kAudioWindowSize, kAudioWindowSize )
for (int32_t i = 0; i < kAudioWindowSize; ++i)
{
outputP[i] = inputP[i];
}
}
}
Listing 27: Optimizing loop using pragma MUST_ITERATE.
Note that the audio buffer length property takes an AAX_EAudioBufferLengthDSP value. The values of this enum are set to the power-of-two exponent for each buffer length, so in this case the kAudioWindowSize value would be set to match 1 << n, where n is the value given for AAX_eProperty_DSP_AudioBufferLength, when compiling this algorithm callback into the TI DLL.
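As a minimal sketch of that relationship, assuming the property value is the base-2 exponent of the buffer length as described above: the enum below is a hypothetical local stand-in rather than the SDK's AAX_EAudioBufferLengthDSP, and BufferLengthInSamples and CopyWindow are illustrative names only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for exponent-style buffer length values (assumed, not SDK values).
enum ExampleBufferLengthExponent : int32_t
{
    kExampleExponent_4  = 2,
    kExampleExponent_16 = 4,
    kExampleExponent_64 = 6
};

// Converts an exponent-style value to a sample count at compile time.
constexpr int32_t BufferLengthInSamples(int32_t exponent)
{
    return int32_t(1) << exponent;
}

static_assert(BufferLengthInSamples(kExampleExponent_4) == 4, "exponent 2 is 4 samples");
static_assert(BufferLengthInSamples(kExampleExponent_16) == 16, "exponent 4 is 16 samples");

// The sample count is a template parameter, so the trip count is a compile-time constant.
template <int kAudioWindowSize>
int32_t CopyWindow(const float* in, float* out)
{
    assert(in != nullptr && out != nullptr);
    for (int32_t i = 0; i < kAudioWindowSize; ++i)
    {
        out[i] = in[i];
    }
    return kAudioWindowSize;
}
```

Deriving the template argument from the same constant that is passed to the description keeps the two values from drifting apart.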
The same optimization can be used for the loops that iterate through input/output channels, as demonstrated by the DemoDist example plug-in.
Case study: pipeline refactoring in Avid's EQ3 and Dyn3 plug-ins
While optimizing the "stock" Pro Tools equalization and dynamics processors we came across many real-world optimization scenarios that will be applicable to a broad variety of plug-ins. In this section we will consider specific techniques that we used to enable software pipelining of these algorithms by the TI compiler, including an in-depth look at the pseudo-speculative execution approach used in our Dyn3 plug-in's polynomial gain calculation loop.
Move individual processing operations into separate loops
Oftentimes a sample-by-sample iterative loop that is not software pipelining can be broken up into individual loops that incrementally apply changes to the audio buffer. These smaller loops have a much better chance of being successfully pipelined by the compiler. In EQ3, moving our biquad audio processing stages to dedicated loops that do not include coefficient smoothing or other tasks resulted in large performance gains.
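The following sketch illustrates this kind of loop splitting on a deliberately small case: a smoothed gain applied to a buffer. It is not the EQ3 code; the function names and the gainRamp scratch buffer are assumptions made for the example.

```cpp
#include <cassert>

// Single loop: smoothing and audio processing are interleaved.
// Returns the final smoothed gain.
float ProcessCombined(const float* in, float* out, int numSamples,
                      float gain, float target, float zeroCoef)
{
    assert(numSamples >= 0);
    for (int i = 0; i < numSamples; ++i)
    {
        gain += zeroCoef * (target - gain);
        out[i] = in[i] * gain;
    }
    return gain;
}

// Split loops: the first loop only smooths, the second only processes audio.
// gainRamp is caller-provided scratch space of at least numSamples floats.
float ProcessSplit(const float* in, float* out, float* gainRamp, int numSamples,
                   float gain, float target, float zeroCoef)
{
    assert(numSamples >= 0);
    for (int i = 0; i < numSamples; ++i)
    {
        gain += zeroCoef * (target - gain);
        gainRamp[i] = gain;
    }
    for (int i = 0; i < numSamples; ++i)
    {
        out[i] = in[i] * gainRamp[i];
    }
    return gain;
}
```

The second loop in ProcessSplit has no dependency between iterations, which is the property that gives the compiler the best chance of pipelining it. The cost is the extra scratch buffer and an additional pass over memory.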
Avoid pipeline dependencies
The goal of the above optimization is to allow the compiler to successfully pipeline each iterative loop. However, even a pipelined loop may be optimized further. One of the best ways of optimizing loops is to keep the processor busy while pipeline dependencies are cleared.
For example, in EQ3 we found that it was better to perform the plug-in's input and output meter calculations in the same loop rather than separating them out into individual loops. This is because each meter calculation has a dependency on its previous value, which puts a dependency in the pipeline. Doing both at the same time gives the processor more to do while waiting for the next value. In Dyn3 we had similar results merging table lookup, attack, and release loops into a single iterative loop. As long as the loop is still successfully pipelined by the compiler, these "larger" loops tended to have much better performance due to the reduction in blocking dependencies.
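The meter case can be sketched as follows. This is a simplified illustration, not the EQ3 implementation: the peak-with-decay meter formula, the MeterPair struct, and the function names are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// One meter per loop: each iteration must wait for the previous meter value.
float UpdateMeter(const float* samples, int numSamples, float decay, float meter)
{
    assert(numSamples >= 0);
    for (int i = 0; i < numSamples; ++i)
    {
        meter = std::max(std::fabs(samples[i]), meter * decay);
    }
    return meter;
}

struct MeterPair
{
    float input;
    float output;
};

// Both meters in one loop: the two dependency chains are independent of each
// other, so work on one can be scheduled while the other is waiting.
MeterPair UpdateMetersMerged(const float* input, const float* output, int numSamples,
                             float decay, MeterPair meters)
{
    assert(numSamples >= 0);
    for (int i = 0; i < numSamples; ++i)
    {
        meters.input  = std::max(std::fabs(input[i]),  meters.input  * decay);
        meters.output = std::max(std::fabs(output[i]), meters.output * decay);
    }
    return meters;
}
```

Both versions compute the same values; only the amount of independent work available within each loop iteration differs.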
Detailed example of loop optimization in Dyn3
At this point it will be helpful to go into greater detail about our optimizations for Dyn3's polynomial gain calculation loop, because the increase in performance was quite large and is fairly representative of other algorithms. The unoptimized code took 43 cycles to execute one iteration of the loop. After rearranging the code it now takes 6 cycles. The basic problem was numerous pipeline dependencies: the Loop Carried Dependency Bound was 42 cycles, yet the Partitioned Resource Bound was 4 cycles. In other words, if all of these dependencies were removed the loop could potentially execute in 4 cycles.
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 199
;* Loop opening brace source line : 200
;* Loop closing brace source line : 213
;* Known Minimum Trip Count : 4
;* Loop Carried Dependency Bound (^) : 42
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound (*) : 4
;*
;* Searching for software pipeline schedule at ...
;* ii = 42 Did not find schedule
;* ii = 43 Schedule found with 1 iterations in parallel
;* Done
for (int i = 0; i < kAudioWindowSize; i++)
{
    const float * smoothCoeffs = stateP->mSmoothedPoly;
    float logEnv = logEnvArray[i];
    logEnv -= smoothThrLow;
    if (logEnv >= 0.0f)
        smoothCoeffs += eCpdPolyOrder;
    if (logEnv >= 0.0f)
        logEnv -= smoothThrLowDelta;
    if (logEnv >= 0.0f)
        smoothCoeffs += eCpdPolyOrder;
    const float filteredLogEnv = smoothCoeffs[eCpdPolyCoeffsC] +
        logEnv * (smoothCoeffs[eCpdPolyCoeffsB] +
        smoothCoeffs[eCpdPolyCoeffsA] * logEnv);
    filtLogEnvArray[i] = filteredLogEnv + smoothedMakeupGain;
}
Listing 28: Dyn3's unoptimized polynomial gain calculation loop and asm listing.
This loop contains the following dependencies:
-
logEnv -= smoothThrLow depends on the result of logEnvArray[i]
-
if(logEnv >= 0.0f) depends on the result of logEnv -= smoothThrLow
-
logEnv -= smoothThrLowDelta depends on the result of logEnv -= smoothThrLow
-
Third if(logEnv >= 0.0f) depends on the result of logEnv -= smoothThrLowDelta
-
Second smoothCoeffs += eCpdPolyOrder depends on the result of the first smoothCoeffs += eCpdPolyOrder
-
logEnv*smoothCoeffs[eCpdPolyCoeffsB] depends on the result of logEnv -= smoothThrLowDelta
-
smoothCoeffs[eCpdPolyCoeffs], etc. depend on the result of the second smoothCoeffs += eCpdPolyOrder
-
filteredLogEnv+smoothedMakeupGain depends on the result of filteredLogEnv = smoothCoeffs[eCpdPolyCoeffsC]
-
filtLogEnvArray[i] depends on the result of filteredLogEnv + smoothedMakeupGain
This list probably does not cover every case, but it illustrates the problem. The bottom line is that there is no way this loop can pipeline well. In contrast, here is the optimized code and listing file output once these dependencies have been removed:
;* Loop opening brace source line : 167
;* Loop closing brace source line : 179
;* Known Minimum Trip Count : 4
;* Loop Carried Dependency Bound (^) : 1
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound (*) : 4
;* ii = 6 Schedule found with 5 iterations in parallel
for (int i =0; i< cProcessingBlockSize ; i++)
{
float logEnv = logEnvArray [i];
float logEnvThrHi = logEnv - smoothThrHigh ;
const float gainSlope = smoothThrSlope +
logEnv * smoothSlope ;
const float gainKnee = smoothKneeC +
logEnvThrHi *( smoothKneeB +
smoothKneeA * logEnvThrHi );
const bool bKnee = ( logEnv > smoothThrLow );
const bool bSlope = ( logEnv > smoothThrHigh );
float filteredLogEnv = bKnee ? gainKnee : 0.0f;
filteredLogEnv = bSlope ? gainSlope : filteredLogEnv ;
filtLogEnvArray [i] = filteredLogEnv ;
}
Listing 29: Dyn3's optimized polynomial gain calculation loop and asm listing
In this case gainSlope is only dependent on the loading of logEnv, so that calculation can begin almost immediately. gainKnee must wait for logEnvThrHi, but gainSlope can be calculated during that time. bKnee and bSlope are also only dependent on logEnv, and can start right away. The main dependency is filteredLogEnv, which is dependent on bKnee and gainKnee and then on bSlope and gainSlope. Overall, this version has far fewer dependencies. Here is another version which runs in exactly the same number of cycles. (In fact, under the hood it may be creating the same asm code; we have not compared instruction-by-instruction.)
for (int i =0; i< kAudioWindowSize ; i++)
{
float logEnv = logEnvArray [i];
float logEnvThrHi = logEnv - smoothThrHigh ;
const bool bKnee = ( logEnv > thrLow );
const bool bSlope = ( logEnv > thrHigh );
float filteredLogEnv = bKnee ?
kneeC + logEnvThrHi *( kneeB + kneeA * logEnvThrHi ) :
0.0f;
filteredLogEnv = bSlope ?
thrSlope + logEnv * slope :
filteredLogEnv ;
filtLogEnvArray [i] = filteredLogEnv ;
}
Listing 30: An alternative optimization for Dyn3's polynomial gain calculation loop.
But what about Native?
You might expect this altered code to execute well on a TI DSP but poorly on x86. However, keep in mind that a large degree of speculative execution is used on Intel's processors. This means that pipeline dependencies due to conditionals can be broken because multiple paths are executed. In these cases, only one of the results is used and the others are thrown away. In other words, if you saw pseudo code showing the literal execution of the unoptimized code above on Intel then it would probably look a lot like the optimized code. The lesson? For TI it is important to rearrange your code so that essentially it implements speculative execution as much as possible, and if applied correctly this optimization should not negatively impact your plug-in's native performance.
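The "compute every path, then select" pattern can be checked against a conventional branching implementation on the host. The sketch below is loosely modeled on the listings above but is not the Dyn3 code: CurveParams, GainBranching, and GainSelect are hypothetical names, and the curve is reduced to a single function of one sample.

```cpp
#include <cassert>

struct CurveParams
{
    float thrLow, thrHigh;       // thresholds, with thrLow <= thrHigh
    float slopeOffset, slope;    // linear segment above thrHigh
    float kneeA, kneeB, kneeC;   // quadratic knee between the thresholds
};

// Conventional form: only the selected segment is evaluated.
float GainBranching(const CurveParams& p, float logEnv)
{
    assert(p.thrLow <= p.thrHigh);
    if (logEnv > p.thrHigh)
    {
        return p.slopeOffset + logEnv * p.slope;
    }
    if (logEnv > p.thrLow)
    {
        const float x = logEnv - p.thrHigh;
        return p.kneeC + x * (p.kneeB + p.kneeA * x);
    }
    return 0.0f;
}

// Speculative form: every segment is evaluated, then one result is selected.
float GainSelect(const CurveParams& p, float logEnv)
{
    assert(p.thrLow <= p.thrHigh);
    const float x = logEnv - p.thrHigh;
    const float gainSlope = p.slopeOffset + logEnv * p.slope;
    const float gainKnee = p.kneeC + x * (p.kneeB + p.kneeA * x);
    float result = (logEnv > p.thrLow) ? gainKnee : 0.0f;
    result = (logEnv > p.thrHigh) ? gainSlope : result;
    return result;
}
```

A comparison like this is a cheap way to confirm that a restructured loop body is still a drop-in replacement before measuring it on the DSP.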
Case study: Additional optimization lessons from EQ3 and Dyn3
The pipeline optimization described above is only one of the techniques we used; the following techniques also helped us achieve many-fold increases in performance. Note that many of these techniques are discussed in greater detail in the sections above.
Watch the assembly listing
In the process of optimizing these plug-ins we found their asm listing files very helpful, especially the Loop Carried Dependency Bound and the Partitioned Resource Bound information. The listing file shows how many cycles the code is taking to execute, and we could make an estimate of how far away we were from the optimal implementation by seeing how well the pipeline is being utilized.
Divide processing tasks over multiple calls
In the old RTAS version of EQ3 the coefficients were updated (smoothed) every 8 samples. Initially, this was changed to every 4 samples in the AAX version in order to easily work with 4-sample blocks on HDX. However, we were able to achieve better results by adding "ping pong" logic that alternates between smoothing the first and second half of the coefficients on each pass. To make this work in our odd-banded EQ we had to pad the smoothing coefficients by one biquad's worth to make an even number of biquads, but regardless of this inefficiency we still achieved performance gains.
Eliminate branches that block pipelining
Eliminating large conditional branches is critical to optimal performance on TI. This can be an especially tempting pitfall for developers who are used to coding only for x86 processors.
Consider the "ping pong" optimization described above. This logic does not break pipelining because the conditional logic that checks the state of the flag does not result in a large branch; once the ping pong value is set, the exact same logic operates in every processing callback. If instead we used an if statement to determine which "side" should execute, this would prevent pipelining optimizations and would seriously impact performance.
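The actual EQ3 ping pong logic is not reproduced in this document. One way to structure it, shown here purely as an assumption, is to turn the flag into an index offset so that the loop body is identical on every call. SmoothHalf is a hypothetical name.

```cpp
#include <cassert>

// Smooths one half of the coefficient array per call.
// pingPong selects the half (0 or 1); the return value is the half to use next.
int SmoothHalf(float* deZipper, const float* coefs, int numCoefs,
               float zeroCoef, int pingPong)
{
    assert(numCoefs % 2 == 0);               // padded to an even count
    assert(pingPong == 0 || pingPong == 1);
    const int halfSize = numCoefs / 2;
    const int offset = pingPong * halfSize;  // index arithmetic instead of an if statement
    for (int i = 0; i < halfSize; ++i)
    {
        const int k = offset + i;
        deZipper[k] += zeroCoef * (coefs[k] - deZipper[k]);
    }
    return pingPong ^ 1;
}
```

Because each coefficient is now updated on every second call, the smoothing coefficient generally needs to be adjusted to keep the same overall time constant.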
Remove double-precision operations where they are not required
Here is some coefficient smoothing code from our pre-optimization EQ3 algorithm. This code was embedded in the inner biquad processing loop:
# pragma UNROLL ( CBiquad::eNumCoefs )
for (int k = 0; k < CBiquad::eNumCoefs; ++k)
{
double &dz = deZipper[k];
step[k] = zeroCoef * ( coefs[k] - dz);
}
# pragma UNROLL ( CBiquad::eNumCoefs )
for(int k = 0; k < CBiquad::eNumCoefs; ++k)
{
double nm1_dz = deZipper[k];
nm1_dz += step[k];
biquadCoefs[k] = static_cast< float > ( nm1_dz );
deZipper[k] = nm1_dz ;
}
Listing 31: Unoptimized coefficient smoothing in EQ3
To optimize this code, we converted the logic to use single-precision de-zipper values. However, this resulted in a sonic difference due to the fact that the smoothed coefficients would not necessarily ramp all the way to the correct target value. To solve that we added a conditional "clamp" that halts the smoothing once there is no difference between the 32-bit smoothed value and the target value. On examination of the assembler output, we found that this conditional pipelines very well.
# pragma UNROLL ( CBiquad::eNumCoefs )
for(int i = 0; i < (cMaxNumBiquadsWithPad / 2) * CBiquad::eNumCoefs; ++i)
{
float dz = deZipper[i];
dz += zeroCoef * ( coefs[i] - deZipper[i]);
deZipper[i] = (dz == deZipper[i]) ? coefs[i] : dz;
}
Listing 32: Optimized coefficient smoothing in EQ3
Make coefficients contiguous
We were able to achieve significant performance gains in iterative loops like the smoothing code shown above by ensuring that all of the coefficients that would be accessed by the loop are contiguous in memory. In addition, note that in the optimized code there is only one loop, which iterates NumBiquads*NumCoefs times. This optimization is possible due to the fact that each filter's coefficients are contiguous in the coefs array.
Use AAX_RESTRICT wherever applicable
We have found that the restrict keyword is vital for optimal performance on TI DSPs. For example, the parameter smoothing logic in our Dyn3 plug-in was reduced from 18 cycles to 3 cycles per loop iteration simply by the addition of this keyword to the applicable pointer variables.
For more information about the restrict keyword, see restrict.
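As a host-compilable illustration of where the qualifier goes, the sketch below uses EXAMPLE_RESTRICT, a hypothetical macro defined here only so that the example builds with common desktop compilers. It is not the SDK's definition of AAX_RESTRICT, and SmoothCoefs is an illustrative function name.

```cpp
#include <cassert>

// Hypothetical stand-in for AAX_RESTRICT so that this sketch is self-contained.
#if defined(_MSC_VER)
#define EXAMPLE_RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_RESTRICT __restrict__
#else
#define EXAMPLE_RESTRICT
#endif

// The qualifier promises the compiler that deZipper and coefs never alias,
// so loads from coefs may be scheduled ahead of earlier stores to deZipper.
void SmoothCoefs(float * const EXAMPLE_RESTRICT deZipper,
                 const float * const EXAMPLE_RESTRICT coefs,
                 int numCoefs, float zeroCoef)
{
    assert(deZipper != coefs); // the no-alias promise must actually hold
    for (int i = 0; i < numCoefs; ++i)
    {
        deZipper[i] += zeroCoef * (coefs[i] - deZipper[i]);
    }
}
```

The qualifier is a promise rather than a check: if the two pointers do overlap, the behavior is undefined, so it should only be added where the buffers are known to be distinct.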
Be aware of shell overhead
In the TI Shell there is code that loops through every buffered coefficient FIFO before every sample buffer in order to swap the algorithm's context field pointers to a new set of coefficients if one is available. This uses a nominal number of cycles per buffered port, which can add up very quickly in small plug-ins.
For example, before our optimizations EQ3 used eight individual buffered coefficient blocks. On investigation, we found that the shell overhead from managing these buffers added up to be roughly equivalent to the algorithm's total processing cycles! To work around this we merged the 8 coefficient blocks into one large block. The trade-off of this optimization is that more work must be done on the host to re-generate and copy the whole coefficient state every time any parameter changes, so this is an optimization that should be applied only when appropriate for the individual plug-in.
Watch for opportunities to merge or eliminate operations
Keep an eye out for unnecessary processing stages performed by your algorithm. Gain stages, phase toggles, and "dummy" coefficients are particularly good candidates for this kind of optimization. For example:
-
In our EQ3 plug-in, we found that we could achieve significant performance improvement by merging the plug-in's input and output gain stages with the overall gain of the first and last biquads. As a side benefit, this reduced the total quantization noise in the algorithm.
-
In our Dyn3 plug-in, we found that we were applying smoothing logic to filter coefficients that would always be zero.
-
When we looked more closely at Dyn3 we found that we were also computing and discarding sidechain filter information for the LFE, which is not part of the sidechain.
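The gain-merging idea from the first item can be sketched with a generic Direct Form I biquad. This is an illustration under stated assumptions, not the EQ3 implementation: the structs and function names are hypothetical, and the gain is assumed to be static or already smoothed.

```cpp
#include <cassert>

struct BiquadCoefs { float b0, b1, b2, a1, a2; };
struct BiquadState { float x1, x2, y1, y2; };

// Direct Form I biquad, one sample per call.
float BiquadTick(const BiquadCoefs& c, BiquadState& s, float x)
{
    const float y = c.b0 * x + c.b1 * s.x1 + c.b2 * s.x2
                  - c.a1 * s.y1 - c.a2 * s.y2;
    s.x2 = s.x1; s.x1 = x;
    s.y2 = s.y1; s.y1 = y;
    return y;
}

// Folds a gain stage into the feed-forward coefficients, which scales the
// filter's transfer function by the same factor as a separate gain multiply.
BiquadCoefs FoldGain(BiquadCoefs c, float gain)
{
    assert(gain == gain); // reject NaN
    c.b0 *= gain;
    c.b1 *= gain;
    c.b2 *= gain;
    return c;
}
```

The multiply per sample disappears from the audio loop and is replaced by three multiplies performed whenever the coefficients are regenerated.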
Read the TI documentation
There are many helpful optimization resources available from Texas Instruments. Out of all of the TI optimization documents we encountered, we found the Hand-Tuning Loops and Control Code on the TMS320C6000 guide to be the most helpful and complete.
Optimization on the HDX platform
Interrupt latency
Besides the large latency due to context switching (many registers to store) and the pipeline (many stages), interrupts can be disabled around pipelined loops, which cannot be interrupted. This can be controlled with the -mi=X compiler option, which will disallow unsafe pipelining for loops that are longer than X cycles. See TI's documentation (SPRU187O Section 2.12) for more details and references regarding this behavior.
External memory access
A loop which performs many reads and writes may require access to external memory. In this scenario, the loop may take tens or even hundreds of times longer to execute than the compiler expects it to!
There are two options for dealing with this:
-
Search and destroy these loops individually. For each such loop, one of the following approaches can be used: move all the data used by the loop to internal RAM, use HDX's DMA facilities for external memory accesses, or use #pragma FUNC_INTERRUPT_THRESHOLD to disable pipelining on a case-by-case basis.
-
For modules that are known to have these loops but are not worth hand optimizing, turn off pipelined loop optimization altogether (-mu, also known as --disable_software_pipelining).
- Note
- This is only a problem in the C67(0-2)x ISA used on the HDX platform. In the C64xx and C674x ISAs, there is an SPLOOP command which can buffer the branches within pipelined loops to allow them to be interruptible.
Code Composer Studio optimization tools
Compiler Consultant
The Compiler Consultant tool can be used to suggest additional optimizations.
To enable the Compiler Consultant in Code Composer Studio, do the following:
-
Set an optimization level of -o2 or -o3 (found in CCSv4 under Build Options > Compiler > Basic)
-
Set the --consultant: Generate Compiler Consultant Advice switch (found in CCSv4 under Build Options > Compiler > Feedback)
Optimization information file
Optimization information files can be generated in Code Composer Studio by selecting the option Build Options > Compiler > Feedback > Opt Info File. Optimization information files have an .nfo extension and are placed into the project's intermediate build products directory. In general, these files list function call-graph information and describe whether or not individual functions can be inlined.
Error Codes
The following appendices document error codes that are specific to plug-in hosting in Pro Tools HDX and other AAX platforms based on the TI DSP environment.
-138xx: DHM Core DSP errors
These errors relate to routing and assignment problems on Pro Tools HDX hardware. Plug-ins should never be able to trigger these error codes, which indicate low-level problems in the system.
Table 1: DHM Core DSP error codes
Value | Definition |
-13801 | ePSError_CTIDSP_WrongSampleRate |
-13802 | ePSError_CTIDSP_NoFreeStreams |
-13803 | ePSError_CTIDSP_StreamCreationTimeout |
-13804 | ePSError_CTIDSP_StreamDestruction |
-13805 | ePSError_CTIDSP_InactiveStream |
-13806 | ePSError_CTIDSP_StreamCorrupted |
-13807 | ePSError_CTIDSP_QueueFull |
-13808 | ePSError_CTIDSP_NullPointer |
-13809 | ePSError_CTIDSP_WrongStreamID |
-13810 | ePSError_CTIDSP_ImageError |
-13811 | ePSError_CTIDSP_ResetError |
-13812 | ePSError_CTIDSP_ImageVerify |
-13813 | ePSError_CTIDSP_DSPAlreadyInBootOrReset |
-13814 | ePSError_CTIDSP_TriggerInterrupt |
-13815 | ePSError_CTIDSP_BufferSizeNotAligned |
-13816 | ePSError_CTIDSP_TimeoutWaitingForHPIC |
-13817 | ePSError_CTIDSP_SetUHPIError |
-13818 | ePSError_CTIDSP_UHPINotReady |
-140xx: AAX Host errors
These errors relate to logic failures in the AAX host software. These errors can be due to plug-in bugs or system configuration problems.
Table 2: AAX Host Software error codes
Value | Definition |
-14001 | kAAXH_Result_Warning |
-14003 | kAAXH_Result_UnsupportedPlatform |
-14004 | kAAXH_Result_EffectNotRegistered |
-14005 | kAAXH_Result_IncompleteInstantiationRequest |
-14006 | kAAXH_Result_NoShellMgrLoaded |
-14007 | kAAXH_Result_UnknownExceptionLoadingTIPlugIn |
-14008 | kAAXH_Result_EffectComponentsMissing |
-14009 | kAAXH_Result_BadLegacyPlugInIDIndex |
-14010 | kAAXH_Result_EffectFactoryInitedTooManyTimes |
-14011 | kAAXH_Result_InstanceNotFoundWhenDeinstantiating |
-14012 | kAAXH_Result_FailedToRegisterEffectPackage |
-14013 | kAAXH_Result_PlugInSignatureNotValid |
-14014 | kAAXH_Result_ExceptionDuringInstantiation |
-14015 | kAAXH_Result_ShuffleCancelled |
-14016 | kAAXH_Result_NoPacketTargetRegistered |
-14017 | kAAXH_Result_ExceptionReconnectingAfterShuffle |
-14018 | kAAXH_Result_EffectModuleCreationFailed |
-14019 | kAAXH_Result_AccessingUninitializedComponent |
-14020 | kAAXH_Result_TIComponentInstantiationPostponed |
-14021 | kAAXH_Result_FailedToRegisterEffectPackageNotAuthorized |
-14022 | kAAXH_Result_FailedToRegisterEffectPackageWrongArchitecture |
-14023 | kAAXH_Result_PluginBuiltAgainstIncompatibleSDKVersion |
-14100* | kAAXH_Result_InvalidArgumentValue |
-14101* | kAAXH_Result_NameNotFoundInPageTable |
-141xx: TI System errors
These errors relate to logic failures in the TI management software and generally indicate a failure in the HDX system services such as buffered message queues, context management, and callback timing.
Table 3: TI system error codes
Value | Definition |
-14101 | eTISysErrorNotImpl |
-14102 | eTISysErrorMemory |
-14103 | eTISysErrorParam |
-14104 | eTISysErrorNull |
-14105 | eTISysErrorCommunication |
-14106 | eTISysErrorIllegalAccess |
-14107 | eTISysErrorDirectAccessOfFifoBlocksUnsupported |
-14108 | eTISysErrorPortIdOutOfBounds |
-14109 | eTISysErrorPortTypeDoesNotSupportDirectAccess |
-14110 | eTISysErrorFIFOFull |
-14111 | eTISysErrorRPCTimeOutOnDSP |
-14112 | eTISysErrorShellMgrChip_SegsDontMatchAddrs |
-14113 | eTISysErrorOnChipRPCNotRegistered |
-14114 | eTISysErrorUnexpectedBufferLength |
-14115 | eTISysErrorUnexpectedEntryPointName |
-14116 | eTISysErrorPortIDTooLargeForContextBlock |
-14117 | eTISysErrorMixerDelayNotSupportedForPlugIns |
-14118 | eTISysErrorShellFailedToStartUp |
-14119 | eTISysErrorUnexpectedCondition |
-14120 | eTISysErrorShellNotRunningWhenExpected |
-14121 | eTISysErrorFailedToCreateNewPIInstance |
-14122 | eTISysErrorUnknownPIInstance |
-14123 | eTISysErrorTooManyInstancesForSingleBufferProcessing |
-14124 | eTISysErrorNoDSPs |
-14125 | eTISysBadDSPID |
-14126 | eTISysBadPIContextWriteBlockSize |
-14128 | eTISysInstanceInitFailed |
-14129 | eTISysSameModuleLoadedTwiceOnSameChip |
-14130 | eTISysCouldNotOpenPlugInModule |
-14131 | eTISysPlugInModuleMissingDependcies |
-14132 | eTISysPlugInModuleLoadableSegmentCountMismatch |
-14133 | eTISysPlugInModuleLoadFailure |
-14134 | eTISysOutOfOnChipDebuggingSpace |
-14135 | eTISysMissingAlgEntryPoint |
-14136 | eTISysInvalidRunningStatus |
-14137 | eTISysExceptionRunningInstantiation |
-14138 | eTISysTIShellBinaryNotFound |
-14139 | eTISysTimeoutWaitingForTIShell |
-14140 | eTISysSwapScriptTimeout |
-14141 | eTISysTIDSPModuleNotFound |
-14142 | eTISysTIDSPReadError |
-142xx: DIDL errors
These errors all relate to the dynamic library loading system that manages ELF DLL binaries on Pro Tools HDX hardware. For example, an eDIDL_FileNotFound error will be raised if the ELF DLL name specified by an Effect's Describe code does not match any DLL that is present in the plug-in's bundle.
|
Table 4: DIDL error codes |
|
Value | Definition |
-14201 | eDIDL_FileNotFound |
-14202 | eDIDL_FileNotOpen |
-14203 | eDIDL_FileAlreadyOpen |
-14204 | eDIDL_InvalidElfFile |
-14205 | eDIDL_ImageNotFound |
-14206 | eDIDL_SymbolNotFound |
-14207 | eDIDL_DependencyNotLoaded |
-14208 | eDIDL_BadAlignment |
-14209 | eDIDL_NotImplemented |
-144xx: HDX hardware errors
These errors relate to failures on the HDX hardware itself. Plug-ins should never be able to trigger these error codes, which indicate low-level problems in the system.
|
Table 5: HDX hardware error codes |
|
Value | Definition |
-14401 | eBerlinImageError |
-14402 | eBerlinImageWriteError |
-14403 | eBerlinInvalidArgs |
-14404 | eBerlinCantGetTMSChannel |
-14405 | eBerlinChunkWriteError |
-14406 | eBerlinChunkReadError |
-14407 | eBerlinInvalidReqID |
-14408 | eBerlinDSPInResetError |
-14409 | eBerlinDSPTimeOut |
-14410 | eBerlinIncorrectTdmCableWiring |
-14411 | eBerlinInvalidClock |
-145xx: DHM isochronous audio engine errors
These errors relate to failures within the HDX audio engine software. Plug-ins should never be able to trigger these error codes, which indicate low-level problems in the system.
|
Table 6: DHM isochronous audio engine error codes |
|
Value | Definition |
-14500 | eDsiIsochEngineGenericError |
-14501 | eDsiIsochEngineWrongChannelNumber |
-14502 | eDsiIsochEngineTxRingFull |
-14503 | eDsiIsochEngineRxRingNotReady |
-14504 | eDsiIsochEngineWrongNumberOfSamplesRequest |
-14505 | eDsiIsochEngineUnrecognizedSampleRate |
-14506 | eDsiIsochEngineUnsupportedSampleSizeBytes |
-14507 | eDsiIsochEngineUnsupportedNumberOfChannels |
-14508 | eDsiIsochEngineUnsupportedSampleRate |
-14509 | eDsiIsochEngineDMAAlreadyEnabled |
-14510 | eDsiIsochEngineDMAAlreadyDisabled |
-14511 | eDsiIsochEngineInterruptHandlerAlreadyInstalled |
-14512 | eDsiIsochEngineBadCardRecord |
-14513 | eDsiIsochEngineCantSetValueDuringStreaming |
-14514 | eDsiIsochEngineStreamingAlreadyStarted |
-14515 | eDsiIsochEngineStreamingAlreadyStopped |
-14516 | eDsiIsochEngineStreamingCantBeStarted |
-14517 | eDsiIsochEngineUnsupportedSamplesPerInterrupt |
-14518 | eDsiIsochEngineCantSetSamplesPerInterrupt |
-14519 | eDsiIsochEngineInterruptLoopAlreadyExists |
-14520 | eDsiIsochEngineGlobalDMADisabled |
-14521 | eDsiIsochEngineActiveInterruptMaskAlreadyEnabled |
-14522 | eDsiIsochEngineSDI0Errors |
-30xxx: Dynamically-generated error codes
Errors in the -30xxx range are dynamically generated codes, so the same failure point could produce a different error code depending on the order in which errors occurred. These kinds of error codes are used heavily by the TI Shell Manager, the host component that interacts with the on-DSP shell environment.
If one of these error codes is being generated by the TI Shell Manager (the most common case), then you should be able to get more information about the failure by enabling the following DigiTrace logging facility:
DTF_TISHELLMGR=file@DTP_NORMAL
or, within the DSH tool:
enable_trace_facility [DTF_TISHELLMGR, DTP_NORMAL]
This should result in a log with more information, such as the name of the failing plug-in, the dynamically generated error code, and a string description of its meaning. Depending on the failure case, the DAE dish command getlastdsploaderror can also sometimes be used to retrieve the description string for a dynamically-generated error if it was the last error generated during the DSP loading operation.
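As a first triage step, it can help to sort a raw error value into one of the ranges described in this section before deciding where to look next. The sketch below assumes only the numeric blocks named in the section headings (-141xx, -142xx, -144xx, -145xx, and -30xxx). The enum ErrorRange and the function ClassifyErrorCode are hypothetical names introduced for illustration and are not part of the AAX SDK.

```cpp
#include <cstdint>

// Hypothetical classification of the error ranges documented in this section.
enum class ErrorRange
{
    TISystem,             // -141xx: TI system errors
    DIDL,                 // -142xx: DIDL errors
    HDXHardware,          // -144xx: HDX hardware errors
    DHMIsochEngine,       // -145xx: DHM isochronous audio engine errors
    DynamicallyGenerated, // -30xxx: dynamically generated error codes
    Other                 // anything outside the ranges above
};

// True when code lies in the inclusive block [last, first], where first is
// the value closest to zero (for example first = -14100, last = -14199).
inline bool InBlock(int32_t code, int32_t first, int32_t last)
{
    return code <= first && code >= last;
}

// Hypothetical helper (not an AAX SDK function).
inline ErrorRange ClassifyErrorCode(int32_t code)
{
    if (InBlock(code, -14100, -14199)) return ErrorRange::TISystem;
    if (InBlock(code, -14200, -14299)) return ErrorRange::DIDL;
    if (InBlock(code, -14400, -14499)) return ErrorRange::HDXHardware;
    if (InBlock(code, -14500, -14599)) return ErrorRange::DHMIsochEngine;
    if (InBlock(code, -30000, -30999)) return ErrorRange::DynamicallyGenerated;
    return ErrorRange::Other;
}
```

A result of ErrorRange::DynamicallyGenerated is a hint that the numeric value alone is not meaningful and that the DigiTrace facility described above is the better source of detail.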