AAX SDK  2.4.1
Avid Audio Extensions Development Kit

How to write AAX plug-ins for Avid's TI DSP-based platforms.

Overview of TI DSP Algorithms in AAX

Avid's hardware-accelerated audio systems allow AAX plug-ins to offload their real-time processing tasks to a dedicated processor, guaranteeing reliable performance at ultra-low latency. Avid's TI DSP-based products utilize Texas Instruments DSP chips to host plug-ins in a managed shell environment.
The AAX host handles all system-level communications and resources on the DSP and provides a consistent API to manage communication between the plug-in's real-time algorithm and its other components. This design allows AAX plug-ins to use the same communication methods whether they are running natively, on a TI-based accelerated system, or in some other distributed environment.
Each AAX plug-in contains a real-time algorithm callback. For TI DSP-based platforms, this callback is compiled into a relocatable ELF DLL. This library is loaded onto the appropriate DSP by the host, and may share the DSP with other plug-ins if the host determines that the required system resources are available. A real-time execution environment called the TI Shell is also loaded onto each DSP. The TI Shell manages the DSP's memory and interrupts and guarantees reliable real-time performance even at single sample operation.

Getting Started with HDX DSP

This section provides a quick overview of what you will need for creating an AAX DSP plug-in to run on Avid's TI-based HDX DSP platforms.

The HDX DSP Platform

HDX DSP is Avid's core mixer and plug-in accelerator platform. Avid's HDX and Pro Tools | Carbon systems both use the HDX DSP platform, with multiple TI C6727 DSPs each clocked at 350 MHz. These DSPs utilize a 32-bit floating-point architecture, with the option to perform 64-bit double-precision operations at some performance cost. Each HDX card includes 18 DSPs and is connected to the host system over a high bandwidth PCIe connection, while each Pro Tools | Carbon system includes 8 DSPs and is connected to the host system over a Gigabit Ethernet connection.

DSP characteristics: instruction processing

The C6727 DSP utilizes a VLIW architecture and contains dual data paths. Each data path includes four independent functional units, so the DSP can accommodate up to 8 parallel instructions per cycle. To take advantage of this architecture, the TI compiler relies heavily on instruction pipelining for optimization.

DSP characteristics: audio buffers

In order to realize the maximum possible performance benefit from this architecture, the algorithm routine for a single HDX DSP plug-in is always called with the same buffer size. By guaranteeing that each algorithm will be called with a consistent buffer size, the TI compiler is able to properly account for any possible iterative instruction pipelining, resulting in large performance gains.
HDX DSP uses a four-sample processing quantum by default for plug-in instances. Plug-ins that require additional processing time per callback, e.g. to mitigate the overhead cost of the chip's DMA facilities, may optionally request a 16, 32, or 64-sample quantum. Note that at higher block sizes, the number of potential I/O channels available to plug-ins on a chip will be reduced.
Host Compatibility Notes:
The 32-sample and 64-sample quanta are available in Pro Tools 10.2 and higher.
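As a rough illustration of why a fixed buffer size helps, the sketch below hard-codes the trip count of the inner loop. The names `kQuantum` and `ApplyGainFixedQuantum` are hypothetical and are not part of the AAX SDK; the value 4 is the default quantum described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constant: the default HDX DSP processing quantum, in samples.
static const std::size_t kQuantum = 4;

// Applies a gain to exactly kQuantum samples. Because the trip count is a
// compile-time constant, an optimizing compiler is free to unroll or
// software-pipeline the loop without handling a variable-length remainder.
inline void ApplyGainFixedQuantum(const float* in, float* out, float gain)
{
    assert(in != 0 && out != 0);
    for (std::size_t i = 0; i < kQuantum; ++i)
    {
        out[i] = in[i] * gain;
    }
}
```

A plug-in that requests a larger quantum would use the corresponding constant instead; the point is only that the loop bound is known when the code is compiled.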

DSP characteristics: memory

Each DSP on the HDX DSP platform includes 16 MB of external RAM and 256 kB of internal RAM. The DSP has the ability to execute code from either internal or external RAM, though the real-time performance cost of external RAM accesses is significant. The chip's internal RAM is addressable at the core clock rate.
Each DSP also has a program cache of 32 kB. Plug-in code is loaded into this cache from internal memory, so for best performance your plug-in should not use more than 32 kB for its program code. You can look at the CCS-generated .map file to find your plug-in's program code size.

SDRAM performance

Asynchronous access to data in the C6727's SDRAM is very slow, requiring 50 cycles/word to read and 15 cycles/word to write. This is primarily due to clock domain bridging, lack of data caching, and the fact that data from the core is given a low priority in order to avoid stalling real-time DMA transfers.

Executing program code from external memory

The TI C6727 supports executing program code from external memory. When code executes from uncached external memory, expect cycle counts to increase by a factor of 4 to 5 compared with the equivalent internal-memory code. Assuming that no cache thrashing occurs, subsequent calls will run from the program cache, so code located in external memory and code located in internal memory will produce similar cycle counts.
Note
The CCSv4 Profiler contains a bug that produces incorrect cycle counts for cached external-memory program code. Therefore, when gathering cycle count data for a plug-in that stores its program data in external memory, an RTI-based timing method should be used.

System characteristics: DSP/host data transfers

Plug-ins loaded onto the HDX DSP platform may transfer arbitrarily large data blocks between the DSP and the host, within the limits of available DSP memory and system bandwidth.

DSP/host bandwidth

Neither AAX nor the HDX DSP platform includes any explicit limit on plug-in bandwidth. If a plug-in's data transfer requests reach the physical bandwidth limit of the system, the blocking data transfer request on the host will be delayed, because the transfer is held off in favor of higher-priority operations on the DSP. Automation data bound for other plug-ins on the affected DSPs in the same group may also be delayed.
The recommended upper limit for DSP/host data transfer requests in an individual plug-in when running on an HDX PCIe card is 10 MB/s, divided by the maximum number of plug-in instances that will run on a single chip. On the HDX card, DSPs are wired to the FPGA crossbar in groups of three, with a data bandwidth of approximately 67 MB/s for each group. The overall system bandwidth for each DSP is therefore approximately 22 MB/s (67 MB/s shared among three DSPs). This bandwidth is shared by all data reads and writes, including custom data transfer requests as well as plug-in and mixer automation and metering data.
This limit is significantly lower on Pro Tools | Carbon. Carbon uses a single group for all eight DSPs, so the overall system bandwidth for each DSP is approximately 8 MB/s. In addition, data transfers between Carbon and the host system must be executed over a Gigabit Ethernet connection with up to 75% of its bandwidth already reserved for AVB audio data. This leaves 250 Mb/s for all other command traffic. If your plug-in makes frequent or large DSP/host data transfers, be sure to test it on Pro Tools | Carbon to verify that it is compatible.
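The bandwidth figures above reduce to simple division. The following host-side sketch restates them; the function and constant names are hypothetical, and the results are only as precise as the approximate figures quoted in the text.

```cpp
#include <cassert>

// Figures quoted in the text above (approximate).
constexpr double kGroupBandwidthMBps = 67.0;          // per FPGA crossbar group
constexpr int    kDspsPerGroupHdxCard = 3;            // HDX PCIe card
constexpr int    kDspsPerGroupCarbon = 8;             // Pro Tools | Carbon
constexpr double kRecommendedPlugInLimitMBps = 10.0;  // HDX PCIe, per chip

// Approximate share of one group's bandwidth that is available to each DSP.
inline double PerDspBandwidthMBps(int dspsInGroup)
{
    assert(dspsInGroup > 0);
    return kGroupBandwidthMBps / dspsInGroup;
}

// Recommended DSP/host transfer budget for one plug-in instance on HDX PCIe.
inline double PerInstanceBudgetMBps(int maxInstancesPerChip)
{
    assert(maxInstancesPerChip > 0);
    return kRecommendedPlugInLimitMBps / maxInstancesPerChip;
}
```

For example, a plug-in that fits eight instances on a chip would aim to stay below 1.25 MB/s per instance on an HDX PCIe card, and lower still on Carbon.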

DSP/host data transfer characteristics

The minimum data transfer size for all host-to-DSP communications for HDX PCIe cards is 128 bytes. This limit applies to all host-to-DSP data transfers, including data sent to buffered ports, unbuffered ports, and private data blocks (via the AAX Direct Data interface.)
Since each transfer has a minimum size of 128 bytes, the use of many small packets does not increase transfer efficiency or save system bandwidth. Quite the opposite: updating a single 64-byte packet would require less bandwidth than updating two 4-byte packets in an HDX PCIe system, since the former would require only one 128-byte transfer while the latter will require two.
On Pro Tools | Carbon there is no minimum data transfer size. That said, for best performance on Carbon it is still recommended to minimize the number of data packets that are sent.
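The packet-size comparison above can be checked with a small host-side model. All names are hypothetical. The model assumes that a payload larger than the minimum is transferred at its own size, which the text does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Minimum host-to-DSP transfer size on HDX PCIe cards, from the text above.
constexpr std::size_t kMinTransferBytesHdxPcie = 128;

// Bytes moved for one packet: small payloads are charged the minimum size.
// Assumption: larger payloads are transferred at their own size.
inline std::size_t TransferBytes(std::size_t payloadBytes)
{
    assert(payloadBytes > 0);
    return payloadBytes < kMinTransferBytesHdxPcie ? kMinTransferBytesHdxPcie
                                                   : payloadBytes;
}

// Total bytes moved when each payload is posted as a separate packet.
inline std::size_t TotalTransferBytes(std::initializer_list<std::size_t> payloads)
{
    std::size_t total = 0;
    for (std::size_t payload : payloads)
    {
        total += TransferBytes(payload);
    }
    return total;
}
```

Under this model one 64-byte packet costs 128 bytes on the bus, while two 4-byte packets cost 256 bytes.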

TI Shell characteristics: Memory allocation

Memory resource availability

The TI Shell code that is loaded onto each DSP uses approximately 56 kB of internal memory, leaving 200 kB of internal memory per DSP. This memory is shared between the plug-ins on the chip and holds the plug-ins' code and data, per-instance blocks declared in Describe(), and instance overhead.
As a general guideline, plug-in instances should not use more than 200 / n kB of internal memory, where n is the number of instances of your plug-in that will run on a single chip based on its cycle count requirements. If each plug-in instance on the chip requires more internal memory than this then the plug-in may need to declare an explicit number of instances that can run per chip based on this memory usage rather than declaring its cycle count utilization.
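The guideline above is easy to turn into a quick estimate. This is a host-side sketch with hypothetical names; it treats code and static data as a single shared cost and ignores instance overhead, so real limits will be somewhat lower.

```cpp
#include <cassert>

constexpr int kInternalRamKB = 256;  // internal RAM per DSP
constexpr int kShellKB = 56;         // approximate TI Shell footprint
constexpr int kAvailableKB = kInternalRamKB - kShellKB;
static_assert(kAvailableKB == 200, "matches the 200 kB quoted in the text");

// Internal memory guideline per instance when n instances share one chip.
inline double PerInstanceBudgetKB(int n)
{
    assert(n > 0);
    return static_cast<double>(kAvailableKB) / n;
}

// Largest instance count that fits when sharedKB (code and static data) is
// paid once and perInstanceKB is paid for every instance.
inline int MaxInstancesByMemory(double sharedKB, double perInstanceKB)
{
    assert(perInstanceKB > 0.0);
    const double remainingKB = kAvailableKB - sharedKB;
    if (remainingKB <= 0.0)
    {
        return 0;
    }
    return static_cast<int>(remainingKB / perInstanceKB);
}
```

If the memory-based count is lower than the count implied by the plug-in's cycle requirements, that is the situation in which declaring an explicit per-chip instance limit becomes necessary.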

Shared and per-instance memory allocation

When a plug-in instance is created on a DSP, its program code is loaded onto that DSP. This copy of the program code is then re-used for all subsequent instances of the effect that are loaded onto the DSP. Static and global data are also shared between all instances of an effect on the DSP. Other allocations, such as coefficient and private data blocks, are per-instance.
Host Compatibility Notes:
Beginning in Pro Tools 11, AAX DSP algorithms also support optional temporary data spaces that can be described in the Describe module and are shared among all instances on a DSP. This is an alternative to declaring large data blocks on the stack for better memory management and to prevent stack overflows. Please refer to AAX_IComponentDescriptor::AddTemporaryData() for usage instructions.

Placing data into external memory

An AAX plug-in may optionally request that its private data or program code be placed into external memory. Because standard access calls to the DSP's SDRAM are very slow, it is strongly recommended that all of a plug-in's real-time data be placed in internal RAM, and the TI Shell will load a plug-in's program code and all private plug-in data blocks into internal memory by default.
Requesting more than 256 kB of data in internal memory for plug-in data plus the memory required by the TI Shell will lead to undefined behavior, so it is important to explicitly request external memory for plug-in data when appropriate.
For private data blocks that should be loaded into external memory, use the AAX_ePrivateDataOptions_External flag when calling AAX_IComponentDescriptor::AddPrivateData(). This flag is ignored when the plug-in runs natively, so Native AAX plug-ins will have the same functionality with or without it.
To load program code, static data, or global variables into external memory, use the TI SECTION pragmas. For example, #pragma CODE_SECTION(".extmem") can be used before the definitions of functions that contain either initialization code or infrequently used background code. For static variables, use #pragma DATA_SECTION(".extmemdata") before each variable definition.
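A minimal sketch of the pragma placement is shown below, using the TI compiler's CODE_SECTION and DATA_SECTION pragmas in their C++ form. The function and variable names are hypothetical. The pragmas are guarded with the TI compiler's predefined `__TI_COMPILER_VERSION__` macro so that the same source also builds with a host compiler, and the sketch assumes that the project's linker command file maps the `.extmem` and `.extmemdata` sections to external memory.

```cpp
#include <cassert>

// Infrequently used initialization code: a candidate for external memory.
#if defined(__TI_COMPILER_VERSION__)
#pragma CODE_SECTION(".extmem")
#endif
void BuildLookupTable(float* table, int length)
{
    assert(table != 0 && length > 0);
    for (int i = 0; i < length; ++i)
    {
        table[i] = static_cast<float>(i) / static_cast<float>(length);
    }
}

// Large static table that is assumed not to be read in the render callback.
#if defined(__TI_COMPILER_VERSION__)
#pragma DATA_SECTION(".extmemdata")
#endif
static float gLookupTable[1024];
```

Data that the render callback reads on every call should stay in internal memory, or be copied there, given the access costs described earlier.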

DMA support

Because of the slower access time of external RAM, you should consider using a DMA transfer for recurring transfers, and possibly even for larger one-time transfers. This is of particular relevance for data reads, which must traverse the various clock domains and priority switches twice (address send, and then data return.)
The TI Shell supports three DMA modes: Scatter (for transfers from internal to external memory), Gather (for transfers from external to internal memory), and Burst (contiguous block copies). In the best case, Scatter mode transfers data at 2.1 DSP cycles per byte, while Gather mode transfers data at 2.7 DSP cycles per byte.
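To see why DMA is worth considering, the rates quoted in this document can be compared directly. The host-side sketch below uses hypothetical names and assumes a 32-bit (4-byte) word for the per-word SDRAM figures; the DMA numbers are the best-case rates quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: one DSP word is 32 bits (4 bytes).
constexpr double kBytesPerWord = 4.0;

// Direct (asynchronous) SDRAM access costs quoted earlier in this document.
constexpr double kDirectReadCyclesPerWord = 50.0;
constexpr double kDirectWriteCyclesPerWord = 15.0;

// Best-case DMA rates quoted above.
constexpr double kScatterCyclesPerByte = 2.1;  // internal to external
constexpr double kGatherCyclesPerByte = 2.7;   // external to internal

inline double DirectReadCycles(std::size_t bytes)
{
    assert(bytes % 4 == 0);  // whole words only
    return bytes / kBytesPerWord * kDirectReadCyclesPerWord;
}

inline double DirectWriteCycles(std::size_t bytes)
{
    assert(bytes % 4 == 0);  // whole words only
    return bytes / kBytesPerWord * kDirectWriteCyclesPerWord;
}

inline double GatherDmaCycles(std::size_t bytes)
{
    return bytes * kGatherCyclesPerByte;
}

inline double ScatterDmaCycles(std::size_t bytes)
{
    return bytes * kScatterCyclesPerByte;
}
```

For a 1 kB block this gives roughly 12,800 cycles for direct reads against about 2,765 cycles for a Gather transfer. The direct figure is time the core spends stalled, whereas the DMA figure is elapsed transfer time.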
The Scatter and Gather DMA facilities use a linear buffer for internal memory and a FIFO for external memory. It is possible to transfer to or from multiple offsets within the external memory FIFO using an offset table, which can contain up to 65,536 (2^16) entries. The burst length at each offset may be 4, 8, 16, 32, or 64 bytes.
The TI Shell also supports a Burst DMA mode which implements linear data reads or writes.
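The offset-table idea can be pictured with a host-side model of a Gather transfer. This is not the TI Shell DMA API (see the DemoGain_DMA example for that); `ModelGather` and its parameters are hypothetical, and the wrap-around at the end of the external buffer is an assumption about how the FIFO behaves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model only: copies one burst per offset-table entry from a circular
// "external" buffer into a linear "internal" buffer.
inline void ModelGather(const std::vector<unsigned char>& externalFifo,
                        const std::vector<std::size_t>& offsetTable,
                        std::size_t burstBytes,
                        std::vector<unsigned char>& internalLinear)
{
    // Limits taken from the text above.
    assert(burstBytes == 4 || burstBytes == 8 || burstBytes == 16 ||
           burstBytes == 32 || burstBytes == 64);
    assert(offsetTable.size() <= 65536);
    assert(!externalFifo.empty());

    internalLinear.resize(offsetTable.size() * burstBytes);
    for (std::size_t entry = 0; entry < offsetTable.size(); ++entry)
    {
        for (std::size_t b = 0; b < burstBytes; ++b)
        {
            // Assumed FIFO behavior: wrap at the end of the external buffer.
            const std::size_t src = (offsetTable[entry] + b) % externalFifo.size();
            internalLinear[entry * burstBytes + b] = externalFifo[src];
        }
    }
}
```

A Scatter transfer would be the mirror image, writing each burst from the linear internal buffer to the listed offsets in external memory.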
For more information on DMA support and for example code, see \ExamplePlugIns\DemoGain_DMA in the SDK.

TI Shell characteristics: Data packet services

In addition to supporting direct transfers of arbitrary data via DMA, the TI Shell also supports a packetized data delivery mechanism for host-to-DSP data transfers. Packet delivery ports may be either unbuffered or buffered, and are described using the AAX_EDataInPortType parameter in AAX_IComponentDescriptor::AddDataInPort().

Unbuffered ports

Unbuffered ports use a straightforward implementation that delivers posted packets to the algorithm as soon as possible. In an unbuffered port, newer packets will always override older packets. Therefore, an algorithm may not receive every packet that was posted to an unbuffered port, but it will always receive the most up-to-date information possible.
Unbuffered ports deliver their data without blocking or synchronizing with the algorithm's execution. Although bus arbitration guarantees that a read from the algorithm callback will not occur in the middle of a write from the host, it is important to note that the data in an unbuffered port may change during algorithm execution.

Buffered ports

Buffered data ports store incoming packets in a host-managed queue. This queue acts as a buffer and provides the host with more flexibility in how it delivers packets. A key feature of buffered data ports is that new data will never be delivered to these ports during algorithm execution.
The behavior of buffered data ports varies depending on the host platform. In HDX DSP plug-ins, buffered data ports use a FIFO to queue data packets as they are posted. New packets are dequeued and delivered to the algorithm individually, with the next packet arriving before each algorithm render callback.
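The difference between the two port types can be modeled in a few lines. The classes below are hypothetical host-side models, not AAX types: the unbuffered model keeps only the newest packet, while the buffered model queues packets and hands over at most one before each render callback.

```cpp
#include <cassert>
#include <queue>

// Unbuffered port model: a newer packet always overrides an older one.
template <typename Packet>
class UnbufferedPortModel
{
public:
    void Post(const Packet& packet) { mLatest = packet; mHasData = true; }
    const Packet& Read() const { assert(mHasData); return mLatest; }

private:
    Packet mLatest{};
    bool mHasData = false;
};

// Buffered port model: packets wait in a FIFO and the algorithm's view only
// changes between render callbacks, one packet at a time.
template <typename Packet>
class BufferedPortModel
{
public:
    explicit BufferedPortModel(const Packet& initial) : mCurrent(initial) {}

    void Post(const Packet& packet) { mQueue.push(packet); }

    // Call before each render callback to deliver the next queued packet.
    void BeginRenderCallback()
    {
        if (!mQueue.empty())
        {
            mCurrent = mQueue.front();
            mQueue.pop();
        }
    }

    const Packet& Read() const { return mCurrent; }

private:
    std::queue<Packet> mQueue;
    Packet mCurrent;
};
```

Posting three packets to the unbuffered model leaves only the last one visible, whereas the buffered model steps through every packet in order, one per callback.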

Data port overhead and restrictions

Each HDX DSP supports a maximum of 164 buffered data ports, which matches the maximum I/O limit for each DSP. System overhead costs associated with using the on-chip packet services are as follows:

Memory Overhead

CPU overhead

Unbuffered ports do not incur any additional CPU overhead.
Individual buffered ports incur non-trivial CPU overhead. For example, in Pro Tools 10.2 each buffered port requires 5 cycles of overhead per render callback. This overhead can quickly add up in "small" plug-ins that contain many buffered data ports. Therefore, we strongly recommend that plug-ins use consolidated coefficient packets when possible in order to minimize this overhead. This optimization can result in large performance gains for callbacks that require 1000 or fewer cycles to operate.
The trade-off of this optimization is that more work ends up being done on the host and more data must be transmitted to the algorithm, since the entire coefficient packet must be re-calculated and re-sent every time any of its input parameters change. This is usually a beneficial trade-off to make, especially given the 128-byte per-transfer minimum for HDX PCIe cards discussed above. However, care must be taken in extreme cases, such as when packet delivery threatens to reach the maximum recommended bandwidth for host/DSP data transfers, especially on Pro Tools | Carbon.
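The effect of consolidating ports can be estimated from the per-port figure quoted above. This is a host-side sketch with hypothetical names; the per-port cost is the Pro Tools 10.2 value from the text and may differ in other host versions.

```cpp
#include <cassert>

// Per-callback cost of one buffered port, as quoted above for Pro Tools 10.2.
constexpr int kCyclesPerBufferedPort = 5;

// Total packet-service overhead per render callback.
inline int PortOverheadCycles(int bufferedPorts)
{
    assert(bufferedPorts >= 0);
    return bufferedPorts * kCyclesPerBufferedPort;
}

// Share of the total callback cost that is spent on port overhead.
inline double OverheadFraction(int bufferedPorts, int algorithmCycles)
{
    assert(algorithmCycles > 0);
    const double overhead = PortOverheadCycles(bufferedPorts);
    return overhead / (overhead + algorithmCycles);
}
```

With these figures, twelve separate buffered ports add 60 cycles to every callback, while a single consolidated coefficient packet adds 5.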

TI Shell characteristics: Instance allocation

Multi-shell packing

With a few exceptions, AAX DSP plug-ins will share DSPs with other plug-ins. This is transparent to the plug-in because all system resource management is handled by the TI Shell.
When a new plug-in instance is created, the TI Shell and AAX host will attempt to intelligently allocate it to a DSP based on both memory and CPU resource requirements. If one plug-in on the chip requires a large amount of memory and very few processing cycles, it may be packed with another plug-in that does not require much memory but that is very CPU intensive.
The exceptions to this model are plug-ins that use DMA, register for a background processing callback, register a maximum number of instances per chip, or use a processor affinity constraint when reporting CPU requirements. Except for plug-ins that use only a processor affinity constraint, these plug-ins will receive dedicated DSPs to which only additional instances of the same plug-in type will be added.
Host Compatibility Notes:
Beginning with Pro Tools 10.2, the TI shell supports a "processor affinity" property, which indicates that a DSP ProcessProc should be preferentially loaded onto the same DSP as other instances from the same DLL binary. This is a requirement for some designs that must share global data between different processing configurations.

Note that this property should only be used when absolutely required, as it will constrain the DSP manager and reduce overall DSP plug-in instance counts on the system.

DSP Shuffles

A DSP shuffle will occur in Pro Tools when the engine must re-allocate DSP resources in order to make more processing power available. A shuffle will force the re-instantiation of the plug-in's DSP algorithm component, potentially on a new chip, while leaving the plug-in's host objects intact. During a shuffle, the engine will perform the following steps:
  1. Disconnect audio from an effect
  2. Call instance initialization with the removing instance flag on the old location
  3. Repeat for all instances of all DSP Effects in the system
  4. Load the effect in the new location
  5. Re-send the last packets to all data-in ports
  6. Call private data init for any private data
  7. Call instance init with the 'adding instance' flag, in the new location
  8. Begin audio processing
  9. Reconnect audio
  10. Repeat the instantiation and connection process for all instances of all DSP Effects in the system
Note that the system may perform some audio processing with each new instance before all of the Effect instances in the system have been re-instantiated.

Additional TI Shell services

Background processing

AAX plug-ins may request idle time from the main TI Shell thread. This results in a true idle context callback which can be used for non-critical background processing tasks on the DSP. This facility restricts the DSP to only allocate plug-in instances of the same type.
A plug-in's background processing callback is not provided with a reference to the plug-in's data structures and must therefore access plug-in data via global variables. The background process will be interrupted by system events and the audio render callback. For more information and an example on how to create a plug-in that relies on background processing, see \ExamplePlugins\DemoGain_Background in the SDK.

Requirements for HDX DSP Plug-Ins

Plug-in description

To support HDX DSP platforms, a plug-in must add a TI ProcessProc (real-time processing entrypoint) for each of its algorithms. This is done via a call to AAX_IComponentDescriptor::AddProcessProc_TI(), which is parametrized with the names of both the algorithm's TI DLL and of its exported entrypoint.
At minimum, the TI ProcessProc requires the following AAX Properties:

Performance measurement and reporting

In order to determine each algorithm's resource requirements, the host collects cycle count information from the plug-in via the plug-in's Describe callback. Each plug-in Effect is responsible for correctly reporting its algorithms' cycle counts for each accelerated platform that it supports. For plug-ins that use DMA or background threads, a maximum per-chip instance count is also required.
Note
All reported values must represent the algorithm's worst case performance.
Each of these values is reported as a property of a given algorithm ProcessProc and is provided by the plug-in via AAX_IComponentDescriptor::AddProcessProc_TI(). If an effect does not report its cycle count usage then it will be limited to a single instance per TI chip. This can be useful during development, but is not a supported mode for general use; all shipped plug-ins must correctly report their cycle requirements.
The DigiShell utility can be used to accurately measure plug-in cycle count requirements. For more information about DigiShell, see DSH Guide.

Shared vs. per-instance cycles

Because a single call into a plug-in is used to process multiple instances of that effect on that chip, two cycle count properties must be reported for each TI algorithm:
  1. AAX_eProperty_TI_SharedCycleCount
    This property describes the algorithm's one-time processing overhead that doesn't change as instances are added to a chip.
  2. AAX_eProperty_TI_InstanceCycleCount
    This property describes the additional cycle counts that each instance adds to the base shared overhead.
Many plug-ins exhibit different performance characteristics for both of these metrics depending on the plug-in's state. When reporting a plug-in's shared and per-instance cycle count requirements it is important to ensure that the reported values are the maximum possible requirements of the algorithm.
Often a plug-in will experience its worst-case per-instance processing load in one configuration and its worst-case shared processing load in another configuration. In this situation, the plug-in's reported cycle count requirements should reflect the state in which the sum of the two metrics is highest.
It is common practice to omit AAX_eProperty_TI_InstanceCycleCount and AAX_eProperty_TI_SharedCycleCount while a DSP plug-in is being developed and debugged. This is acceptable, although a single instance of such a plug-in will then require a whole chip. The AAX SDK example plug-ins implement this using the AAX_TI_BINARY_IN_DEVELOPMENT macro: when it is defined, the cycle count properties are not described for the plug-in.
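The relationship between the two cycle count properties and the resulting instance count can be sketched as follows. This is a host-side illustration with hypothetical names. It assumes that the whole 350 MHz clock is available to plug-ins during each processing quantum, which overstates the real budget because the TI Shell and I/O handling also consume cycles.

```cpp
#include <cassert>

// DSP core clock quoted earlier in this document.
constexpr double kDspClockHz = 350.0e6;

// Upper bound on DSP cycles available per render callback.
inline double CyclesPerCallback(double sampleRateHz, int quantumSamples)
{
    assert(sampleRateHz > 0.0 && quantumSamples > 0);
    return kDspClockHz * quantumSamples / sampleRateHz;
}

// Largest n for which shared + n * instance fits in the cycle budget.
inline int MaxInstancesByCycles(double budgetCycles,
                                double sharedCycles,
                                double instanceCycles)
{
    assert(instanceCycles > 0.0);
    if (sharedCycles >= budgetCycles)
    {
        return 0;
    }
    return static_cast<int>((budgetCycles - sharedCycles) / instanceCycles);
}
```

At 48 kHz with a four-sample quantum the bound is roughly 29,167 cycles per callback, so an algorithm reporting 1,000 shared cycles and 700 per-instance cycles could fit at most 40 instances on a chip under this simplified model.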

Measuring shared cycles

Measuring shared cycle counts requires instantiating multiple instances of an effect and observing how the processing time changes as instances are added. The shared and instance cycle counts are then calculated by performing a linear regression on the number of uncached cycle counts as the number of plug-in instances on the chip increases.
Note that these values will differ between debug and release builds of an algorithm, so a plug-in's describe function should report the correct cycle count values based on the relevant build configuration.
DigiShell includes the ability to measure shared cycle counts using the DAE.cyclesshared command. For more information about performance profiling using DigiShell, see Cycle count performance test.
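The regression itself is an ordinary least-squares line fit in which the intercept estimates the shared cycles and the slope estimates the per-instance cycles. The sketch below is a host-side illustration with hypothetical names and is not how DigiShell is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CycleFit
{
    double sharedCycles;    // intercept of the fitted line
    double instanceCycles;  // slope of the fitted line
};

// Least-squares fit of: measuredCycles = shared + instance * instanceCount.
inline CycleFit FitCycleCounts(const std::vector<double>& instanceCounts,
                               const std::vector<double>& measuredCycles)
{
    assert(instanceCounts.size() == measuredCycles.size());
    assert(instanceCounts.size() >= 2);

    const double n = static_cast<double>(instanceCounts.size());
    double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
    for (std::size_t i = 0; i < instanceCounts.size(); ++i)
    {
        sumX += instanceCounts[i];
        sumY += measuredCycles[i];
        sumXX += instanceCounts[i] * instanceCounts[i];
        sumXY += instanceCounts[i] * measuredCycles[i];
    }

    const double denominator = n * sumXX - sumX * sumX;
    assert(denominator != 0.0);  // needs at least two distinct instance counts

    const double slope = (n * sumXY - sumX * sumY) / denominator;
    const double intercept = (sumY - slope * sumX) / n;
    return CycleFit{intercept, slope};
}
```

For measurements of 150, 250, 350, and 450 cycles at one to four instances, the fit yields 50 shared cycles and 100 cycles per instance. The measurements fed into the fit should come from the algorithm's worst-case state.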
Note
HDX DSP requires reporting of an algorithm's worst-case cycle counts.
Because HDX PCIe and Pro Tools | Carbon use the same HDX DSP platform, either product may be used to take plug-in cycle count measurements.

DMA and background thread performance reporting

For algorithms that use DMA or background thread facilities, the maximum number of algorithm instances that will fit on a chip is difficult to predict from cycle counts alone. Due to the asynchronous behavior and limited capacity of the DMA system, the DMA system may begin to miss its deadlines before the CPU is fully loaded. In addition, due to differences in background processing requirements between algorithms, an effect's background process may begin to miss its deadlines and be starved before the interrupt-time audio processing is at capacity. Plug-ins that use these facilities must therefore report the maximum number of instances that will run reliably at a given sample rate, in addition to reporting their shared and per-instance cycle counts as above.
Some other plug-ins may also wish to report the maximum number of instances that will run reliably at each sample rate. For example, plug-ins that use a lot of host/DSP data bandwidth may need to limit the number of instances per DSP chip in order to run successfully on Pro Tools | Carbon.
Maximum reliable instance counts are reported using an additional property, AAX_eProperty_TI_MaxInstancesPerChip. A plug-in should register separate components for the following three sample rate ranges in order to register distinct values for this property:
  1. Sample rates from 42kHz to 50kHz
  2. Sample rates from 84kHz to 100kHz
  3. Sample rates from 168kHz to 200kHz
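The three ranges above can be expressed as a simple classifier. The enum and function below are hypothetical and only illustrate the bucketing; they do not show how components are registered with the host.

```cpp
#include <cassert>

// Hypothetical labels for the three sample rate ranges listed above.
enum SampleRateRange
{
    kSampleRateRange_Unsupported,
    kSampleRateRange_42kTo50k,
    kSampleRateRange_84kTo100k,
    kSampleRateRange_168kTo200k
};

// Maps a sample rate in Hz to the range it falls into.
inline SampleRateRange ClassifySampleRate(double sampleRateHz)
{
    assert(sampleRateHz > 0.0);
    if (sampleRateHz >= 42000.0 && sampleRateHz <= 50000.0)
    {
        return kSampleRateRange_42kTo50k;
    }
    if (sampleRateHz >= 84000.0 && sampleRateHz <= 100000.0)
    {
        return kSampleRateRange_84kTo100k;
    }
    if (sampleRateHz >= 168000.0 && sampleRateHz <= 200000.0)
    {
        return kSampleRateRange_168kTo200k;
    }
    return kSampleRateRange_Unsupported;
}
```

Each range would then carry its own maximum instance count, typically with smaller counts at higher sample rates because each instance has fewer cycles available per sample.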
Notes regarding DMA and background thread performance reporting:

Dynamic resource usage

All resources used by an AAX DSP plug-in algorithm are considered static. Plug-ins may not dynamically change the amount of memory or DSP cycles that are allocated to them after these metrics are provided in Describe.
The ability to dynamically change DSP cycle count requirements at run time is provided in the AAX SDK but is not currently supported by any host.

Plug-in compilation and packaging

Exported symbols

Each HDX DSP algorithm (ELF DLL) may contain multiple entrypoints. A single DLL may be used for all of your plug-in's entrypoints and program code, or you may divide your plug-in's entrypoints and program code between multiple DLLs.
Your plug-in must export one "C"-style callback for each algorithm ProcessProc that your plug-in registers. This entrypoint must conform to the standard AAX real-time algorithm callback prototype:
#include "elf_linkage_aax_ccsv5.h" // Includes required TI_EXPORT definition

extern "C"
TI_EXPORT
void
MyEffect_AlgorithmProcessFunction(
    SMyEffect_Alg_Context * const inInstancesBegin [],
    const void * inInstancesEnd);
Listing 1.1: The standard AAX real-time algorithm callback prototype
See Settings for exported symbols if you are running into problems linking your AAX DSP binary.
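A callback with this prototype typically walks the array of instance contexts from inInstancesBegin up to inInstancesEnd and processes each instance in turn. The sketch below shows that pattern in a form that also compiles on a host machine. The context structure and its fields are hypothetical, and the empty `TI_EXPORT` fallback is a stand-in for the definition that the SDK header provides in a real TI build.

```cpp
#include <cassert>
#include <stdint.h>

// Stand-in so the sketch builds without the SDK header.
#ifndef TI_EXPORT
#define TI_EXPORT
#endif

// Hypothetical context: one mono input, one mono output, and a gain value.
struct SMyEffect_Alg_Context
{
    float**  mInputPP;      // audio input channel pointers
    float**  mOutputPP;     // audio output channel pointers
    int32_t* mBufferSizeP;  // number of samples in this callback
    float*   mGainP;        // coefficient delivered by the host
};

extern "C"
TI_EXPORT
void
MyEffect_AlgorithmProcessFunction(
    SMyEffect_Alg_Context * const inInstancesBegin [],
    const void * inInstancesEnd)
{
    SMyEffect_Alg_Context * const * const end =
        static_cast<SMyEffect_Alg_Context * const *>(inInstancesEnd);

    // A single call processes every instance of the effect on this chip.
    for (SMyEffect_Alg_Context * const * walk = inInstancesBegin; walk < end; ++walk)
    {
        SMyEffect_Alg_Context * const instance = *walk;
        assert(instance != 0);

        const float* const in = instance->mInputPP[0];
        float* const out = instance->mOutputPP[0];
        const float gain = *instance->mGainP;
        const int32_t numSamples = *instance->mBufferSizeP;

        for (int32_t i = 0; i < numSamples; ++i)
        {
            out[i] = in[i] * gain;
        }
    }
}
```

Any work placed outside the per-instance loop is paid once per call, which is the distinction behind the shared and per-instance cycle counts described earlier.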

Packaging

The ELF DLLs for an AAX DSP plug-in must be placed in the ./Contents/Resources directory within the plug-in bundle.

TI Development Tools

Development for TI algorithms is primarily performed in TI's Code Composer Studio. Code Composer Studio (CCS) is a full-featured, Eclipse-based IDE providing JTAG hardware debugger support, a hardware simulator, and a suite of profiling tools. Most importantly, CCS includes an excellent C compiler that is capable of providing highly optimized DSP instructions without too much tuning.
Note
As of this writing, Code Composer Studio for Mac does not support the C6000 series processor. CCS for Windows is required for AAX DSP plug-in development. See MacOS Host Support CCSv7 on the Texas Instruments wiki for current compatibility information.

Code Composer Studio

The AAX SDK supports Code Composer Studio versions 4 ("CCSv4") and higher ("CCSv5", etc.), with hardware debugging support beginning in version 4.2. As of the writing of this documentation, CCS versions 4, 5, and 7 have been tested by Avid.
Note
This documentation was originally written for CCSv4 and was later updated with instructions for updating from CCSv4 to CCSv5. Versions 5 and higher use a different project file format from version 4; when this documentation describes changes required for version 5 then these changes will also be required by other later versions which use this new project format.

Installation

  1. Download and install the latest Code Composer Studio from TI's website.

    Note
    Windows 10 requires Code Composer Studio version 6.1.3 or higher
    As of Code Composer Studio version 7 TI does not charge for licenses. You can simply download the tool and start using it. Along with this the end user license agreement has changed to a simple TSPA compatible license. For more information see the TI web site.
  2. The default installation will work fine, but a custom install will be smaller. You only need support for the C6000 chipset and the Spectrum Digital JTAG drivers, so you can deselect all the other chipsets and JTAG drivers.
  3. Go to TI's Code Generation Tools page. You will need to log in.
  4. Download and install the C6000 Code Generation Tools v7.0.x or later, using the typical installation settings. For AAX DSP development you will only need support for the C6000 chipset and, if you will be using a hardware debugger, for the Spectrum Digital JTAG drivers, so you may deselect all the other chipsets and JTAG drivers.

    1. Launch CCS and go to Help > Install New Software...
    2. In the opened dialog select "Code Generation Tools Updates" in the "Work with:" drop-down list.
    3. Select "TI Compiler Updates" > "C6000 Compiler Tools [version]".
    4. Press Next and continue installation using the "typical" installation settings.

    As of the publishing of this version of the AAX SDK Avid is internally using v7.4.6. Avid has tested 7.4.4 and 7.4.6, but we assume that later revisions will work as well. The latest CGTools version available as of this writing is v7.4.21.

    For more information about configuring your CCS workspace with CGTools v7.4.x, see Workspace setup

Workspace setup

The idea of a CCS workspace is similar to a Visual Studio solution file. Note that workspaces tend to store absolute paths and developer-specific info, so you may wish to avoid checking them in to your source control server.

Setting up workspace-global macros

To set up workspace global macros:
  1. When you open CCS for the first time, select a directory for your "workspace". As mentioned above, we recommend that this be outside of your source tree.
    Note
    Note that you cannot reuse your Code Composer Studio workspace after updating to a later version. In particular, we have found that CCSv4 workspaces are incompatible with CCSv5. After updating your system to a later Code Composer Studio version you must create a new workspace and import your existing projects into this new workspace.
  2. Go to File > Import... and select Code Composer Studio > Build Variables (CCS > Managed Build Macros in CCSv4.) Click Next.
  3. Browse to TI/CCSv4/macros.ini in your AAX SDK directory and click Finish.
  4. This will define an "SDK_SOURCE_ROOT" Linked Resource path variable and Managed Build macro, which associates the CCS workspace with a single AAX SDK installation.
    Note
    A side effect of this is that you cannot use projects from multiple distinct AAX SDK installations in the same CCS workspace.
  5. To verify that the correct path has been set, go to Window > Preferences... and look in General > Workspace > Linked Resources, and C/C++ > Build > Build Variables (C/C++ > Managed Build > Macros for CCSv4.)

Importing projects into your workspace

To import projects into your workspace:
  1. In the IDE, go to Project > Import Existing CCS/CCE Eclipse Project
  2. In Select search-directory, select the root of your AAX SDK installation
  3. The projects in the resulting Projects list will automatically be selected
  4. Click Finish, and then wait while the projects are imported.
In order to import CCSv4 projects into later versions of Code Composer Studio it is necessary to add a .cdtproject file to the project. If you don't have this file in your project, then you can copy it from any other existing project which was created using CCSv5 or later. Otherwise you will most likely see something similar to this error:

"Error: Import failed for project 'xxxx' because its meta-data cannot be interpreted."

If you try to build this newly imported CCSv4 project in a later version of Code Composer Studio then you will get the warning:

"This project was created using a version of compiler that is not currently installed: 7.0.5 [C6000]. Another version of the compiler will be used during build: 7.4.6. Please install the compiler of the required version, or migrate the project to one of the available compiler versions by adjusting project properties."

This warning may be cleared by changing Properties > General > Compiler Version from TI v7.0.x to the current version (e.g. TI v7.4.x). After that, the "Output format" field, which is next to the "Compiler version" field and is typically grayed out, will become active. You should choose "eabi (ELF)" there. Otherwise the build will fail with errors.
Note
After the project has been converted and built successfully, re-measure its cycle counts, because they may change. Most likely they will decrease compared with the version that was built with CCSv4, but that is not guaranteed. The size of the DLL may also increase, which may require reducing code size in order to properly instantiate the plug-in.

Creating new projects

New project setup

Use the following settings in the "New Project..." wizard. Defaults are in italics.
Note
You can edit the Linker Command File setting to use the SDK_SOURCE_ROOT macro by manually editing the project's .project XML file or by adding the file to your project using a relative path. See the SDK sample plug-in projects for an example.

Recommended settings for AAX plug-in projects

Tool Settings
C6000 Compiler
Include Options
-include_path "${SDK_SOURCE_ROOT}/Interfaces"
-include_path "${SDK_SOURCE_ROOT}/[Plug-in directory]"
The SDK_SOURCE_ROOT macro is defined via the macros.ini file, located in the SDK's /TI/CCSv4 directory. If you encounter errors using this macro, import the file using File > Import... > CCS > Managed Build Macros.
Tool Settings
C6000 Compiler
Command Files
-cmd_file "${SDK_SOURCE_ROOT}\\TI\\CCSv4\\CommonPlugIn_CompilerCmd.cmd"
This file contains additional compiler commands that should be common to all AAX plug-in projects
Tool Settings
C6000 Linker
Basic Options
-o "${ConfigDir}/${PackageName}/Contents/Resources/${ProjName}.dll"
This path will ensure that your compiled TI DLL is placed in the appropriate location inside your AAX plug-in bundle.
Tool Settings
C6000 Linker
Runtime Environment
(No "Initialization model" options set)


Build Settings
Artifact name
${ConfigDir}/${PackageName}/Contents/Resources/${ProjName}
This path will ensure that your compiled TI DLL is placed in the appropriate location inside your AAX plug-in bundle.
Build Settings
Artifact extension
dll
AAX TI libraries should use the .dll extension
Binary Parser
Elf Parser
AAX TI libraries should use the Elf binary parser only
Macros
Project
User Macros
ConfigDir = ${OutDir}/${ConfigName}
IntDir = ${ConfigDir}/int/${PackageName}/TI/${ProjName}
OutDir = ${ProjDirPath}/../../WinBuild
PackageName = [Plug-in name]
These macros are used by the other settings here to ensure proper path set-up and artifact naming. Don't worry that ConfigName shows up as undefined - it will be defined as Debug/Release at compilation.

Recommended Release configuration settings

Tool Settings
C6000 Compiler
Basic Options
-symdebug:none
-O3
Tool Settings
C6000 Compiler
Predefined Symbols
-define=NDEBUG
Tool Settings
C6000 Compiler
Optimizations
-os
-on2
-op3
Tool Settings
C6000 Compiler
Assembler Options
-keep_asm

Other useful project settings

Tool Settings
C6000 Compiler
Predefined Symbols
-define _DEBUG
This option is useful for differentiating cycle count reporting for Debug vs. Release builds.
Tool Settings
C6000 Compiler
Directory Specifier
-ft "${IntDir}"
-fr "${IntDir}"
-fs "${IntDir}"
Useful for collecting intermediate files
Tool Settings
C6000 Linker
Basic Options
-m "${IntDir}/${ProjName}.map"
Useful for placing the map file alongside all other intermediates
Tool Settings
C6000 Linker
File Search Path
-l (nothing)
You can exclude libc.a, which is included by default, from this option unless you require C library features.

Adding files and folders

In CCS, dragging files into the project, using "Add Files to Project...", or using "Link Files to Project..." will either copy the file into the project directory or create an absolute path to the file. This is usually not the desired behavior. Use the following steps to add a file using a relative path:  
  1. Right click on the project you'd like to add files to, and select New > File (NOT "Source File" or "Header File").
  2. Click "Advanced >>".
  3. Check the box that says "Link to the file in the system". Click "Variables..."
  4. Select the appropriate variable (usually either SDK_SOURCE_ROOT or SOURCE_ROOT) and click "Extend..."
  5. Find the file you want to add. Click OK. Click Finish.
Note that, when adding folders, everything in the folder will be built by default. You can exclude files to work around this behavior.

Settings for exported symbols

The TMS320C6000 C++ compiler

One of the primary goals of AAX is to provide a platform-agnostic development architecture in which products can easily be developed and re-used across a wide variety of platforms. However, it is still occasionally necessary to write platform-specific code. This section will document methods for producing code that is specific to the TI C6727 platform using the TMS320C6000 C++ compiler.

C++ standard support

The TMS320C6000 compiler supports C++ as defined in the ISO/IEC 14882:1998 standard. The exceptions to the standard are as follows:

Predefined environment symbols

The following symbols are predefined by the compiler on the TI architecture, and should be used in code concerned with cross-platform support:
Although you should not require them for AAX development, equivalent assembly predefines are as follows:

Loop controls

The TI compiler supports several pragmas that can be used to give the compiler additional information about loops.
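As a minimal sketch of one such pragma, the hypothetical function below uses MUST_ITERATE to promise the compiler that its loop runs between 4 and 64 times and that the trip count is a multiple of 4. The function name, the chosen bounds, and the _TMS320C6X guard are assumptions made for illustration; the guard only keeps host compilers from seeing a pragma that they do not recognize.

```cpp
#include <stdint.h>

// Hypothetical helper, not part of the AAX SDK: applies a fixed gain to a buffer.
// The caller must guarantee that numSamples is in [4, 64] and is a multiple of 4,
// because the pragma below is a promise to the compiler, not a runtime check.
inline void ApplyGainSketch(const float* input, float* output, float gain, int32_t numSamples)
{
#if defined(_TMS320C6X) // assumption: this symbol identifies the TI C6000 compiler
    #pragma MUST_ITERATE(4, 64, 4)
#endif
    for (int32_t i = 0; i < numSamples; ++i)
    {
        output[i] = input[i] * gain;
    }
}
```

If the promise is broken at run time the behavior of the optimized loop is undefined, so the bounds should come from a guarantee that the plug-in actually has, such as a fixed processing buffer size.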

DigiShell test tool (DSH)

DigiShell is a software tool that provides a general framework for running tests on Avid audio hardware. As a command-line application, DigiShell may be driven as part of a standard, automated test suite for maximum test coverage. DSH supports loading all types of AAX plug-ins including Native and DSP, and is especially useful when running performance and cancellation tests of AAX-TI types. DigiShell is included in Pro Tools Development Builds as dsh.exe (Windows) or as dsh in the CommandLineTools directory (Mac).
More information on the DSH test tool can be found in the DSH Guide.

Hardware Debugging

Requirements

Relocatable ELF DLLs (TI algorithms) can be debugged with some help from the DIDL loader, the TI Shell Manager, and a script called DLLView_Elf_Avid.js.
These are the minimum requirements for hardware debugging for TI plug-ins:
We recommend using Spectrum Digital's XDS510 USB Plus JTAG Emulator, as it is the only one our internal developers have used and tested in-house. Both Spectrum Digital and TI have useful technical reference/installation guides, both of which can be found on the AAX Developer Forum under the 'Development Tools' discussion.

How it works

The ridl ELF loader inside DIDL stores a module and segment list containing the paths of all loaded modules and where their segments are loaded. The TI Shell Manager gets a serialized version of this table and loads it to a block of external memory on the chip at a known location. The DLLView_Elf_Avid.js script queries this memory via the debugger and extracts the paths of the modules and the ELF segment load locations, which it then passes on to the GEL_SymbolAddELFRel scripting console command (new to CCSv4.2). You can also use that command directly at the console.

Connecting a JTAG Emulator

A JTAG-enabled HDX development card includes a "riser" PCB section extending about a centimeter above the production card PCB. This riser includes two JTAG connectors. The two connectors correspond to the two banks of 9 DSPs on the HDX card. Assuming that you are instantiating your plug-in for debugging on the first available DSP, you will want to connect your JTAG emulator to the connector that is closest to the card's user-visible ports. This connector corresponds to the first 9 DSPs on the card.

Linking to TIShell.out

Hardware debugging, as well as several other debugging facilities, requires that the DSP plug-in project is linked to TIShell.out in Code Composer Studio.
To link a plug-in project to TIShell.out, follow these steps:
  1. Open the plug-in project's properties window and navigate to the C/C++ Build > Tool Settings > C6000 Linker > File Search Path properties pane.
  2. Add "TIShell.out" to the "Include library file" (-l) property list.
  3. Under "Add <dir> to library search path" (-i), add the file path of the Pro Tools build you will be using to test the plug-in. This directory should already include the build's TIShell.out file.
  4. Repeat this process for each Configuration of the plug-in project that you will be testing.
  5. Add "[path to AAX SDK root]\\TI" to the project's list of source file include directories

Adding the HDX Target Descriptor File

To add the HDX Target Descriptor File:
  1. In the IDE, go to Window > Preferences, CCS > Debug. Point the "Shared target configuration directory" to /TI/Common in your AAX SDK source tree
  2. In the IDE, go to Window > Show View > Target Configurations.
  3. Click refresh if you don't see the configuration file
  4. Right click Raven_C672x_XDS510_USB.ccxml, and click "Set as Default".

Setting up the DLLView script

Once you have successfully installed the XDS510, you will need to do a little bit of setup in CCS. Before starting this process, verify that you are running CCSv4.2 or later and the C6000 code generation tools v7.4 or later (or 7.0.5 for CCSv4). CCS should recognize the installed emulator and prompt you to download the necessary drivers. Once this is complete, set up your DLLView script.
To set up the DLLView script:
  1. In the IDE, open the Scripting Console under View > Scripting Console
  2. At the Scripting console, type one of the following to load the DLLView script (insert your own source tree path, and make sure to load the version that corresponds to your installed CCS version):
    Code Composer Studio 4: loadJSFile "[PATH TO AAX SDK]/TI/CCSv4/dllView_Elf_Avid.js" true
    Code Composer Studio 5 and later: loadJSFile "[PATH TO AAX SDK]/TI/CCSv5/dllView_Elf_Avid.js" true
You should now see a new menu item under the Scripts menu: "DLLView -Load Pro Tools Plug-In Symbols". This item should load every time CCS starts.

Loading Symbols for Debugging

You will need to get your code loaded and running on the TI before you load symbols. You can do this directly through Pro Tools, or by using our DigiShell test tool. If using the DigiShell test tool, load the DAE dish and then a plug-in via the following commands:
load_dish DAE
Loads the DAE dish
run
Lists available plug-ins with their index and spec
run<index>
Instantiates the <index> plug-in
Use the DLLView script to load symbols for ELF DLLs. After setting up the DLLView script and connecting to the desired chip in the Debug pane, run the "DLLView -Load Pro Tools Plug-In Symbols" script from the Scripts menu in Code Composer Studio.
Note
The chip will need to be Suspended in the debugger in order to load symbols.
To load symbols for debugging:
  1. In CCS, Launch the TI Debugger (Target > Launch TI Debugger)
  2. Connect the debug target to the appropriate chip
  3. Suspend the chip
  4. Run Scripts > DLLView -Load Pro Tools Plug-In Symbols.
    Note
    This script can take a moment to load; look at the Scripting Console to view its progress if you like
    This script may print a warning about TIShell.out not existing. This warning is benign for plug-in debugging since the TIShell symbols are not required in this case.
This will load symbols for all symbol-rich modules running on the chip(s) connected to the debugger. If you load or unload plug-ins after this, you can simply repeat the "DLLView -Load Pro Tools Plug-In Symbols" command, which will synchronize the debugger with the current configuration.
Note
When running a plug-in in Pro Tools, the first DSP chip is reserved for the HDX mixer. Therefore the first available DSP chip for plug-in instantiation is C672x_1. Under DSH, the first available DSP chip is C672x_0.

Breaking on first entry into algorithm

To break on the first entry into the plug-in's processing routine, use the manual single-buffer processing mode in DSH:
piproctrigger manual
run<index>
Attach debugger, suspend the chip, load symbols, set breakpoint, resume
piproctrigger auto

Breaking in the on-chip algorithm initialization callback

It is not currently possible to hit a breakpoint in the optional on-chip algorithm initialization callback for a plug-in. If you need to troubleshoot this callback then you should use tracing to print debug information to a log file.

Tracing

Avid's AAX DSP platforms provide tracing functionality based on Avid's DigiTrace tool.
To enable trace logging for TI plug-ins, use the AAX_TRACE or AAX_TRACE_RELEASE macros defined in AAX_Assert.h. A separate macro, AAX_ASSERT, is also available for conditional tracing. These macros are cross-platform and will function whether the algorithm is running on the TI or on the host.

Tracing requirements

To link your plug-in project to TIShell.out in Code Composer Studio, follow the steps listed in Linking to TIShell.out.

Tracing example

int32_t AAX_CALLBACK
MyExamplePlugIn_AlgorithmInit (SExample_Alg_Context const * inInstance, AAX_EComponentInstanceInitAction inAction)
{
    AAX_TRACE_RELEASE (kAAX_Trace_Priority_Normal,
        "MyExamplePlugIn_AlgorithmInit called for action : %d",
        inAction);
    return 0;
}
Listing 2: Adding trace code on TI

Usage notes

Testing in Pro Tools

The System Usage window

The System Usage window in Pro Tools includes some features specifically targeted at testing DSP plug-ins, and particularly for testing shuffle events. Starting in Pro Tools 10, the System Usage window includes the following test features:

DSP information tooltip

Pro Tools can display additional information for DSP plug-ins using some debug tooltips that are hidden in the plug-in window header and the System Usage window.
The tooltip in the plug-in window header displays information about the particular plug-in instance that is currently shown in the window. To display this tooltip, hold Command-Option-Shift (Mac) or Control-Alt-Shift (Windows) and hover the mouse cursor over the DSP > Native button in the plug-in header.
The tooltip in the System Usage window displays usage information for each DSP chip in the system. You can reveal this tooltip for a particular chip by mousing over the chip's usage meter while holding Command-Option-Shift (Mac) or Control-Alt-Shift (Windows). This tooltip shows the chip's total allocated cycles, internal memory, and external memory.
The information in these tooltips is generally targeted at systems-level debugging, but can prove useful for some plug-in troubleshooting as well.
Figure 1: DSP tooltip in the Pro Tools plug-in window header.
Figure 2: DSP tooltip in the Pro Tools System Usage window.

Common Issues with TI Development

Data structure compatibility

AAX DSP plug-ins use a set of custom data structures to exchange information with the host. In order to preserve a consistent binary interface between the plug-in's host and algorithm, the layout of these structures must be identical on both platforms. Each structure must have the same size when compiled by both the host platform compiler and the TI DSP compiler, and any members that are referenced by both the host code and the DSP code must reside at the same offset within the struct on both platforms.
In order to satisfy this requirement, it is essential that an AAX plug-in's algorithm context structure and any other data structures that are passed between the host and the DSP use appropriate alignment. Data structures are usually aligned to 32-bit boundaries, and both Intel and TI compilers use identical struct alignment and packing for most cases. However, this behavior is not explicitly defined in the C standard.
Furthermore, different compilers may use different sizes for some built-in data types. It is therefore very important to use explicitly-sized types such as int32_t and float rather than ambiguous types such as bool or int. One particularly tricky data type is pointers, which may be compiled as 64-bit values on a 64-bit Intel system but as 32-bit values on the TI DSP.
Here are some specific scenarios when an unexpected difference in alignment or data type size may occur and cause an ABI incompatibility between a plug-in's host and DSP components:

Nested structures

It can be particularly difficult to debug alignment issues in nested data structures. One reason is that nested structs do not necessarily have the same alignment as the parent struct. A nested structure will have the alignment that is set preceding its declaration, not the alignment of the structure in which it is contained.
Aside from avoiding nested structs entirely, one way to avoid potential issues is to make sure that nested structs always contain a double. This will guarantee that the structure is double-word aligned. We have also found that placing nested structs near the beginning of the parent struct results in more consistent alignment between Intel and TI compilers, even in cases where the actual alignment of each member is strictly ambiguous according to the standard.
Another important rule of thumb with nested structs is to define them inline in the enclosing structure. We have found that including one data structure as a member in another data structure will only be reliably aligned between Visual Studio and the TI compiler tools if the member structure's type is defined in-line. This does not appear to be an issue between clang and the TI compiler - the data structure alignment for the nested structure is consistent between those two compilers regardless of the location of the internal structure's definition.
#include AAX_ALIGN_FILE_ALG
struct SomeStruct
{
    float a;
    float b;
};
#include AAX_ALIGN_FILE_RESET
// Somewhere else...
#include AAX_ALIGN_FILE_ALG
class SomeClass
{
public:
    SomeStruct s; // Don't do this! Inconsistent between Visual Studio and TI
    // other stuff...
};
#include AAX_ALIGN_FILE_RESET
Listing 3: Problematic code: nested struct not defined in-line
#include AAX_ALIGN_FILE_ALG
class SomeClass
{
public:
    struct SomeStruct
    {
        float a;
        float b;
    } s; // This is fine - consistent between Visual Studio, clang, and TI
    // other stuff...
};
#include AAX_ALIGN_FILE_RESET
Listing 4: Fixed code: nested struct defined in-line

Usage of pragma pack

If you use pragmas to align your structs, be aware that in most cases #pragma pack can only decrease a struct's alignment below the compiler's natural alignment for that struct; it cannot increase it. That means that if you have
#pragma pack(8)
struct x
{
    char a;
    float b;
};
Listing 5: Example of usage of #pragma pack where it has no effect
then struct x most likely won't be aligned to the 8 byte boundary. Therefore the pack pragma is not really useful for addressing alignment issues. Instead of using pack, one way to guarantee that a structure is double-word aligned, is to include at least one double member.
#pragma pack(8)
struct x
{
    float a;
    double b;
};
Listing 6: Example of usage of #pragma pack where it actually affects the alignment of the structure
In this case data will be double-word aligned.
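One way to confirm this behavior on a given host compiler is to query the alignment directly. The sketch below assumes a C++11 host compiler (for alignof) and uses hypothetical struct and function names; it reports the host compiler's behavior only and says nothing about the result on the TI toolchain, which should be checked separately.

```cpp
#include <cstddef>

#pragma pack(push, 8)
struct SPackNoEffectSketch // hypothetical name; same members as Listing 5
{
    char  a;
    float b;
};

struct SPackWithDoubleSketch // hypothetical name; same members as Listing 6
{
    float  a;
    double b;
};
#pragma pack(pop)

// Return the alignment that the compiler actually chose for each struct
inline std::size_t AlignOfNoEffectSketch()   { return alignof(SPackNoEffectSketch); }
inline std::size_t AlignOfWithDoubleSketch() { return alignof(SPackWithDoubleSketch); }
```

On typical 64-bit desktop compilers the first function is expected to return 4 and the second 8, which matches the description above: the pack value of 8 does not raise the alignment of the first struct, while the double member does raise the alignment of the second.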

Dynamic allocation of memory in structures and algorithm

The problem with dynamic allocation is that it's difficult to enforce specific alignment of the resulting block beyond the natural alignment of the structure. Newly allocated blocks are not double-word aligned by default. This prevents double-word memory access optimizations (see Additional data type optimizations) from working.
// Newly allocated blocks are not aligned to 8-byte boundaries by default. This prevents
// double-word memory access optimizations from working.
float* floatBlock = new float[100];
delete[] floatBlock;
// Though AAX_Alignment.h does include some aligned memory allocators to counteract the alignment
// problem, their use is still strongly discouraged.
float* floatBlock2 = alignMalloc<float>(100, 8);
alignFree(floatBlock2);
Listing 7: Problems which may arise when using dynamic allocation of memory in algorithm
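Where the required block size has a known upper bound, one alternative to dynamic allocation is fixed-size storage inside the structure itself. The following is a minimal sketch under that assumption; the struct, its capacity, and the helper functions are hypothetical and not part of the AAX SDK. It borrows the double-member technique described above, here through an anonymous union, so that the sample buffer starts on a double-word boundary wherever double is 8-byte aligned.

```cpp
#include <stdint.h>

enum { kSketchMaxSamples = 100 }; // hypothetical fixed capacity

struct SDelayLineSketch
{
    union
    {
        double mForceAlignment;             // never read; present only to force 8-byte alignment
        float  mSamples[kSketchMaxSamples]; // fixed-size storage, no allocation needed
    };
    int32_t mWriteIndex;
    int32_t mReserved; // explicit padding so that the struct size is a multiple of 8
};

// Clears the buffer and rewinds the write position
inline void DelayLineSketch_Reset(SDelayLineSketch* state)
{
    for (int32_t i = 0; i < kSketchMaxSamples; ++i)
    {
        state->mSamples[i] = 0.0f;
    }
    state->mWriteIndex = 0;
    state->mReserved = 0;
}

// Writes one sample and returns the sample that it replaced (the oldest one)
inline float DelayLineSketch_Push(SDelayLineSketch* state, float in)
{
    const float oldest = state->mSamples[state->mWriteIndex];
    state->mSamples[state->mWriteIndex] = in;
    int32_t next = state->mWriteIndex + 1;
    if (next >= kSketchMaxSamples)
    {
        next = 0;
    }
    state->mWriteIndex = next;
    return oldest;
}
```

The trade-off is that the capacity is fixed at compile time, so memory is reserved for the largest case even when a smaller block would do.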

Incorrect use of pointer data

In general, you should avoid storing pointers to anything in any data structures that are passed between the host and the DSP. There are many possible problems and bugs that can be caused by this, for example:
One alternative to using raw data pointers is to store data offsets into a coefficient array rather than using direct pointers to other structure elements. A solution such as this that does not involve pointer data types will almost always end up being easier to implement, easier to troubleshoot, and easier to maintain than a solution that uses pointer data.
That said, if you must use pointer data types in any data structures that are passed between the AAX host and DSP components then you should be very careful to avoid the problems listed above.
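The offset-based alternative might look like the following minimal sketch. The struct layout, the band and coefficient counts, and the helper names are all hypothetical; the point is only that the "which coefficients are active" information is stored as an explicitly-sized index, so the structure has the same layout on the host and on the DSP.

```cpp
#include <stdint.h>

enum { kSketchNumBands = 3, kSketchCoefsPerBand = 5 }; // hypothetical sizes

struct SFilterBankSketch
{
    float   mCoefs[kSketchNumBands * kSketchCoefsPerBand];
    int32_t mActiveOffset; // index into mCoefs; 4 bytes on both host and DSP
};

// Host or DSP code selects a band by storing an offset, not an address
inline void FilterBankSketch_SelectBand(SFilterBankSketch* state, int32_t band)
{
    state->mActiveOffset = band * kSketchCoefsPerBand;
}

// The address is formed locally, on whichever processor is executing
inline const float* FilterBankSketch_ActiveCoefs(const SFilterBankSketch* state)
{
    return &state->mCoefs[state->mActiveOffset];
}
```

Because the address is computed from the offset at the point of use, the stored value remains valid even when the structure is copied to a different address space.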

Pointer data size incompatibility

Problems due to pointer data size incompatibility can be particularly difficult to debug. Pointer data types are not explicitly sized in C, and, starting with the 64-bit Pro Tools 11 release, pointers will have different lengths for host and TI binaries. This can cause subtle portability problems in certain circumstances, if proper care is not taken.
Consider the following state block:
struct SMyPlugInStateBlock
{
    float mInGain_Smoothed;
    some_t* mPointerP;
    float mOutGain_Smoothed;
};
Notice the pointer mPointerP (the type that it points to is irrelevant for this discussion). Perhaps it is a pointer that can reference different sets of coefficients, or perhaps it points to some sort of global variable. In any case, this pointer is 64 bits long on the host and 32 bits long on the TI.
In most cases, this won't cause a problem because the host simply allocates a bit more space for the state block than the TI needs and fills the allocated memory with 0s. But consider the case where we overload ResetFieldData() to set mOutGain_Smoothed to something other than 0:
AAX_Result MyPlugIn_Parameters::ResetFieldData (AAX_CFieldIndex inFieldIndex, void * inData, uint32_t inDataSize) const
{
    AAX_Result result;
    switch (inFieldIndex)
    {
        case (eMyAlgFieldIndex_State):
        {
            memset(inData, 0, inDataSize);
            SMyPlugInStateBlock* stateP = static_cast<SMyPlugInStateBlock*>(inData);
            stateP->mOutGain_Smoothed = mOutGain_Target;
            result = AAX_SUCCESS;
            break;
        }
        default:
        {
            result = AAX_CEffectParameters::ResetFieldData(inFieldIndex, inData, inDataSize);
            break;
        }
    }
    return result;
}
We might be doing this if mOutGain_Smoothed was a smoothing parameter and we want to start it at the target gain value (rather than having it smooth from 0.0 at instantiation). But if the host and the TI can't agree on where in the state block mOutGain_Smoothed is located, then the result will be unexpected behavior that is difficult to debug.
The most direct way to avoid this problem is to use an explicitly-sized 32-bit type for any pointers in your state block:
struct SMyPlugInStateBlock
{
    float mInGain_Smoothed;
    uint32_t mPointerP;
    float mOutGain_Smoothed;
};
It will be necessary to use reinterpret_cast<some_t*>(stateP->mPointerP) to convert the stored value back to a pointer type on the TI, but that should not result in any extra processing cycles.
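The effect on the layout can be observed with offsetof. This is a minimal sketch for a host compiler; the struct and function names are hypothetical stand-ins for the two versions of the state block shown above, with void* used in place of some_t*.

```cpp
#include <cstddef>
#include <stdint.h>

// Stand-in for the original state block, which contains a pointer member
struct SStateWithPointerSketch
{
    float mInGain_Smoothed;
    void* mPointerP;
    float mOutGain_Smoothed;
};

// Stand-in for the corrected state block, which uses an explicitly-sized field
struct SStateWithFixedFieldSketch
{
    float    mInGain_Smoothed;
    uint32_t mPointerP;
    float    mOutGain_Smoothed;
};

// Offset of the output gain depends on the pointer size of the compiling platform
inline std::size_t OutGainOffsetWithPointerSketch()
{
    return offsetof(SStateWithPointerSketch, mOutGain_Smoothed);
}

// Offset of the output gain is the same on every platform
inline std::size_t OutGainOffsetWithFixedFieldSketch()
{
    return offsetof(SStateWithFixedFieldSketch, mOutGain_Smoothed);
}
```

On a typical 64-bit host the first function is expected to return 16, while a platform with 4-byte pointers would place the same member at offset 8. The second function returns 8 in both cases, which is what allows the host's ResetFieldData() write to land where the DSP code reads.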

Alignment Reference

These are the data type sizes and default alignments for the TI compiler and for some common host compilers when compiling for 64-bit binary formats. Each entry lists size / alignment in bytes:
Type          TI       MS Visual C++   C++ Builder   GCC
char          1 / 1    1 / 1           1 / 1         1 / 1
short         2 / 2    2 / 2           2 / 2         2 / 2
int           4 / 4    4 / 4           4 / 4         4 / 4
long          4 / 4    4 / 4           8 / 8         8 / 8
long long     8 / 8    8 / 8           8 / 8         8 / 8
bool          1 / 1    1 / 1           1 / 1         1 / 1
float         4 / 4    4 / 4           4 / 4         4 / 4
double        8 / 8    8 / 8           8 / 8         8 / 8
long double   8 / 8    8 / 8           8 / 8         16 / 16
pointer       4 / 4    8 / 8           8 / 8         8 / 8
Also here are some useful links to web resources on the topic:

TI Optimization Guide

Optimizing AAX real-time algorithms for Avid's TI-based platforms is very similar to optimizing real-time algorithms for any architecture. When developers think about optimization, they often think "I want to make my code run faster". In reality, however, optimization is about making the processor do less. After all, the processor's clock rate is fixed, so it can only execute a limited number of instructions in a given amount of time. Therefore, our focus in this section will be on helping the compiler produce code with shorter execution paths and make full use of the TI chip's architecture.
Modern compilers have become extremely powerful at optimizing code, which is fortunate given the complicated architectures of today's DSP products. In this section we will not focus on instruction-level "optimizations" like the one below, which the compiler will perform automatically. Little "tricks" like this do not make code faster; they just make it harder to read:
int y = x;
y = y >> 1; // y = y / 2;
Listing 8: The kind of optimization that you won't be seeing in this section
Rather, we will focus on refactoring audio processing algorithms to be more efficient and on giving the TI compiler better information about the code, pointers, and data it is working with so it can perform more effective compile-time optimizations.
Finally, our optimization efforts will focus on the worst-case code path. For example, developers often try to optimize algorithms by conditionally bypassing portions of code that may be disabled by particular parameter states. This is counter-productive, because the system has to assume a plug-in's worst-case execution performance regardless of how much time the plug-in is actually using. Therefore, in the context of real-time algorithms running on AAX DSP platforms, it is best to only worry about worst-case execution time.
For more information about using TI's toolset to profile your code's performance, see Cycle count performance test.
Note
The optimizations described in this section assume that you are using version 7 or higher of TI's C6000 Code Generation Tools (CGTools). We strongly recommend using v7.0.5 as earlier versions throw linking errors.

Optimization quick start

Here is a quick outline of the general optimization steps for an AAX DSP algorithm:
  1. Before beginning your DSP optimizations, make sure that your Native algorithm has basic optimizations in place. In our experience, beginning the TI optimization process with a slow or needlessly precise Native algorithm will result in a long porting process. Here are some suggestions for common Native optimizations:
    • Identify unnecessary double precision
    • Identify tables that have unnecessarily fine granularity
  2. Make sure your compiler Release settings enable the compiler to optimize fully and give full optimization comments:
    -k -s -pm -op3 -os -o3 -mo -mw --consultant --verbose -mv67p
  3. Use the load/update/store design pattern to reduce memory accesses in inner loops
  4. Move any processing that does not directly depend on the audio signal out of the real-time algorithm
  5. Declare non-changing variables and pointers (both local and in parameter lists) as const
  6. Declare non-aliased pointers (both local variables and function parameters) as AAX_RESTRICT
  7. Change any long variables to int, and change double variables to float if the reduced precision does not affect signal integrity (usually defined as cancellation with the plug-in's Native algorithm.)
  8. Restructure inner processing loops so that they do not contain large conditional statements or other branches
  9. Declare any functions that are called within the innermost processing loop as inline in order to allow the inner loops to pipeline
  10. Add loop count information when known, using #pragma MUST_ITERATE(min,max,quant)
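Steps 5, 6, and 8 above can be sketched together in a small hypothetical gain routine. SKETCH_RESTRICT is a stand-in for the SDK's AAX_RESTRICT macro so that the example compiles without the SDK headers; it assumes a compiler that accepts the __restrict extension. The function name and its bypass parameter are also illustrative only.

```cpp
#include <stdint.h>

// Stand-in for AAX_RESTRICT. Assumption: the compiler accepts __restrict
// (GCC, clang, and Visual C++ do); real plug-in code would use the SDK macro.
#define SKETCH_RESTRICT __restrict

inline void ProcessGainSketch(const float* SKETCH_RESTRICT input,  // read-only and not aliased
                              float* SKETCH_RESTRICT output,       // not aliased
                              const float gain,
                              const int32_t bypass,
                              const int32_t numSamples)
{
    // Resolve the conditional once, before the loop, so that the loop body
    // contains no branches and does the same work in every parameter state.
    const float effectiveGain = (bypass != 0) ? 1.0f : gain;

    for (int32_t i = 0; i < numSamples; ++i)
    {
        output[i] = input[i] * effectiveGain;
    }
}
```

The bypassed and active states execute the same loop, so nothing is gained or lost by the parameter state. That is consistent with optimizing for the worst-case code path rather than conditionally skipping work.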

Compiler and linker options

As with any complex environment, many performance gains on the TI rely on the appropriate compiler and linker options. The options documented here will allow CGTools to apply its optimization logic to your algorithm.
When tweaking compiler options on the TI, keep in mind that, as on any CPU, it is useless to optimize Debug code or to profile its performance. This is especially true on TI processors because the generated Debug and Release assembly are almost completely different, assuming that heavy optimization options were chosen for the Release configuration.
In general, all recommended compiler options should be set correctly in the AAX SDK's example plug-in projects, and these settings may be used as a guide for your own plug-in projects. See the SDK files CommonPlugIn_CompilerCmd.cmd and CommonPlugIn_LinkerCmd.cmd for the latest recommended settings.

Overview of optimization-related compiler options

Overview of optimization-related linker options

Optimization flags (-o)

Like the corresponding Visual Studio options, -o0 and -o1 allow you to step through code line-by-line for debugging, at the cost of reduced performance. -o2 and -o3 sacrifice the ability to step through code and watch memory in favor of optimized code.

Program Mode optimization (-pm)

Program mode optimization gives the compiler further optimization information by compiling all files at once rather than individually. Thus global constants, function implementations, etc. can be made known to the entire program at compilation. This allows the compiler to inline functions more effectively and to determine loop unrolling based on constant loop iterators.
There are a few -pm options:

Compiler options to avoid

The following information was taken from the TMS320C6000 Programmer's Guide:

The load-update-store pattern

The load-update-store pattern is one of the cornerstones of a fast iterative algorithm. This pattern specifies that locally accessed data should be loaded from memory into local variables at the start of processing, updated in those local variables during processing, and stored back to memory after processing has completed. By using this pattern you will move memory reads and writes outside of your plug-in's innermost processing loop, which reduces data dependencies and shortens the critical inner loop.
As an example, consider the following unoptimized filter code:
inline void
ProcessDirectFormII(float* input, float* output, float* state, float* coefs, int nsamp)
{
    // eB0 .. eB2 and eA0, eA1 are just integer enums to partition
    // the filter coefficients into A and B
    for(int i = 0; i < nsamp; ++i)
    {
        output[i] = input[i]*coefs[eB0] + state[0];
        state[0] = input[i]*coefs[eB1] + state[1] - output[i]*coefs[eA0];
        state[1] = input[i]*coefs[eB2] - output[i]*coefs[eA1];
    }
}
Listing 9: Unoptimized filter algorithm
Notice that in this code there are at least 15 memory accesses per loop iteration! This algorithm will be very inefficient as the value of nsamp increases.
The compiler should be able to optimize this algorithm to some extent by pulling certain memory accesses outside of the loop. However, the compiler cannot completely optimize the loop because it must assume that the input/output/state/coefs pointers are aliased in memory. We will discuss the const and restrict keywords later, which are ways to give the compiler additional information it can use to optimize this loop. However, for now let's focus back on the basic design of this code.
Using load-update-store, we can refactor this loop to pull the memory accesses outside of the loop:
void
ProcessDirectFormII(float* input, float* output, float* state, float* coefs, int nsamp)
{
    // eB0 .. eB2 and eA0, eA1 are just integer enums to partition
    // the filter coefficients into A and B
    // ---- LOAD ----
    float coefA0 = coefs[eA0];
    float coefA1 = coefs[eA1];
    float coefB0 = coefs[eB0];
    float coefB1 = coefs[eB1];
    float coefB2 = coefs[eB2];
    float state0 = state[0];
    float state1 = state[1];
    float out; // per-sample result; named "out" so that it does not hide the output buffer pointer
    // ---- UPDATE ----
    for (int i = 0; i < nsamp; ++i)
    {
        out = input[i] * coefB0 + state0;
        state0 = input[i] * coefB1 + state1 - out * coefA0;
        state1 = input[i] * coefB2 - out * coefA1;
        output[i] = out;
    }
    // ---- STORE ----
    state[0] = state0;
    state[1] = state1;
}
Listing 10: Refactored filter algorithm with load-update-store pattern applied. Not fully optimized.
Though the code initially appears longer, you will notice that we have reduced the loop to only 4 memory accesses! Though we have an additional 9 memory accesses outside the loop, they will only occur once per function call, resulting in significant savings at higher values of nsamp.
Note
We are not finished with this loop yet, because we can make some very significant gains by using the restrict and const keywords, as discussed in the section on C keywords.
Before moving on from load-update-store, let's consider how this pattern should be applied to different categories of data that may be provided in an AAX DSP processing context:

Case study: IIR filter implementation on TI 672x DSPs

In this section we will examine various IIR filter implementations as a specific example of the considerations that must be made when optimizing DSP code for the 672x.
The TI 67xx family of DSPs is notably different from some other typical DSP processors, such as the 56k and the Intel FPU, in that the TI DSP does not have an implicit higher-precision multiply-accumulate. It is of course capable of double precision accumulation, but this must be coded explicitly. In some ways, this is similar to the Intel SSE processing unit, which jettisoned the 80-bit floating point stack used in the Intel FPU. The lack of higher precision accumulation in TI (and SSE) can sometimes result in unacceptable quantization noise performance for single precision filter implementations. Luckily, with the right choice of filter structure or coding for explicit double precision accumulation, excellent results can be achieved.
On fixed-point DSPs such as 56k, Direct Form I (DF1) implementation is the standard due to moderately good fixed point scaling properties, decent noise performance, and simple implementation. However, on a 672x DSP a single precision DF1 filter can have terrible noise performance (depending on the filter coefficients and the audio material being processed.) A degenerate case is a DF1 highpass filter processing low frequency material; in DF1, the feedforward coefficients subtract the previous sample from the current sample, and for low frequency material this produces very small numbers with low precision. Single precision DF2 structures also produce similarly poor results in this respect.
One option to improve upon these results is to use double precision throughout the 672x filter implementation. However, this results in a heavy cycle performance penalty due to the high cost of double operations on the TI DSP. Another, often better, option is to use single precision coefficients and state, with double precision accumulation:
float in, b0, b1, a1, state1;
double accum ;
accum = double (b0) * double (in) +
double (b1) * double (state1) +
double (a1) * double (accum);
state1 = in;
Listing 11: Mixed-precision DF1 filter implementation
The TI compiler will implement this using the mpysp2dp instruction, since it knows that the operands started out as single precision and end up as double precision. This is considerably faster than going to a full double precision implementation, but it is still relatively slow compared to straight single precision. Making the state double precision will improve noise performance further, with some increase in cycle usage.
Another option that generally gets good results is the single precision DF2 Transpose (DF2T) filter. On TI the DF2T implementation is fast and generally has good noise performance. If you are looking for a simple recommendation that should work well enough for most applications, DF2T is a good choice.
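As a rough illustration of the DF2T structure mentioned above, the following sketch processes one biquad in single precision using the load-update-store pattern. The function name, the coefficient layout, and the sign convention (denominator 1 + a1 z^-1 + a2 z^-2) are assumptions made for this example; they are not taken from the AAX SDK or from TI's filter library.

```cpp
#include <cstddef>

// Hypothetical helper: single-precision Direct Form II Transposed biquad.
// Assumed transfer function: (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2)
// Assumed layout: coefs = { b0, b1, b2, a1, a2 }, state = { s1, s2 }
void ProcessDF2T(const float* input, float* output, float* state,
                 const float* coefs, std::size_t nsamp)
{
    // ---- LOAD ----
    const float b0 = coefs[0];
    const float b1 = coefs[1];
    const float b2 = coefs[2];
    const float a1 = coefs[3];
    const float a2 = coefs[4];
    float s1 = state[0];
    float s2 = state[1];

    // ---- UPDATE ----
    for (std::size_t i = 0; i < nsamp; ++i)
    {
        const float in  = input[i];
        const float out = b0 * in + s1;   // output uses the first state only
        s1 = b1 * in - a1 * out + s2;     // first state absorbs the second
        s2 = b2 * in - a2 * out;
        output[i] = out;
    }

    // ---- STORE ----
    state[0] = s1;
    state[1] = s2;
}
```

Whether this structure meets the noise requirements of a particular plug-in still has to be checked against real coefficients and program material.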
The optimized C filter library available from TI uses the DF2 structure in its implementation. Even though DF2 has some limitations, this is a good starting point for seeing how to optimize filter code on TI; peak performance on TI is 2.25 cycles per biquad, so it's pretty amazing what can be done (to achieve that level of performance, multiple series or parallel biquads need to be put in a tight loop). We have adapted some of this filter code to DF2T, and still achieved fairly similar cycle performance.
If the single precision DF2T noise performance is not good enough for your application, then either double precision or one of the myriad other filter structures, such as State Space, Gold-Rader, Lattice or Zolzer, should do the job. In fact, there is one relatively new filter structure which we think stands out, called the Direct Wave Form (DWF) filter. Details about this filter structure can be found in Direct Wave Form Digital Filter Structure: an Easy Alternative for the Direct Form by Jean H.F. Ritzerfeld. According to the author, the noise performance is within 3 dB of optimal, and the structure is relatively efficient (5 multiplies per biquad), free of limit cycles, and has simple coefficient generation and low coefficient quantization sensitivity. It might just be the perfect filter structure, but we'll let you be the judge of that; keep in mind that all filter structures have some tradeoffs, and the recommendations made here might not be the best for your particular application.

Understanding CGTools-generated ASM files

The ability to read the ASM files that are generated by CGTools is essential when optimizing a TI algorithm. Specifically, the information in these files will allow you to determine if anything is preventing software pipelining from occurring, which is the single most effective form of optimization on the C6727.
To view your project's ASM file, turn on the -k compiler option ("Keep Generated .asm Files", found under Build Options > Compiler > Assembly in the Code Composer Studio IDE.) By default, ASM files will be placed in the same directory as the corresponding source file.
Note
You should only examine ASM listings of Release code that has been optimized by the compiler. Debug code should not be optimized.
Each ASM file for a TI algorithm callback should contain text that marks the start of the assembly listing for the processing loop. For example:
;**********************************************************************
;* FUNCTION NAME: // [Your algorithm's ProcessProc symbol] ___________*
;*____________________________________________________________________*
;* Regs Modified: A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14, _*
;*________________A15,B0,B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12, _____*
;* _______________B13,SP,A16,A17,A18,A19,A20,A21,A22,A23,A24,A25, ____*
;*________________A26,A27,A28,A29,A30,A31,B16,B17,B18,B19,B20,B21, ___*
;* _______________B22,B23,B24,B25,B26,B27,B28,B29,B30, B31 ___________*
;* Regs Used____: A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14, _*
;* _______________A15,B0,B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12, _____*
;* _______________B13,DP,SP,A16,A17,A18,A19,A20,A21,A22,A23,A24, _____*
;* _______________A25,A26,A27,A28,A29,A30,A31,B16,B17,B18,B19,B20, ___*
;* _______________B21,B22,B23,B24,B25,B26,B27,B28,B29,B30,B31 ________*
;* Local Frame Size: 0 Args + 148 Auto + 44 Save = 192 byte __________*
;**********************************************************************
Listing 12: CGTools-generated header for a processing loop assembly listing
Within this listing, you are looking for several things:
  1. Function calls
  2. Branches or control code
  3. Software pipelining notes

Function calls

[!B0] CALL .S1 __divd ; |213|
|| [!B0] MVKH .S2 0x40080000 ,B5 ; |213|
|| [ B0] MV .L1X B10 ,A4 ; |213|
$C$RL9 : ; CALL OCCURS {__divd} ; |213|
Listing 13: Function call in a CGTools-generated assembly listing
Function calls, such as the call in the listing above, cannot be effectively pipelined. If you find a function call, figure out which C statement causes it. Sometimes a function call will be made implicitly, such as when casting from float to int or when doing division. All function calls should be removed from the processing loop or inlined in order for the compiler to optimize effectively.
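One common way to remove an implicit division call from a loop is to compute the reciprocal once, outside the loop, and multiply by it inside. The sketch below shows the idea; the function names are hypothetical, and the two versions can differ in the last bit of the result when the reciprocal is not exactly representable.

```cpp
#include <cstddef>

// Hypothetical example: normalize a buffer by a constant divisor.
// The division in the loop body may be compiled to a run-time support call.
void NormalizeWithDivide(const float* in, float* out, float divisor, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = in[i] / divisor;
    }
}

// The reciprocal is computed once, so the loop body contains only a multiply.
void NormalizeWithReciprocal(const float* in, float* out, float divisor, std::size_t n)
{
    const float recip = 1.0f / divisor;
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = in[i] * recip;
    }
}
```

After a change like this, the generated ASM file should be checked again to confirm that the call has disappeared from the loop.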

Branches

NOP 1
B .S1 $C$L5 ; |213|
NOP 4
MPYDP .M1X A5:A4 ,B5:B4 ,A11:A10 ; |213|
|| LDW .D2T2 *+ SP (124) ,B5 ; |218|
; BRANCH OCCURS { $C$L5 } ; |213|
Listing 14: Branch in a CGTools-generated assembly listing
Branches can also prevent loop pipelining. If you find a branch in your algorithm's assembly, determine whether it is preventing the compiler from pipelining a loop. If it is preventing pipelining, you must figure out how to rewrite the conditional in your C code so that it will not be compiled into a branch.

Software pipelining notes

For each loop the compiler finds and is able to pipeline, the .ASM file should contain a section similar to the one below:
 
;*--------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 68
;* Loop opening brace source line : 69
;* Loop closing brace source line : 124
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound (^) : 15
;* Unpartitioned Resource Bound : 20
;* Partitioned Resource Bound (*) : 20
;* Resource Partition :
;* A- side B- side
;* .L units 0 0
;* .S units 0 1
;* .D units 20* 20*
;* .M units 7 5
;* .X cross paths 5 6
;* .T address paths 20* 20*
;* Long read paths 5 1
;* Long write paths 0 0
;* Logical ops (. LS) 5 4 (.L or .S unit )
;* Addition ops (. LSD) 0 1 (.L or .S or .D unit )
;* Bound (.L .S .LS) 3 3
;* Bound (.L .S .D .LS .LSD) 9 9
;*
;* Searching for software pipeline schedule at ...
;* ii = 20 Schedule found with 3 iterations in parallel
Listing 15: Pipelined loop header in a CGTools-generated assembly listing
These are the important items to note in this listing:
If a loop section instead displays Disqualified loop: then some of the conditions required to enable software pipelining have not been met:
For more information about pipelining and loop/branch optimization, see Refactoring conditionals and branches.

C keywords

There are a few keywords in C that give the compiler additional information about the variables you declare and parameters you pass into functions. This allows the compiler to further optimize the code it is compiling, which can result in significant performance gains.

const

Effective use of const lets the compiler know whether pointers, scalars, or objects will remain constant in memory.
Let's add the const keyword to the filter function from our example of The load-update-store pattern.
void
ProcessDirectFormII(
    const float * const input,   // read-only
    float * const output,        // read-write
    float * const state,         // read-write
    const float * const coefs,   // read-only
    int nsamp)
{
    // eB0 .. eB2 and eA0, eA1 are just integer enums to partition
    // the filter coefficients into A and B
    // ---- LOAD ----
    const float coefA0 = coefs[eA0];
    const float coefA1 = coefs[eA1];
    const float coefB0 = coefs[eB0];
    const float coefB1 = coefs[eB1];
    const float coefB2 = coefs[eB2];
    float state0 = state[0];
    float state1 = state[1];
    // ---- UPDATE ----
    for (int i = 0; i < nsamp; ++i)
    {
        // Named "out" so that it does not hide the output buffer pointer
        const float out = input[i] * coefB0 + state0;
        state0 = input[i] * coefB1 + state1 - out * coefA0;
        state1 = input[i] * coefB2 - out * coefA1;
        output[i] = out;
    }
    // ---- STORE ----
    state[0] = state0;
    state[1] = state1;
}
Listing 16: Refactored filter algorithm with load-update-store pattern and const keyword applied.
It is especially important to note that the declaration of the per-sample result, const float out, was moved inside the loop. Why did we do this? Because out is constant over an iteration of the loop, but it does change between iterations. By declaring it const inside the loop body we remove the data dependency that existed on this variable and allow the loop to optimize more effectively.
As demonstrated by this change, const is useful for manually breaking dependencies in DSP code. Variable re-use introduces unnecessary data dependencies in code, which can be avoided by using individual local const variables.

restrict

The restrict keyword tells the compiler that a specific pointer is not aliased, meaning that none of the memory locations accessed by the pointer are read or written to by any other variable within its local scope. This keyword is very important when optimizing TI code that involves pointers, as all AAX algorithms do due to the nature of the algorithm context structure.
restrict was introduced with the C99 standard. AAX plug-ins use the AAX_RESTRICT keyword, which is a cross-platform macro for the C99 standard restrict.
Note
Now that MSVC has added C99 support to its compiler, AAX_RESTRICT will eventually be deprecated in favor of the restrict keyword.
The following example demonstrates the use of restrict in our filter code.
void
ProcessDirectFormII(
    const float * const AAX_RESTRICT input,
    float * const AAX_RESTRICT output,
    float * const AAX_RESTRICT state,
    const float * const AAX_RESTRICT coefs,
    int nsamp)
{
    // eB0 .. eB2 and eA0, eA1 are just integer enums to partition
    // the filter coefficients into A and B
    // ---- LOAD ----
    const float coefA0 = coefs[eA0];
    const float coefA1 = coefs[eA1];
    const float coefB0 = coefs[eB0];
    const float coefB1 = coefs[eB1];
    const float coefB2 = coefs[eB2];
    float state0 = state[0];
    float state1 = state[1];
    // ---- UPDATE ----
    for (int i = 0; i < nsamp; ++i)
    {
        // Named "out" so that it does not hide the output buffer pointer
        const float out = input[i] * coefB0 + state0;
        state0 = input[i] * coefB1 + state1 - out * coefA0;
        state1 = input[i] * coefB2 - out * coefA1;
        output[i] = out;
    }
    // ---- STORE ----
    state[0] = state0;
    state[1] = state1;
}
Listing 17: Refactored filter algorithm with load-update-store pattern and const and restrict keywords applied.
Note
  • This example applies restrict to the algorithm's input and output audio buffer pointers. These pointers do not alias each other in most algorithms, but this may not be the case for all algorithms and should be verified by the developer before applying restrict.

  • The restrict keyword is somewhat redundant when used with the load-update-store pattern. This is because by asserting to the compiler that the pointers are not aliased, it should be able to partially do the load-update-store refactoring automatically. However, because some compilers have limited or no support for the restrict keyword, using the load-update-store pattern is still recommended.
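For code that must also build with compilers lacking the C99 keyword, a project-local macro can stand in for it. The sketch below is only an illustration of that idea: EXAMPLE_RESTRICT and ScaleBuffer are hypothetical names, and this is not how the SDK defines AAX_RESTRICT. It assumes the `__restrict` extension offered by MSVC, GCC and Clang, and expands to nothing elsewhere.

```cpp
#include <cstddef>

// EXAMPLE_RESTRICT is a hypothetical macro for this sketch only; it is not
// the SDK's AAX_RESTRICT definition.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
    #define EXAMPLE_RESTRICT __restrict
#else
    #define EXAMPLE_RESTRICT
#endif

// The caller promises that input and output never overlap.
void ScaleBuffer(const float* EXAMPLE_RESTRICT input,
                 float* EXAMPLE_RESTRICT output,
                 float gain,
                 std::size_t nsamp)
{
    for (std::size_t i = 0; i < nsamp; ++i)
    {
        output[i] = input[i] * gain;
    }
}
```

Because the qualifier is a promise rather than a check, calling such a function with overlapping buffers gives undefined behavior.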

Keywords to avoid

There are some keywords which do more harm than good, but are still being used either due to legacy code or developer superstitions. These keywords should not be used in AAX plug-ins.

Data types

The TI C672x+ is a 32-bit floating point DSP platform, and has a few peculiarities that you should be aware of.

Unintended data type conversions

When developing for the TI platform it is important to keep an eye out for unintended type conversions, and especially for implicit double-precision instructions. The following points are helpful for both program efficiency and for future maintenance of the code, since they clarify the developer's understanding of how the code should operate, e.g. by specifying that a cast is occurring, and make it obvious that steps such as data type conversions are an intentional part of the algorithm.
To help ensure that you are not violating these principles, always be aware of any warnings generated by the compiler. In particular, do not ignore warnings related to "implicit conversion from 'double' to 'float'" or "implicit conversion from 'double' to 'int'"; these warnings may indicate that you are declaring a double when a float would be just as good.
In the final stages of optimization, examine the generated assembly code to make sure there are no unintended double-precision instructions or memory accesses.
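A frequent source of unintended double-precision arithmetic is a floating-point literal written without the f suffix. The short sketch below uses hypothetical function names to show the difference; the static_assert lines only confirm the expression types that the C++ language rules produce.

```cpp
#include <type_traits>

// An unsuffixed literal has type double, so float * double is evaluated in
// double precision and then narrowed back to float on return.
inline float HalfGainImplicitDouble(float x) { return x * 0.5; }

// With the f suffix the whole expression stays in single precision.
inline float HalfGainSingle(float x) { return x * 0.5f; }

static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "an unsuffixed literal promotes the expression to double");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "a suffixed literal keeps the expression in float");
```

Both functions return the same value here, which is why this kind of conversion is easy to miss without reading the compiler warnings or the generated assembly.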

Additional data type optimizations

The AAX SDK includes cross-platform macros that can be used to convert two single-precision float loads to one double-precision load. The coefficient smoothing case study below includes an example use case for these macros.
const float * pTable = &SmoothCoefTable[address];
float firstCoef = AAX_LO(*pTable);
float secondCoef = AAX_HI(*pTable);
Listing 18: Example of using AAX macros for converting two float loads to one double load.
In this example the AAX_ALIGNMENT_HINT macro checks whether data is aligned on an 8-byte boundary, then the double word is loaded, and finally the AAX_LO and AAX_HI macros get the double word's first and second (float) parts.
If SmoothCoefTable consists of floats and is 8-byte aligned, then this scenario will work fine for loads when address is even. This raises the question of how to load a double word from &SmoothCoefTable[address] when address is odd. Since this kind of optimization is most useful for loading data from external memory, where the CPU savings of a single double-word load versus two 32-bit loads is greatest, one trick that can help is to trade off memory (as external memory is plentiful) for performance. Specifically, SmoothCoefTable can be organized in such a way that for every member of this table, except the first and the last ones, there will be two consecutive entries.
const int32_t size = 4;
// instead of this classic variant...
const float SmoothCoefTable[size] = {
    -0.1f, -0.2f, -0.3f, -0.4f
};
// ...table can be organized this way
const float SmoothCoefTable[size * 2] = {
    -0.1f, -0.2f,
    -0.2f, -0.3f,
    -0.3f, -0.4f,
    -0.4f, 0.0f /* last member is dummy */
};
Listing 19: Example of restructuring the table so that it can be easily used in the optimization scenario given above.
In this case the number of loads will be halved at the cost of doubling the size of the table. If the table is located in external memory then the additional memory requirement can be an excellent trade-off for the performance gained.
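A paired table of this kind can be generated from the classic table rather than typed by hand. The following sketch assumes the table is built on the host side; BuildPairedTable is a hypothetical helper, and the 8-byte alignment of the final table in DSP memory is outside the scope of this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: expands { t0, t1, ..., tN-1 } into
// { t0, t1,  t1, t2,  ...,  tN-1, pad } so that the pair for entry k
// always starts at the even index 2 * k.
std::vector<float> BuildPairedTable(const std::vector<float>& table, float pad = 0.0f)
{
    std::vector<float> paired;
    paired.reserve(table.size() * 2);
    for (std::size_t k = 0; k < table.size(); ++k)
    {
        paired.push_back(table[k]);
        // The final entry has no successor, so it is padded with a dummy value.
        paired.push_back(k + 1 < table.size() ? table[k + 1] : pad);
    }
    return paired;
}
```

With this layout, the lookup for logical index k reads the two floats starting at index 2 * k, which is always an even offset.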

Case study: Efficient parameter smoothing at single and double precision

Coefficient smoothing ("de-zippering") can often be one of the most difficult parts of a plug-in to optimize for real-time operation. This is especially true in cases when full double-precision smoothing filters have been used in a plug-in's Native code, with the possibility of very small coefficients. In these cases it can be difficult to optimize the smoothing code while also satisfying requirements for audio data parity between the plug-in's Native and DSP configurations.
 
double * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip[ch][0];
const double * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf[0];
// Double-precision
for (int i = 0; i < eNumBiquads * eNumCoefs; ++i)
{
    double dz = deZipper[i];
    dz += zeroCoef * (coefs[i] - deZipper[i]);
    deZipper[i] = dz; // store the smoothed value back to the de-zipper state
}
Listing 20: Example of double-precision smoothing.
In this section we will describe three specific approaches that may be taken to perform optimized real-time smoothing without compromising sound quality.

Method 1: Clamped single-precision smoothing

The simplest approach for optimization of a double-precision smoothing filter is to replace it with modified single-precision smoothing. Unfortunately, we have found that this approach can lead to glitches and instability at higher sample rates when adjusting controls due to transient inaccuracies in the smoothing.
double * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip [ch ][0];
const double * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf [0];
// Method 1 - single - precision
for (int i = 0; i < eNumBiquads * eNumCoefs ; ++i)
{
float dz = deZipper [i];
dz += zeroCoef * ( coefs [i] - deZipper [i]);
// If the de -zip step is so small that the coefficient doesn't change then clamp
// the value to the target to ensure we are using exactly the desired value .
deZipper [i] = (dz == deZipper [i]) ? coefs [i] : dz;
}
Listing 21: Example of clamped single-precision smoothing.

Method 2: Mixed-precision smoothing

To resolve the stability issues at high sample rates, the state may be accumulated at double-precision. This results in mixed-precision operations that are much faster on TI DSPs than full double-precision calculations, though still slower than single-precision.
float * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip [ch ][0];
double * const AAX_RESTRICT deZipState = dzCoefsP->mDZState [ch][0];
const float * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf [0];
// Method 2 - partial double precision
# pragma UNROLL ( CBiquad::eNumCoefs )
for(int i = 0; i < eNumBiquads * eNumCoefs ; i ++)
{
double dz = deZipState [i];
dz += zeroCoef * ((coefs [i]) - ( deZipper [i]));
deZipState [i] = dz;
deZipper [i] = float (dz);
}
Listing 22: Example of mixed-precision smoothing.
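To see why the precision of the accumulated state matters, the sketch below runs the same one-pole smoother with a float state and with a double state. It illustrates one failure mode only (the single-precision state stalling short of its target once the step is smaller than the state's resolution) and is not a reproduction of the glitches described above. SmoothTowards and its parameters are hypothetical names for this example.

```cpp
// Hypothetical one-pole smoother run for a fixed number of steps.
// StateT selects the precision of the accumulated state; the difference
// term is always formed in single precision, as in the mixed-precision
// listing above.
template <typename StateT>
float SmoothTowards(float target, float coef, int steps)
{
    StateT state = 0;
    for (int i = 0; i < steps; ++i)
    {
        const float diff = target - static_cast<float>(state);
        state += coef * diff;   // accumulated at the precision of StateT
    }
    return static_cast<float>(state);
}
```

With a target of 1.0f and a coefficient of 1e-4f, the float state stops moving while it is still a few 1e-4 away from the target, whereas the double state keeps converging until the single-precision output is within rounding distance of the target. The clamp in Method 1 addresses this same stall by snapping to the target value.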

Method 3: Loop unrolling and double-word memory accesses

Further performance gains can be made by unrolling the loop and using double word memory accesses. This code is faster, but is still not as fast as full single-precision.
float * const AAX_RESTRICT deZipper = dzCoefsP->mDeZip [ch][0];
double * const AAX_RESTRICT deZipState = dzCoefsP->mDZState [ch][0];
const float * AAX_RESTRICT coefs = myCoefsP->mBiqCoefsBuf [0];
// Method 3 - partial double precision - unrolled with double-precision memory accesses
for(int i = 0; i < (eNumBiquads * eNumCoefs); i +=2 )
{
double dz0 = deZipState [i];
double dz1 = deZipState [i+1];
dz0 += zeroCoef * (AAX_LO ( coefs [i]) - AAX_LO ( deZipper [i]));
dz1 += zeroCoef * ( AAX_HI ( coefs [i]) - AAX_HI ( deZipper [i]));
deZipState [i] = dz0;
deZipper [i] = float (dz0);
deZipState [i+1] = dz1;
deZipper [i+1] = float (dz1);
}
Listing 23: Example of loop unrolling and double-precision memory accesses for smoothing optimization.

Coefficient smoothing example summary

Refactoring conditionals and branches

Note
For more detailed information on how to reduce or eliminate the use of branches in algorithms, see section 5.2 of the Hand-Tuning Loops and Control Code on the TMS320C6000 guide provided by TI.
An important technique in refactoring algorithms to enhance loop performance is to reduce or eliminate conditionals and branches in code. The TI compiler focuses a lot of its optimization energy on keeping its pipeline full inside of loops. However, it cannot pipeline a loop if one of the following is true:
To demonstrate this, we will again begin with an unoptimized example:
for (int i = 0; i < numSamples; ++i)
{
    if (!bypass)
    {
        const float filtOutput1 = input[i] * coef0 + state0 * coef1;
        const float filtOutput2 = filtOutput1 * coef2 + state1 * coef3;
        output[i] = filtOutput2;
    }
    else
    {
        output[i] = input[i];
    }
}
Listing 24: Another unoptimized filter algorithm.
Though trivial, this example illustrates the problem with conditionals inside of loops. In TI assembly, conditional code usually translates into code branches, which prevent loops from pipelining effectively (see Understanding CGTools-generated ASM files). Let's refactor the loop in our example to reduce the size of its conditional branch:
for (int i = 0; i < numSamples; ++i)
{
    const float filtOutput1 = input[i] * coef0 + state0 * coef1;
    const float filtOutput2 = filtOutput1 * coef2 + state1 * coef3;
    output[i] = filtOutput2;
    if (bypass)
    {
        output[i] = input[i];
    }
}
Listing 25: Filter algorithm with a refactored conditional branch.
At first, it may seem wasteful to perform the filter calculation if bypass will simply throw away the result. In reality, however, the opposite is true: as a real-time algorithm, this code is constrained by its maximum, worst-case cycle count. It is important to understand this point: essentially, the cycle count of the plug-in is always its worst-case performance.
By reducing the algorithm's maximum cycle count we are therefore reducing waste, even though we are increasing the plug-in's cycle count when it is bypassed. In fact, the ideal scenario for most algorithms is to use only one code path (and, consequently, a single deterministic cycle count) despite the fact that this can result in worse performance for some specific states. To state this fundamental principle in a different way:
The performance of specific states in an AAX DSP algorithm is not relevant if there is another possible state with worse performance.
Going back to our optimized example, you may also notice that the conditional still exists. Doesn't this create a branch in the assembly code as well and prevent pipelining?
In the case of very brief conditionals such as this, the answer is usually no. On TI processors, most instructions can be executed conditionally, depending on the value of a control register. Thus, the single assignment (output = input) inside this conditional will reduce to a few conditional instructions without having to execute a branch. As a result, the TI compiler will be able to efficiently pipeline this loop.
That said, it is occasionally necessary to eliminate conditionals entirely. One effective solution for these situations is to execute the branched logic algorithmically rather than conditionally. To demonstrate this approach, here is our filter example again, this time with the conditional completely eliminated from the loop:
for (int i = 0; i < numSamples; ++i)
{
    const float filtOutput1 = input[i] * coef0 + state0 * coef1;
    const float filtOutput2 = filtOutput1 * coef2 + state1 * coef3;
    output[i] = (!bypass) * filtOutput2 + bypass * input[i];
}
Listing 26: Filter algorithm with branching logic executed algorithmically.
This code is shorter and completely eliminates the conditional from inside the loop body. However, there is an associated cost in readability, in that it is not initially obvious how exactly bypass affects the output. This is of course a tradeoff that you will need to consider on a case-by-case basis. In general, we encourage you to consider this technique only when you have verified in the assembly code that simply reducing the size of the conditional is not enough to achieve effective instruction pipelining.
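The equivalence between the conditional and the arithmetic forms can be checked in isolation. The sketch below separates the selection step from the filter itself; the function names are hypothetical, and bypass is assumed to be exactly 0 or 1 in the arithmetic version.

```cpp
#include <cstddef>

// Selection with a conditional inside the loop.
void SelectWithBranch(const float* filtered, const float* input, float* output,
                      bool bypass, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (bypass) { output[i] = input[i]; }
        else        { output[i] = filtered[i]; }
    }
}

// Selection executed algorithmically. bypass must be 0 or 1.
void SelectArithmetic(const float* filtered, const float* input, float* output,
                      int bypass, std::size_t n)
{
    // The two gains are loop invariant, so they are computed once.
    const float wetGain = static_cast<float>(!bypass);
    const float dryGain = static_cast<float>(bypass);
    for (std::size_t i = 0; i < n; ++i)
    {
        output[i] = wetGain * filtered[i] + dryGain * input[i];
    }
}
```

One caveat of the arithmetic form: if the discarded value is infinite or NaN, multiplying it by zero yields NaN, which then reaches the output even when bypass is set. The conditional form does not have this behavior.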
Another useful technique for optimizing loops is to use pragma MUST_ITERATE and pragma PROB_ITERATE (see more about these pragmas in Loop controls), which help the compiler guess the number of iterations for the loop. It is extremely useful when you know the exact number of the iterations, and this number never changes during plug-in processing. For example, this is applicable for the loops which iterate through the audio samples in the input and output buffers. The number of input samples is always constant for an AAX DSP plug-in algorithm; the buffer length must be described with the option AAX_eProperty_DSP_AudioBufferLength for each DSP component in the plug-in's description.
The following code example shows an algorithm processing function template. For convenience, this function template takes the audio buffer length as a template parameter:
template<int kAudioWindowSize>
void
Example_AlgorithmProcessFunction( SExample_Alg_Context * const inInstancesBegin [], const void * inInstancesEnd)
{
    for (SExample_Alg_Context * const * walk = inInstancesBegin; walk != inInstancesEnd; ++walk)
    {
        SExample_Alg_Context* const AAX_RESTRICT contextP = *walk;
        const float * const AAX_RESTRICT inputP = contextP->mInputPP;
        float * const AAX_RESTRICT outputP = contextP->mOutputPP;
        // min, max and multiple are all equal because the trip count is fixed
        #pragma MUST_ITERATE( kAudioWindowSize, kAudioWindowSize, kAudioWindowSize )
        for (int32_t i = 0; i < kAudioWindowSize; ++i)
        {
            outputP[i] = inputP[i];
        }
    }
}
Listing 27: Optimizing loop using pragma MUST_ITERATE.
Note that the audio buffer length property takes an AAX_EAudioBufferLengthDSP value. The values of this enum are set to the power-of-two exponent of each buffer length, so in this case the kAudioWindowSize value would be set to match 1 << (the value given for AAX_eProperty_DSP_AudioBufferLength) when compiling this algorithm callback into the TI DLL.
The same optimization can be used for the loops that iterate through input/output channels, as demonstrated by the DemoDist example plug-in.
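When the same source is also compiled for Native targets, the TI-specific pragma can be guarded so that other compilers never see it. The sketch below assumes that the TI C6000 compiler predefines `_TMS320C6X`; ApplyGain is a hypothetical function used only to show the pattern.

```cpp
// Hypothetical gain stage templated on the fixed audio buffer length.
template <int kAudioWindowSize>
void ApplyGain(const float* inputP, float* outputP, float gain)
{
    static_assert(kAudioWindowSize > 0, "buffer length must be positive");

#if defined(_TMS320C6X)
    // Assumption: _TMS320C6X is predefined by the TI C6000 compiler.
    // The trip count is fixed, so min, max and multiple are all equal.
    #pragma MUST_ITERATE(kAudioWindowSize, kAudioWindowSize, kAudioWindowSize)
#endif
    for (int i = 0; i < kAudioWindowSize; ++i)
    {
        outputP[i] = inputP[i] * gain;
    }
}
```

The template would be instantiated with the plug-in's declared buffer length, for example `ApplyGain<16>(inputP, outputP, gain)`.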

Case study: pipeline refactoring in Avid's EQ3 and Dyn3 plug-ins

While optimizing the "stock" Pro Tools equalization and dynamics processors we came across many real-world optimization scenarios that will be applicable to a broad variety of plug-ins. In this section we will consider specific techniques that we used to enable software pipelining of these algorithms by the TI compiler, including an in-depth look at the pseudo-speculative execution approach used in our Dyn3 plug-in's polynomial gain calculation loop.

Move individual processing operations into separate loops

Oftentimes a sample-by-sample iterative loop that is not software pipelining can be broken up into individual loops that incrementally apply changes to the audio buffer. These smaller loops have a much better chance of being successfully pipelined by the compiler. In EQ3, moving our biquad audio processing stages to dedicated loops that do not include coefficient smoothing or other tasks resulted in large performance gains.
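The general shape of this refactoring can be sketched with a smoothed gain stage. This is not EQ3 code; the function names, the scratch buffer, and the one-pole smoother are assumptions chosen to keep the example small.

```cpp
#include <cstddef>

// Combined: smoothing and audio processing share one loop body.
void ProcessCombined(const float* in, float* out, float* gainState,
                     float target, float coef, std::size_t n)
{
    float g = *gainState;
    for (std::size_t i = 0; i < n; ++i)
    {
        g += coef * (target - g);
        out[i] = in[i] * g;
    }
    *gainState = g;
}

// Split: the first loop writes the smoothed gain ramp to a scratch buffer,
// and the second loop applies it with no loop-carried dependency.
void ProcessSplit(const float* in, float* out, float* gainState,
                  float target, float coef, float* scratch, std::size_t n)
{
    float g = *gainState;
    for (std::size_t i = 0; i < n; ++i)
    {
        g += coef * (target - g);
        scratch[i] = g;
    }
    *gainState = g;

    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = in[i] * scratch[i];
    }
}
```

The split version costs an extra buffer and extra memory traffic, so the generated pipeline information should be compared for both forms before committing to one.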

Avoid pipeline dependencies

The goal of the above optimization is to allow the compiler to successfully pipeline each iterative loop. However, even a pipelined loop may be optimized further. One of the best ways of optimizing loops is to keep the processor busy while pipeline dependencies are cleared.
For example, in EQ3 we found that it was better to perform the plug-in's input and output meter calculations in the same loop rather than separating them out into individual loops. This is because each meter calculation has a dependency on its previous value, which puts a dependency in the pipeline. Doing both at the same time gives the processor more to do while waiting for the next value. In Dyn3 we had similar results merging table lookup, attack, and release loops into a single iterative loop. As long as the loop is still successfully pipelined by the compiler, these "larger" loops tended to have much better performance due to the reduction in blocking dependencies.
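A minimal sketch of the combined-meter idea follows. It uses simple peak-hold meters rather than the actual EQ3 metering, and the Meters struct and UpdateMeters function are hypothetical names for this example.

```cpp
#include <cmath>
#include <cstddef>

struct Meters
{
    float inPeak;
    float outPeak;
};

// Both meters are updated in the same loop. Each running peak depends on its
// own previous value, but the two dependency chains are independent of each
// other, so work on one can overlap the wait on the other.
Meters UpdateMeters(const float* in, const float* out, std::size_t n, Meters m)
{
    float inPeak = m.inPeak;
    float outPeak = m.outPeak;
    for (std::size_t i = 0; i < n; ++i)
    {
        const float a = std::fabs(in[i]);
        const float b = std::fabs(out[i]);
        inPeak = (a > inPeak) ? a : inPeak;
        outPeak = (b > outPeak) ? b : outPeak;
    }
    return Meters{inPeak, outPeak};
}
```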

Detailed example of loop optimization in Dyn3

At this point it will be helpful to go into greater detail about our optimizations for Dyn3's polynomial gain calculation loop, because the increase in performance was quite large and is fairly representative of other algorithms. The unoptimized code took 43 cycles to execute one iteration of the loop. After rearranging the code it now takes 6 cycles. The basic problem was numerous pipeline dependencies: the Loop Carried Dependency Bound was 42 cycles, yet the Partitioned Resource Bound was 4 cycles. In other words, if all of these dependencies were removed the loop could potentially execute in 4 cycles.
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 199
;* Loop opening brace source line : 200
;* Loop closing brace source line : 213
;* Known Minimum Trip Count : 4
;* Loop Carried Dependency Bound (^) : 42
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound (*) : 4
;*
;* Searching for software pipeline schedule at ...
;* ii = 42 Did not find schedule
;* ii = 43 Schedule found with 1 iterations in parallel
;* Done
for (int i = 0; i < kAudioWindowSize; i++) // cSmoothingBlockSize
{
    const float* smoothCoeffs = stateP->mSmoothedPoly;
    float logEnv = logEnvArray[i]; // logEnvArray[fIdx + i];
    logEnv -= smoothThrLow;
    if (logEnv >= 0.0f) // In the knee
        smoothCoeffs += eCpdPolyOrder;
    if (logEnv >= 0.0f) // In the knee
        logEnv -= smoothThrLowDelta;
    if (logEnv >= 0.0f) // In the linear GR stage
        smoothCoeffs += eCpdPolyOrder;
    const float filteredLogEnv = smoothCoeffs[eCpdPolyCoeffsC] +
        logEnv * (smoothCoeffs[eCpdPolyCoeffsB] +
                  smoothCoeffs[eCpdPolyCoeffsA] * logEnv);
    filtLogEnvArray[i] = filteredLogEnv + smoothedMakeupGain;
}
Listing 28: Dyn3's unoptimized polynomial gain calculation loop and asm listing.
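The bottom line is that there is no way this loop can pipeline well: each conditional tests a logEnv value that the preceding statement may have just modified, and the polynomial cannot be evaluated until both smoothCoeffs and logEnv have reached their final values. In contrast, here is the optimized code and listing file output once these dependencies have been removed: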
;* Loop opening brace source line : 167
;* Loop closing brace source line : 179
;* Known Minimum Trip Count : 4
;* Loop Carried Dependency Bound (^) : 1
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound (*) : 4
;* ii = 6 Schedule found with 5 iterations in parallel
for (int i = 0; i < cProcessingBlockSize; i++)
{
    float logEnv = logEnvArray[i];
    float logEnvThrHi = logEnv - smoothThrHigh;
    const float gainSlope = smoothThrSlope +
        logEnv * smoothSlope;
    const float gainKnee = smoothKneeC +
        logEnvThrHi * (smoothKneeB +
                       smoothKneeA * logEnvThrHi);
    const bool bKnee = (logEnv > smoothThrLow);
    const bool bSlope = (logEnv > smoothThrHigh);
    float filteredLogEnv = bKnee ? gainKnee : 0.0f;
    filteredLogEnv = bSlope ? gainSlope : filteredLogEnv;
    filtLogEnvArray[i] = filteredLogEnv;
}
Listing 29: Dyn3's optimized polynomial gain calculation loop and asm listing
In this case gainSlope depends only on the load of logEnv, so its calculation can begin almost immediately. gainKnee must wait for logEnvThrHi, but gainSlope can be calculated during that time. bKnee and bSlope also depend only on logEnv, and can start right away. The main dependency is filteredLogEnv, which depends first on bKnee and gainKnee and then on bSlope and gainSlope. This is a far shorter dependency chain than in the original loop. Here is another version which runs in exactly the same number of cycles. (In fact, under the hood it may be creating the same asm code; we have not compared instruction-by-instruction.)
for (int i = 0; i < kAudioWindowSize; i++)
{
    float logEnv = logEnvArray[i];
    float logEnvThrHi = logEnv - smoothThrHigh;
    const bool bKnee = (logEnv > thrLow);
    const bool bSlope = (logEnv > thrHigh);
    float filteredLogEnv = bKnee ?
        kneeC + logEnvThrHi * (kneeB + kneeA * logEnvThrHi) :
        0.0f;
    filteredLogEnv = bSlope ?
        thrSlope + logEnv * slope :
        filteredLogEnv;
    filtLogEnvArray[i] = filteredLogEnv;
}
Listing 30: An alternative optimization for Dyn3's polynomial gain calculation loop.

But what about Native?

You might expect this altered code to execute well on a TI DSP but poorly on x86. However, keep in mind that a large degree of speculative execution is used on Intel's processors. This means that pipeline dependencies due to conditionals can be broken because multiple paths are executed. In these cases, only one of the results is used and the others are thrown away. In other words, if you saw pseudo code showing the literal execution of the unoptimized code above on Intel then it would probably look a lot like the optimized code. The lesson? For TI it is important to rearrange your code so that essentially it implements speculative execution as much as possible, and if applied correctly this optimization should not negatively impact your plug-in's native performance.

Case study: Additional optimization lessons from EQ3 and Dyn3

The pipeline optimization example above is just one example, and the following techniques also helped us achieve many-fold increases in performance. Note that many of these techniques are discussed in greater detail in the sections above.

Watch the assembly listing

In the process of optimizing these plug-ins we found their asm listing files very helpful, especially the Loop Carried Dependency Bound and the Partitioned Resource Bound information. The listing file shows how many cycles the code is taking to execute, and we could make an estimate of how far away we were from the optimal implementation by seeing how well the pipeline is being utilized.

Divide processing tasks over multiple calls

In the old RTAS version of EQ3 the coefficients were updated (smoothed) every 8 samples. Initially, this was changed to every 4 samples in the AAX version in order to easily work with 4-sample blocks on HDX. However, we were able to achieve better results by adding "ping pong" logic that alternates between smoothing the first and second half of the coefficients on each pass. To make this work in our odd-banded EQ we had to pad the smoothing coefficients by one biquad's worth to make an even number of biquads, but regardless of this inefficiency we still achieved performance gains.
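The "ping pong" idea can be sketched as follows, assuming a flat coefficient array with an even number of entries. The names (SmootherState, SmoothHalf, rate) are hypothetical and the smoothing formula is a generic one-pole step, not EQ3's actual implementation. The flag only selects an offset into the array, so the loop that runs is identical on every call:

```cpp
#include <cassert>

// Hypothetical smoothing state; names are illustrative, not from EQ3.
struct SmootherState
{
    int pingPong; // 0 or 1: which half of the array to smooth on this call
};

// Smooth one half of `smoothed` toward `target` per call, alternating halves.
// `count` must be even (pad with dummy coefficients if necessary).
inline void SmoothHalf(float* smoothed, const float* target, int count,
                       float rate, SmootherState& state)
{
    assert(count % 2 == 0);
    const int half = count / 2;
    // The flag only selects an offset; the loop itself never branches on it.
    const int offset = state.pingPong * half;
    for (int i = 0; i < half; ++i)
    {
        const float current = smoothed[offset + i];
        smoothed[offset + i] = current + rate * (target[offset + i] - current);
    }
    state.pingPong ^= 1; // the other half is smoothed on the next call
}
```

Because each half is now updated on every second call, the smoothing rate may need to be adjusted to keep the same overall ramp time.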

Eliminate branches that block pipelining

Eliminating large conditional branches is critical to optimal performance on TI. This can be an especially tempting pitfall for developers who are used to coding only for x86 processors.
Consider the "ping pong" optimization described above. This logic does not break pipelining because the conditional logic that checks the state of the flag does not result in a large branch; once the ping pong value is set, the exact same logic operates in every processing callback. If instead we used an if statement to determine which "side" should execute, this would prevent pipelining optimizations and would seriously impact performance.

Remove double-precision operations where they are not required

Here is some coefficient smoothing code from our pre-optimization EQ3 algorithm. This code was embedded in the inner biquad processing loop:
#pragma UNROLL(CBiquad::eNumCoefs)
for (int k = 0; k < CBiquad::eNumCoefs; ++k)
{
    double& dz = deZipper[k];
    step[k] = zeroCoef * (coefs[k] - dz);
}
#pragma UNROLL(CBiquad::eNumCoefs)
for (int k = 0; k < CBiquad::eNumCoefs; ++k)
{
    double nm1_dz = deZipper[k]; // read state
    nm1_dz += step[k];
    biquadCoefs[k] = static_cast<float>(nm1_dz);
    deZipper[k] = nm1_dz; // write state
}
Listing 31: Unoptimized coefficient smoothing in EQ3
To optimize this code, we converted the logic to use single-precision de-zipper values. However, this resulted in a sonic difference due to the fact that the smoothed coefficients would not necessarily ramp all the way to the correct target value. To solve that we added a conditional "clamp" that halts the smoothing once there is no difference between the 32-bit smoothed value and the target value. On examination of the assembler output, we found that this conditional pipelines very well.
#pragma UNROLL(CBiquad::eNumCoefs)
for (int i = 0; i < (cMaxNumBiquadsWithPad / 2) * CBiquad::eNumCoefs; ++i)
{
    float dz = deZipper[i];
    dz += zeroCoef * (coefs[i] - deZipper[i]);
    deZipper[i] = (dz == deZipper[i]) ? coefs[i] : dz; // clamp
}
Listing 32: Optimized coefficient smoothing in EQ3
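The reason the clamp is needed can be reproduced with a small stand-alone sketch. The function names and the rate value used in the usage note are hypothetical. In single precision, the update term eventually becomes smaller than half a unit in the last place of the smoothed value, at which point the addition no longer changes it and the ramp stalls short of the target. The clamped variant detects that stall and snaps to the target:

```cpp
#include <cassert>

// One-pole smoothing step in single precision, without the clamp.
inline float SmoothStep(float current, float target, float rate)
{
    return current + rate * (target - current);
}

// The same step with a clamp in the style of the optimized listing: if the
// update no longer changes the 32-bit value, snap to the target.
inline float SmoothStepClamped(float current, float target, float rate)
{
    const float next = current + rate * (target - current);
    return (next == current) ? target : next;
}

// Run `steps` iterations of either variant and return the final value.
inline float RunSmoother(float start, float target, float rate, int steps,
                         bool clamped)
{
    assert(steps >= 0);
    float value = start;
    for (int i = 0; i < steps; ++i)
    {
        value = clamped ? SmoothStepClamped(value, target, rate)
                        : SmoothStep(value, target, rate);
    }
    return value;
}
```

For example, ramping from 0.0f to 1.0f with a rate of 0.01f leaves the unclamped value slightly below 1.0f however many steps are run, while the clamped value lands on exactly 1.0f.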

Make coefficients contiguous

We were able to achieve significant performance gains in iterative loops like the smoothing code shown above by ensuring that all of the coefficients that would be accessed by the loop are contiguous in memory. In addition, note that in the optimized code there is only one loop, which iterates NumBiquads*NumCoefs times. This optimization is possible due to the fact that each filter's coefficients are contiguous in the coefs array.
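A minimal sketch of this flattening, with hypothetical names and sizes (Biquad, kNumCoefs): because an array of biquads stores its coefficients back to back, the per-biquad nested loop and the single flat loop touch exactly the same memory in the same order, but the flat loop gives the compiler one long trip count to pipeline instead of many short ones.

```cpp
#include <cassert>

// Hypothetical layout; the real sizes depend on the plug-in.
const int kNumCoefs = 5;

struct Biquad
{
    float coefs[kNumCoefs];
};

// Nested form: one short inner loop per biquad.
inline void SmoothPerBiquad(Biquad* state, const Biquad* target,
                            int numBiquads, float rate)
{
    assert(numBiquads >= 0);
    for (int b = 0; b < numBiquads; ++b)
    {
        for (int k = 0; k < kNumCoefs; ++k)
        {
            state[b].coefs[k] += rate * (target[b].coefs[k] - state[b].coefs[k]);
        }
    }
}

// Flat form: a single loop over numBiquads * kNumCoefs contiguous values.
inline void SmoothFlat(float* state, const float* target, int count, float rate)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i)
    {
        state[i] += rate * (target[i] - state[i]);
    }
}
```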

Use AAX_RESTRICT wherever applicable

We have found that the restrict keyword is vital for optimal performance on TI DSPs. For example, the parameter smoothing logic in our Dyn3 plug-in was reduced from 18 cycles to 3 cycles per loop iteration simply by the addition of this keyword to the applicable pointer variables.
For more information about the restrict keyword, see restrict.
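The following sketch shows the kind of loop where the qualifier matters. So that it compiles without the SDK, it uses a stand-in macro, SKETCH_RESTRICT, in place of AAX_RESTRICT; the macro, the function name, and the smoothing formula are all assumptions made for illustration and are not Dyn3's code.

```cpp
#include <cassert>

// Stand-in for AAX_RESTRICT so that this sketch compiles without the SDK.
// Assumption: the compiler accepts the __restrict spelling (GCC, Clang, MSVC).
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#define SKETCH_RESTRICT __restrict
#else
#define SKETCH_RESTRICT
#endif

// Without the qualifier the compiler must assume that a store to `smoothed`
// could modify `target`, which keeps loads and stores in program order.
// With it, loads for later iterations may be issued before earlier stores.
inline void SmoothParams(float* SKETCH_RESTRICT smoothed,
                         const float* SKETCH_RESTRICT target,
                         int count, float rate)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i)
    {
        smoothed[i] += rate * (target[i] - smoothed[i]);
    }
}
```

The qualifier is a promise from the caller: the two arrays must never overlap, or the behavior is undefined.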

Be aware of shell overhead

In the TI Shell there is code that loops through every buffered coefficient FIFO before every sample buffer in order to swap the algorithm's context field pointers to a new set of coefficients if one is available. This uses a nominal number of cycles per buffered port, which can add up very quickly in small plug-ins.
For example, before our optimizations EQ3 used eight individual buffered coefficient blocks. On investigation, we found that the shell overhead from managing these buffers added up to be roughly equivalent to the algorithm's total processing cycles! To work around this we merged the 8 coefficient blocks into one large block. The trade-off of this optimization is that more work must be done on the host to re-generate and copy the whole coefficient state every time any parameter changes, so this is an optimization that should be applied only when appropriate for the individual plug-in.
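The trade-off can be pictured with a small data-layout sketch. The type names and sizes here (BandCoefs, AllBandCoefs, kNumBands) are hypothetical and do not reflect EQ3's actual structures or the AAX port API; the sketch only shows why the merged layout shifts work to the host.

```cpp
#include <cassert>

// Hypothetical layout; names and sizes are illustrative, not from EQ3.
const int kNumBands = 8;
const int kCoefsPerBand = 5;

// Before: each band's coefficients travel in their own buffered block,
// so the shell has eight ports to check before every sample buffer.
struct BandCoefs
{
    float coefs[kCoefsPerBand];
};

// After: one block carries every band, so there is a single port to check.
struct AllBandCoefs
{
    BandCoefs bands[kNumBands];
};

// Host-side cost of the merge: a change to any one band means rebuilding
// and re-sending the complete block.
inline AllBandCoefs RebuildBlock(const AllBandCoefs& previous, int changedBand,
                                 const BandCoefs& newCoefs)
{
    assert(changedBand >= 0 && changedBand < kNumBands);
    AllBandCoefs next = previous; // copies the state of all bands
    next.bands[changedBand] = newCoefs;
    return next;
}
```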

Watch for opportunities to merge or eliminate operations

Keep an eye out for unnecessary processing stages performed by your algorithm. Gain stages, phase toggles, and "dummy" coefficients are particularly good candidates for this kind of optimization.
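One common instance of such a merge, sketched here under assumptions rather than taken from EQ3 or Dyn3, is folding an output gain and a phase (polarity) toggle into a biquad's feed-forward coefficients. Scaling b0, b1, and b2 by a factor g scales the filter output by g, so the separate per-sample multiply can be dropped from the DSP loop. The type and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical biquad coefficient set (a0 normalized to 1).
struct BiquadCoefs
{
    float b0, b1, b2; // feed-forward
    float a1, a2;     // feedback
};

struct BiquadState
{
    float x1, x2, y1, y2;
};

// Fold an output gain and a polarity toggle into the feed-forward
// coefficients. `gain` is a linear magnitude; polarity is carried by
// `invertPhase`.
inline BiquadCoefs FoldGain(const BiquadCoefs& c, float gain, bool invertPhase)
{
    assert(gain >= 0.0f);
    const float g = invertPhase ? -gain : gain;
    BiquadCoefs folded = c;
    folded.b0 *= g;
    folded.b1 *= g;
    folded.b2 *= g;
    return folded;
}

// Direct form I reference filter, used to compare the two arrangements.
inline float ProcessSample(const BiquadCoefs& c, BiquadState& s, float x)
{
    const float y = c.b0 * x + c.b1 * s.x1 + c.b2 * s.x2
                  - c.a1 * s.y1 - c.a2 * s.y2;
    s.x2 = s.x1;
    s.x1 = x;
    s.y2 = s.y1;
    s.y1 = y;
    return y;
}
```

The folding happens wherever coefficients are generated, so the cost is paid once per parameter change rather than once per sample.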

Read the TI documentation

There are many helpful optimization resources available from Texas Instruments. Out of all of the TI optimization documents we encountered, we found the Hand-Tuning Loops and Control Code on the TMS320C6000 guide to be the most helpful and complete.

Optimization on the HDX platform

Interrupt latency

In addition to the large interrupt latency caused by context switching (many registers to store) and by the pipeline (many stages), interrupts can be disabled around pipelined loops, which cannot be interrupted. This can be controlled with the -mi=X compiler option, which will disallow unsafe pipelining for loops that are longer than X cycles. See TI's documentation (SPRU187O Section 2.12) for more details and references regarding this behavior.

External memory access

A loop which performs many reads and writes may require access to external memory. In this scenario, the loop may take tens or even hundreds of times longer to execute than the compiler expects it to!
There are two options for dealing with this:
  1. Search and destroy these loops individually
    • Move all the data used by the loop to internal RAM.
    • Use HDX's DMA facilities for external memory accesses.
    • #pragma FUNC_INTERRUPT_THRESHOLD can be used to disable pipelining on a case by case basis.
  2. For modules that are known to have these loops but are not worth hand-optimizing, turn off pipelined loop optimization altogether (-mu, also known as --disable_software_pipelining).
Note
This is only a problem in the C67(0-2)x ISAs used on the HDX platform. In the C64xx and C674x ISAs, there is an SPLOOP command which can buffer the branches within pipelined loops to allow them to be interruptible.
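As a sketch of the per-function approach listed above, the pragma can be placed directly before the function that contains the problem loop. This example rests on two assumptions that should be checked against the compiler guide for your toolchain version: that the C++ form of FUNC_INTERRUPT_THRESHOLD takes only a threshold and applies to the next function declared, and that a threshold of 1 requires the function to be interruptible at all times, which prevents the compiler from generating non-interruptible pipelined loops for it. The function name is hypothetical, and the pragma is guarded so that the file still builds with other compilers.

```cpp
#include <cassert>

// The pragma is only meaningful to TI's compiler, so it is guarded here.
// Assumption: the C++ form takes a single threshold argument.
#if defined(__TI_COMPILER_VERSION__)
#pragma FUNC_INTERRUPT_THRESHOLD(1) // assumed: must always be interruptible
#endif
void CopyFromExternal(float* dst, const float* src, int count)
{
    assert(count >= 0);
    // Reads from `src` may hit external memory and stall for far longer than
    // the compiler's schedule assumes, so this loop should stay interruptible.
    for (int i = 0; i < count; ++i)
    {
        dst[i] = src[i];
    }
}
```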

Code Composer Studio optimization tools

Compiler Consultant

The Compiler Consultant tool can be used to suggest additional optimizations.
To enable the Compiler Consultant in Code Composer Studio, do the following:
  1. Set an optimization level of -o2 or -o3 (Found in CCSv4 under Build Options > Compiler > Basic)
  2. Set the --consultant: Generate Compiler Consultant Advise switch (Found in CCSv4 under Build Options > Compiler > Feedback)

Optimization information file

Optimization information files can be generated in Code Composer Studio by selecting the option Build Options > Compiler > Feedback > Opt Info File. Optimization information files have an .nfo extension and are placed into the project's intermediate build products directory. In general, these files list function call-graph information and describe whether or not individual functions can be inlined.

Error Codes

The following appendices document error codes that are specific to plug-in hosting in Pro Tools HDX and other AAX platforms based on the TI DSP environment.

-138xx: DHM Core DSP errors

These errors relate to routing and assignment problems on Pro Tools HDX hardware. Plug-ins should never be able to trigger these error codes, which indicate low-level problems in the system.

Table 1: DHM Core DSP error codes

Value Definition
-13801 ePSError_CTIDSP_WrongSampleRate
-13802 ePSError_CTIDSP_NoFreeStreams
-13803 ePSError_CTIDSP_StreamCreationTimeout
-13804 ePSError_CTIDSP_StreamDestruction
-13805 ePSError_CTIDSP_InactiveStream
-13806 ePSError_CTIDSP_StreamCorrupted
-13807 ePSError_CTIDSP_QueueFull
-13808 ePSError_CTIDSP_NullPointer
-13809 ePSError_CTIDSP_WrongStreamID
-13810 ePSError_CTIDSP_ImageError
-13811 ePSError_CTIDSP_ResetError
-13812 ePSError_CTIDSP_ImageVerify
-13813 ePSError_CTIDSP_DSPAlreadyInBootOrReset
-13814 ePSError_CTIDSP_TriggerInterrupt
-13815 ePSError_CTIDSP_BufferSizeNotAligned
-13816 ePSError_CTIDSP_TimeoutWaitingForHPIC
-13817 ePSError_CTIDSP_SetUHPIError
-13818 ePSError_CTIDSP_UHPINotReady

-140xx: AAX Host errors

These errors relate to logic failures in the AAX host software. These errors can be due to plug-in bugs or system configuration problems.

Table 2: AAX Host Software error codes

Value Definition
-14001 kAAXH_Result_Warning
-14003 kAAXH_Result_UnsupportedPlatform
-14004 kAAXH_Result_EffectNotRegistered
-14005 kAAXH_Result_IncompleteInstantiationRequest
-14006 kAAXH_Result_NoShellMgrLoaded
-14007 kAAXH_Result_UnknownExceptionLoadingTIPlugIn
-14008 kAAXH_Result_EffectComponentsMissing
-14009 kAAXH_Result_BadLegacyPlugInIDIndex
-14010 kAAXH_Result_EffectFactoryInitedTooManyTimes
-14011 kAAXH_Result_InstanceNotFoundWhenDeinstantiating
-14012 kAAXH_Result_FailedToRegisterEffectPackage
-14013 kAAXH_Result_PlugInSignatureNotValid
-14014 kAAXH_Result_ExceptionDuringInstantiation
-14015 kAAXH_Result_ShuffleCancelled
-14016 kAAXH_Result_NoPacketTargetRegistered
-14017 kAAXH_Result_ExceptionReconnectingAfterShuffle
-14018 kAAXH_Result_EffectModuleCreationFailed
-14019 kAAXH_Result_AccessingUninitializedComponent
-14020 kAAXH_Result_TIComponentInstantiationPostponed
-14021 kAAXH_Result_FailedToRegisterEffectPackageNotAuthorized
-14022 kAAXH_Result_FailedToRegisterEffectPackageWrongArchitecture
-14023 kAAXH_Result_PluginBuiltAgainstIncompatibleSDKVersion
-14100* kAAXH_Result_InvalidArgumentValue
-14101* kAAXH_Result_NameNotFoundInPageTable
*Overlaps with -141xx: TI System errors definitions

-141xx: TI System errors

These errors relate to logic failures in the TI management software and generally indicate a failure in the HDX system services such as buffered message queues, context management, and callback timing.

Table 3: TI system error codes

Value Definition
-14101 eTISysErrorNotImpl
-14102 eTISysErrorMemory
-14103 eTISysErrorParam
-14104 eTISysErrorNull
-14105 eTISysErrorCommunication
-14106 eTISysErrorIllegalAccess
-14107 eTISysErrorDirectAccessOfFifoBlocksUnsupported
-14108 eTISysErrorPortIdOutOfBounds
-14109 eTISysErrorPortTypeDoesNotSupportDirectAccess
-14110 eTISysErrorFIFOFull
-14111 eTISysErrorRPCTimeOutOnDSP
-14112 eTISysErrorShellMgrChip_SegsDontMatchAddrs
-14113 eTISysErrorOnChipRPCNotRegistered
-14114 eTISysErrorUnexpectedBufferLength
-14115 eTISysErrorUnexpectedEntryPointName
-14116 eTISysErrorPortIDTooLargeForContextBlock
-14117 eTISysErrorMixerDelayNotSupportedForPlugIns
-14118 eTISysErrorShellFailedToStartUp
-14119 eTISysErrorUnexpectedCondition
-14120 eTISysErrorShellNotRunningWhenExpected
-14121 eTISysErrorFailedToCreateNewPIInstance
-14122 eTISysErrorUnknownPIInstance
-14123 eTISysErrorTooManyInstancesForSingleBufferProcessing
-14124 eTISysErrorNoDSPs
-14125 eTISysBadDSPID
-14126 eTISysBadPIContextWriteBlockSize
-14128 eTISysInstanceInitFailed
-14129 eTISysSameModuleLoadedTwiceOnSameChip
-14130 eTISysCouldNotOpenPlugInModule
-14131 eTISysPlugInModuleMissingDependcies
-14132 eTISysPlugInModuleLoadableSegmentCountMismatch
-14133 eTISysPlugInModuleLoadFailure
-14134 eTISysOutOfOnChipDebuggingSpace
-14135 eTISysMissingAlgEntryPoint
-14136 eTISysInvalidRunningStatus
-14137 eTISysExceptionRunningInstantiation
-14138 eTISysTIShellBinaryNotFound
-14139 eTISysTimeoutWaitingForTIShell
-14140 eTISysSwapScriptTimeout
-14141 eTISysTIDSPModuleNotFound
-14142 eTISysTIDSPReadError

-142xx: DIDL errors

These errors all relate to the dynamic library loading system that manages ELF DLL binaries on Pro Tools HDX hardware. For example, an eDIDL_FileNotFound error will be raised if the ELF DLL name specified by an Effect's Describe code does not match any DLL that is present in the plug-in's bundle.

Table 4: DIDL error codes

Value Definition
-14201 eDIDL_FileNotFound
-14202 eDIDL_FileNotOpen
-14203 eDIDL_FileAlreadyOpen
-14204 eDIDL_InvalidElfFile
-14205 eDIDL_ImageNotFound
-14206 eDIDL_SymbolNotFound
-14207 eDIDL_DependencyNotLoaded
-14208 eDIDL_BadAlignment
-14209 eDIDL_NotImplemented

-144xx: HDX hardware errors

These errors relate to failures on the HDX hardware itself. Plug-ins should never be able to trigger these error codes, which indicate low-level problems in the system.

Table 5: HDX hardware error codes

Value Definition
-14401 eBerlinImageError
-14402 eBerlinImageWriteError
-14403 eBerlinInvalidArgs
-14404 eBerlinCantGetTMSChannel
-14405 eBerlinChunkWriteError
-14406 eBerlinChunkReadError
-14407 eBerlinInvalidReqID
-14408 eBerlinDSPInResetError
-14409 eBerlinDSPTimeOut
-14410 eBerlinIncorrectTdmCableWiring
-14411 eBerlinInvalidClock

-145xx: DHM isochronous audio engine errors

These errors relate to failures within the HDX audio engine software. Plug-ins should never be able to trigger these error codes, which indicate low-level problems in the system.

Table 6: DHM isochronous audio engine error codes

Value Definition
-14500 eDsiIsochEngineGenericError
-14501 eDsiIsochEngineWrongChannelNumber
-14502 eDsiIsochEngineTxRingFull
-14503 eDsiIsochEngineRxRingNotReady
-14504 eDsiIsochEngineWrongNumberOfSamplesRequest
-14505 eDsiIsochEngineUnrecognizedSampleRate
-14506 eDsiIsochEngineUnsupportedSampleSizeBytes
-14507 eDsiIsochEngineUnsupportedNumberOfChannels
-14508 eDsiIsochEngineUnsupportedSampleRate
-14509 eDsiIsochEngineDMAAlreadyEnabled
-14510 eDsiIsochEngineDMAAlreadyDisabled
-14511 eDsiIsochEngineInterruptHandlerAlreadyInstalled
-14512 eDsiIsochEngineBadCardRecord
-14513 eDsiIsochEngineCantSetValueDuringStreaming
-14514 eDsiIsochEngineStreamingAlreadyStarted
-14515 eDsiIsochEngineStreamingAlreadyStopped
-14516 eDsiIsochEngineStreamingCantBeStarted
-14517 eDsiIsochEngineUnsupportedSamplesPerInterrupt
-14518 eDsiIsochEngineCantSetSamplesPerInterrupt
-14519 eDsiIsochEngineInterruptLoopAlreadyExists
-14520 eDsiIsochEngineGlobalDMADisabled
-14521 eDsiIsochEngineActiveInterruptMaskAlreadyEnabled
-14522 eDsiIsochEngineSDI0Errors

-30xxx: Dynamically-generated error codes

Errors in the -30xxx range are dynamically generated codes, and thus the same failure point could generate a different error code depending on the order in which errors occurred. These kinds of error codes are used heavily by the TI Shell Manager, the host component that interacts with the on-DSP shell environment.
If one of these error codes is being generated by the TI Shell Manager (the most common case) then you should be able to get more information about the failure by enabling the following DigiTrace logging facility:
DTF_TISHELLMGR=file@DTP_NORMAL
or, within the DSH tool:
enable_trace_facility [DTF_TISHELLMGR, DTP_NORMAL]
This should result in a log with more information such as the name of the failing plug-in, the dynamically generated error code, and a string description of its meaning. Depending on the failure case, the DAE dish command getlastdsploaderror can also sometimes be used to retrieve the description string for a dynamically-generated error if it was the last error generated during the DSP loading operation.
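When triaging logs, it can be convenient to map a raw error value onto the ranges documented in this appendix. The helper below is a hypothetical convenience sketch, not part of the AAX SDK. It follows the tables above, including the overlap at -14101, which is defined in both the AAX Host and TI System tables.

```cpp
#include <cassert>

// Ranges as documented in the tables above. Hypothetical helper type.
enum class ErrorRange
{
    DhmCoreDsp,           // -138xx
    AaxHost,              // -140xx, plus -14100
    AaxHostOrTiSystem,    // -14101 appears in both tables
    TiSystem,             // -141xx
    Didl,                 // -142xx
    HdxHardware,          // -144xx
    DhmIsochEngine,       // -145xx
    DynamicallyGenerated, // -30xxx
    Unknown
};

// True when `code` lies in the closed interval [-(base + width - 1), -base].
inline bool InRange(int code, int base, int width)
{
    return code <= -base && code > -(base + width);
}

inline ErrorRange ClassifyError(int code)
{
    assert(code < 0); // all documented codes are negative
    if (InRange(code, 13800, 100)) return ErrorRange::DhmCoreDsp;
    if (code == -14101) return ErrorRange::AaxHostOrTiSystem;
    if (InRange(code, 14000, 100) || code == -14100) return ErrorRange::AaxHost;
    if (InRange(code, 14100, 100)) return ErrorRange::TiSystem;
    if (InRange(code, 14200, 100)) return ErrorRange::Didl;
    if (InRange(code, 14400, 100)) return ErrorRange::HdxHardware;
    if (InRange(code, 14500, 100)) return ErrorRange::DhmIsochEngine;
    if (InRange(code, 30000, 1000)) return ErrorRange::DynamicallyGenerated;
    return ErrorRange::Unknown;
}
```

A value in the dynamically generated range carries no fixed meaning by itself, so the DigiTrace log described above is still needed to interpret it.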