
System Optimization

Performance and power efficiency are key for embedded systems. The following sections list some tips and best practices for system optimization from this perspective.

Profiling

The Xtensa Xplorer IDE can run a software simulation and profile the application directly. Both simulation and profiling are cycle accurate, which is convenient for developers of algorithms or compute-heavy applications.

Figure 1 shows the profiling result of the helloworld program on the simulation console.

Figure 1 shows part of the profiling chart.

The generated license file only supports debug/run on the RT6xx device target. It does not support software simulation/Xplorer ISS. If there are special requirements to run the software simulations, contact Cadence directly.

It is also common to measure exact cycle counts for specific processing or timing measurements. The following example code shows how to count cycles.

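/* Inline function that reads the Xtensa CCOUNT cycle counter register */
static inline unsigned long get_ccount(void)
{
    unsigned long r;
    __asm__ volatile ("rsr.ccount %0" : "=r" (r));
    return r;
}

/* Inside the calling function: measure the cycles spent in processing_function() */
unsigned long tic, toc;
tic = get_ccount();
processing_function();
toc = get_ccount();
printf("Processing takes %lu cycles \r\n", toc - tic);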
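
CCOUNT is a 32-bit counter that wraps around, so the difference between two samples is best taken in unsigned 32-bit arithmetic. The following is a minimal sketch of that idea and is not part of the SDK; the helper names cycles_elapsed and cycles_to_us are hypothetical, and the core clock is whatever value the caller supplies.

```c
#include <stdint.h>

/* Hypothetical helper: cycles between two CCOUNT samples.
 * Unsigned 32-bit subtraction stays correct across one counter wraparound. */
static inline uint32_t cycles_elapsed(uint32_t tic, uint32_t toc)
{
    return toc - tic;
}

/* Hypothetical helper: convert a cycle count to microseconds for a given
 * core clock in Hz. The 64-bit intermediate avoids overflow in the multiply. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```

For example, 600000 cycles at a 600 MHz core clock correspond to 1000 microseconds. The measurement is only valid if the measured region takes less than one full wrap of the 32-bit counter.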


Using Local Memories

The RT6xx HiFi4 DSP has 64 KB of data TCM and 64 KB of instruction TCM. Kernel vectors occupy less than 2 KB of the TCM, and the rest is available to the application. The TCMs are the fastest RAM available, with no access latency, and can considerably improve the performance of critical data and instructions. Use TCM as much as possible.

To place code and data in the TCM area:

  1. Define macros for the sections in TCM. Reuse the existing .dram0 and .iram0 sections, which are the default TCM sections in every Xplorer project and memory map. The following section macros are defined in fsl_common.h.

    #define DRAM0_DATA __attribute__((section(".dram0.data")))
    #define DRAM0_BSS __attribute__((section(".dram0.bss")))
    #define IRAM0_TEXT __attribute__((section(".iram0.text")))
    #define ALIGNED(alignbytes) __attribute__((aligned(alignbytes)))
    
  2. Use the macros to declare the code and data to be placed in TCM. Initialized data and bss (uninitialized data) go to different sections. The following is an example for an FFT function call.

    DRAM0_DATA static const int32_t fft_in_ref[FFT_LENGTH] = { … };
    DRAM0_DATA static const int32_t fft_out_ref[FFT_LENGTH] = { … };
    DRAM0_BSS static int32_t fft_out[FFT_LENGTH];
    IRAM0_TEXT int TEST_FFT()
    {
      …
            fft_cplx32x32(fft_out, fft_in, FFT_HANDLE, 3);
      …
    }
    
  3. The TCM addresses start from 0x2400 0000, which is too far from main() for a regular call to reach. Therefore, enable long calls in the project so that main() can call functions placed in TCM. To enable them, select Build Properties > Optimization > Enable long calls and select the Yes checkbox. Alternatively, add -mlongcalls to the compiler flags.
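
The steps above can be sketched as follows. This is an illustration under stated assumptions, not SDK code: demo_buf, DEMO_LEN, DATA_TCM_BUDGET, and demo_fill_and_sum are hypothetical names, the 2 KB reserve is a conservative guess based on the vector size mentioned earlier, and the empty host fallback for the section macros exists only so the sketch compiles off-target. In a real project, the macros come from fsl_common.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Section macros as listed in step 1. On a non-Xtensa host they expand to
 * nothing so this sketch still compiles (assumption for illustration). */
#ifndef DRAM0_BSS
#if defined(__XTENSA__)
#define DRAM0_BSS  __attribute__((section(".dram0.bss")))
#define IRAM0_TEXT __attribute__((section(".iram0.text")))
#else
#define DRAM0_BSS
#define IRAM0_TEXT
#endif
#endif

#define DEMO_LEN        1024u                      /* hypothetical buffer length */
#define DATA_TCM_BUDGET ((64u * 1024u) - 2048u)    /* 64 KB minus an assumed 2 KB reserve */

/* Uninitialized working buffer placed in data TCM. */
DRAM0_BSS static int32_t demo_buf[DEMO_LEN];

/* Fail the build early if the buffer cannot fit in the assumed budget. */
_Static_assert(sizeof(demo_buf) <= DATA_TCM_BUDGET, "demo_buf exceeds the data TCM budget");

/* Hot function placed in instruction TCM: fills the buffer and sums it. */
IRAM0_TEXT int32_t demo_fill_and_sum(void)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < DEMO_LEN; i++) {
        demo_buf[i] = (int32_t)i;
        sum += demo_buf[i];
    }
    return sum;
}
```

The compile-time check only covers this one buffer; the linker map remains the authoritative view of how much TCM a project actually uses.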


Power Efficiency

The RT6xx HiFi4 DSP is a powerful processing core that can run at 600 MHz, but that speed is not always necessary. It is recommended to optimize power efficiency at the system level.

Voltage level and core frequency have a large impact on power efficiency. Taking FFT processing as an example, continuous FFT may draw ~130 mW at Vddcore 1.1 V / 452 MHz and ~4 mW at Vddcore 0.7 V / 29 MHz, at room temperature. Profile the software workload or count its cycles. If it needs only 300 MHz at peak, running at full power is unnecessary; it can run at Vddcore 0.8 V / 300 MHz instead. For more information on the Vddcore and DSP frequency operating conditions, see BOARD_SetPmicVoltageForFreq() in cm33\pmic_support.c of any DSP example, or see section 13.1, General Operating Conditions, of the data sheet.
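
The arithmetic behind this advice can be sketched as follows. The helper names required_hz and pj_per_cycle are hypothetical, and the workload figures in the usage note are made-up examples; only the power and frequency pairs come from the approximate FFT figures quoted above.

```c
#include <stdint.h>

/* Hypothetical helper: minimum core clock in Hz needed to complete
 * cycles_per_frame cycles of work, frames_per_sec times per second. */
static inline uint64_t required_hz(uint32_t cycles_per_frame, uint32_t frames_per_sec)
{
    return (uint64_t)cycles_per_frame * frames_per_sec;
}

/* Hypothetical helper: average energy per cycle in picojoules.
 * mW / MHz gives nJ per cycle, so multiply by 1000 for pJ (integer division). */
static inline uint32_t pj_per_cycle(uint32_t power_mw, uint32_t freq_mhz)
{
    return (power_mw * 1000u) / freq_mhz;
}
```

For example, a workload measured at 3,000,000 cycles per frame and 100 frames per second needs 300 MHz. With the approximate FFT figures above, the energy per cycle is about 287 pJ at 1.1 V / 452 MHz and about 137 pJ at 0.7 V / 29 MHz, so the lower operating point roughly halves the energy spent on the same amount of work.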

Some other tips for better power efficiency:

  • Turn off the DSP or set DSPSTALL when the DSP is not needed.

  • If possible, adapt the DSP clock to its workload.

  • Call XT_WAITI when the DSP sits in a while loop waiting for an interrupt. Similar to __WFI() on the Arm side, it suspends some processor operations to reduce power consumption.

    #include <xtensa/tie/xt_interrupt.h>
    extern void XT_WAITI(immediate s);
    

    The immediate value is the highest interrupt level to be ignored: interrupts at that level and below do not wake up the DSP. For example, if you call XT_WAITI(2), both L1 and L2 interrupts are ignored and only L3 interrupts can wake up the DSP.

    Current consumption can be reduced further by turning off the clocks to unused memory partitions. For more information, see User Manual 4.5.5.13 DSP SRAM access disable, Register SYSCTL0_DSP_SRAM_ACCESS_DISABLE.

  • PLLs consume power. Consider FFRO for low-power use cases.
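
The XT_WAITI tip above can be sketched as an idle loop. This is a minimal illustration, not SDK code: demo_isr, g_work_ready, and wait_for_work are hypothetical names, and the host-only stand-in for XT_WAITI, which simply calls the handler as if an interrupt had fired, exists only so the sketch compiles off-target.

```c
#include <stdbool.h>

/* Hypothetical interrupt handler, defined below. */
void demo_isr(void);

#if defined(__XTENSA__)
#include <xtensa/tie/xt_interrupt.h>
#else
/* Host-only stand-in (assumption): pretend an interrupt fired. */
#define XT_WAITI(level) demo_isr()
#endif

/* Set by the interrupt handler, polled by the idle loop. */
static volatile bool g_work_ready = false;

void demo_isr(void)
{
    g_work_ready = true;
}

/* Sleep until the handler flags work, then consume the flag.
 * Returns how many times XT_WAITI was called. */
int wait_for_work(void)
{
    int sleeps = 0;
    while (!g_work_ready) {
        XT_WAITI(0);   /* level 0: no interrupt level is ignored */
        sleeps++;
    }
    g_work_ready = false;
    return sleeps;
}
```

On real hardware, an interrupt that arrives between the flag check and XT_WAITI is handled before the core sleeps, so the core then waits for the next interrupt. A production loop would need to close that window, for example by masking interrupts around the check.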
