ST Edge AI Core

How to profile a deployed model

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r1.0

Overview

The npu_profiler.py file is a utility script used to profile a deployed model through the aiValidation stack. Based on the ai_runner Python module and the information provided by the NPU compiler, it reports, epoch by epoch, the compiler estimations (operations, processing units, cycles, memory accesses) and the measured target metrics (durations, cycles, bus traffic, stream engine activity).

Setting up a work environment

The stm_ai_runner Python package includes the npu_profiler.py script located at $STEDGEAI_CORE_DIR/scripts/ai_runner/examples/npu_profiler.py. To set up the work environment, ensure that all dependencies listed in the “How to use the AiRunner package” article are installed.

How-to

Model deployment

The model is deployed on the STM32N6-DK development board with an instrumented version of the aiValidation stack that allows data to be collected during the inferences (see the “Getting started - How to evaluate a model on STM32N6 board” article).

Required NPU compiler options

Different memory pool configurations or options can be used without restriction. The only constraint is that the epoch controller must not be used, to keep the epoch-level granularity.

To report the NPU compiler information, the following compiler options are required (the generation of the “c_info.json” file is added by the front end, that is, the stedgeai generate command):

--csv-file network.csv --output-info-file "c_info.json" 

Note

To obtain relevant information from the NPU compiler, the attributes in the memory pool descriptors file should be correctly defined and aligned with the hardware settings, particularly for the freqRatio and byteWidth attributes.

Note

To evaluate the additional memory traffic required to fetch the bit-stream by epoch, the --enable-epoch-controller --ec-single-epoch options can be used. These options allow the epoch controller to be used while maintaining epoch granularity. However, in this mode, some information generated by the NPU compiler will not be available.

Step-by-step

Generate/compile the model

$ stedgeai generate -m <model_file> --target stm32n6 --st-neural-art <profile>@neural_art.json

Deploy the generated model on the STM32N6-DK development board with the aiValidation stack

$ python $STEDGEAI_CORE_DIR/scripts/N6_scripts/n6_loader.py ..

Launch the npu_profiler utility

$ python $STEDGEAI_CORE_DIR/scripts/ai_runner/examples/npu_profiler.py --cfile ./st_ai_ws/neural_art__network/

The '--cfile' argument indicates the location of the generated network.c file. The associated folder should contain the c_info.json file.

Runtime instrumentation

The user callbacks from the LL_ATON stack are used to enable/set and to dump the different hardware counters (mainly from the NPU debug and trace unit).

  • ts indicates the time-stamp based on an MCU or NPU free-running counter, used to measure the execution time of the different steps.
  • pre/core/post indicate the different phases of executing a given epoch (see the sketch below).
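
As an illustration, the per-phase durations can be derived from the timestamps captured in the epoch callback described later in this article. The following minimal sketch assumes a hypothetical storage structure filled with the values returned by the _get_free_running_counters() helper used in that callback:

#include <stdint.h>

/* Hypothetical per-epoch storage: one time-stamp per callback event. */
typedef struct {
  uint32_t ts_pre_start;   /* time-stamp captured at PRE_START  */
  uint32_t ts_post_start;  /* time-stamp captured at POST_START */
  uint32_t ts_pre_end;     /* time-stamp captured at PRE_END    */
  uint32_t ts_post_end;    /* time-stamp captured at POST_END   */
} epoch_ts_t;

/* Durations of the three phases, in free-running counter ticks. */
static inline uint32_t pre_duration(const epoch_ts_t *t)  { return t->ts_post_start - t->ts_pre_start; }
static inline uint32_t core_duration(const epoch_ts_t *t) { return t->ts_pre_end - t->ts_post_start; }
static inline uint32_t post_duration(const epoch_ts_t *t) { return t->ts_post_end - t->ts_pre_end; }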

Precision of the measures

Start/stop counters are used during the execution of the user callbacks and are based on the NPU and/or MCU clocks without a divider, which introduces some jitter in the measurements. This jitter can be mitigated by performing multiple inferences; the resulting error is not significant compared to the epoch length.

Reported Profiling Metrics

After executing different inferences with various counter configurations, the npu_profiler utility returns results for the entire model as well as for individual epochs of the model.

Initial part (runtime information)

After connecting to the STM32N6-DK development board, the main runtime information is reported. This includes the main characteristics of the deployed model (I/O format/size), the number of epochs, the tools version, and other relevant details.

 NPU Utility - AiRunner Profiler (version 0.2)
 Creating date : Tue Mar  4 22:27:25 2025
 Creating AiRunner session with `serial:921600` descriptor
 Proto-buffer driver v2.0 (msg v3.1) (Serial driver v1.0 - COM40:921600) ['network']

  Summary 'network' - ['network']
  ------------------------------------------------------------------------------------------------------------
  I[1/1] 'Input_11_out_0'        :   uint8[1,224,224,3], 150528 Bytes, QLinear(0.003921569,0,uint8), activations
  O[1/1] 'Transpose_597_out_0'   :   uint8[1,3087,6], 18522 Bytes, QLinear(0.007704938,3,uint8), activations
  n_nodes                        :   199
  activations                    :   1254400
  compile_datetime               :   Mar  3 2025 20:53:13
  ------------------------------------------------------------------------------------------------------------
  protocol                       :   Proto-buffer driver v2.0 (msg v3.1) (Serial driver v1.0 - COM40:921600)
  tools                          :   ST Neural ART (LL_ATON api) v1.0.0
  runtime lib                    :   atonn-v1.0.0-rc0-16-g0307e413 (optimized SW lib v10.0.0-fd22b7f9 GCC)
  capabilities                   :   IO_ONLY, PER_LAYER, PER_LAYER_WITH_DATA
  device.desc                    :   stm32 family - 0x486 - STM32N6xx @800/400MHz
  device.attrs                   :   fpu,core_icache,core_dcache,npu_cache=1,mcu_freq=800MHz,
                                     noc_freq=400MHz,npu_freq=1000MHz,nic_freq=900MHz
  ------------------------------------------------------------------------------------------------------------

Per epoch

 ------------------------------------------------------------
   EpochBlock_186 (c_idx=192, num=186:186, c_type=HW, extra=0)
 ------------------------------------------------------------
 [compiler]
   epoch type             : EpochType.HW
   operations             : {'hw.conv': 4, 'hw.mul': 1, 'hw.add': 2, 'hw.sigmoid': 1, 'hw.dmain': 6, 'hw.dmaout': 2}
   processor units        : {'CONV_ACC': 4, 'ARITH_ACC': 3, 'ACTIV_ACC': 1, 'STR_ENG': 8}
   ops                    : 453348
   compute cycles         : 4608 (max_cycles=27648)
   ops/cycle              : 16.4 GOPS (ideal=98.4)
   mem accesses           : [octoFlash: r=4608, w=0] [npuRAM5: r=19600, w=8820]
 [target]
   duration               : 0.043ms (0.3%) total=16.431ms
   mcu cycles             : [18859, 10599, 5253] -> 0.024,0.013,0.007 ms (mcu_freq=800MHz), 0.043ms
   core npu cycles        : 13300 -> 0.013 ms (npu_freq=1000MHz) diff. vs max_cycles from compiler: -14348 (-107.88%)
   compute cycles ratio   : 10.59% (core only: 34.65%)
   busif 0                : r=14080    w=0        rburst[1,2,4,8]x8=(8,12,16,208) wburst[1,2,4,8]x8=(0,0,0,0) -> 1.06 GB/s
   busif 1                : r=10208    w=7960     rburst[1,2,4,8]x8=(4,12,8,152) wburst[1,2,4,8]x8=(5,9,9,117) -> 1.37 GB/s
   busif (total)          : r=24288      w=7960       -> 2.42 GB/s
   ops/cycle              : 10.4 GOPS (core only: 34.1)
   strg engines           : 114.21% (id=4)
    active                : STRENG.1.i: busif=0, 5845 cycles, [npuRAM5:12544], 44%
    active                : STRENG.2.i: busif=0, 5679 cycles, [npuRAM5:12544], 43%
    active                : STRENG.3.i: busif=0, 5618 cycles, [npuRAM5:12544], 42%
    active                : STRENG.4.i: busif=0, 15190 cycles, [octoFlash:4608], 114%
    active                : STRENG.5.i: busif=1, 6788 cycles, [npuRAM5:1764], 51%
    active                : STRENG.6.i: busif=1, 6270 cycles, [npuRAM5:12544], 47%
    active                : STRENG.7.o: busif=1, 13935 cycles, [npuRAM5:882], 105%
    active                : STRENG.8.o: busif=1, 13785 cycles, [npuRAM5:1764], 104%
   NPU cache cnts         : R[hit=0, miss=80, alloc-miss=0, evict=0], W[hit=0, miss=0, alloc-miss=0, through=0]
    octoFlash             : r=4608       w=0          -> 346.47 MB/s
    npuRAM5               : r=19600      w=8820       -> 2.14 GB/s
    total                 : r=24208      w=8820       -> 2.48 GB/s

Compiler Section

Parameter        Description
epoch type       Type of epoch
operations       Kind and number of operations mapped on the associated processing units
processor units  Used processing units
ops              Total number of mathematical operations required to execute the epoch
compute cycles   Total number of computed cycles (ideal memory)
ops/cycle        Estimated number of operations per cycle (including memory accesses)
mem accesses     Number of read/write memory accesses per memory pool
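
As a cross-check with the per-epoch listing above: 453348 ops divided by max_cycles=27648 gives approximately 16.4 operations per cycle, while dividing by the 4608 ideal compute cycles gives the ideal value of approximately 98.4.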

Target Section

Parameter             Description
duration              Duration in milliseconds of the epoch
operations            Kind and number of operations mapped on the associated processing units
total duration        Time required to execute the epoch, including the pre/post phases
mcu cycles            Number of CPU/MCU clock cycles per phase
core npu cycles       Number of NPU clock cycles required to execute the epoch (core part only)
compute cycles ratio  Ratio between the “compute/ideal” cycles and the measured cycles
busif X               Number of read/write burst requests per bus interface, with 8-byte granularity for the number of r/w memory accesses
strg engines          For each used stream engine, the active time is reported, together with the associated buffer size
NPU cache cnts        NPU cache operations, by read/write type
<mempool>             Peak memory bandwidth (based on cycles where the NPU is active, core phase) and associated number of r/w accesses

For the stream engines, the peak bandwidth is not computed because the real number of r/w accesses is not known; the reported percentage is relative to the execution time of the core part.
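
For example, in the per-epoch listing above, the compiler reports 4608 compute cycles while 13300 core NPU cycles are measured, giving a core-only compute cycles ratio of 4608 / 13300 ≈ 34.65%; the overall 10.59% value is obtained by dividing the same compute cycles by the full epoch duration (pre + core + post phases) expressed in NPU cycles.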

Final part (entire model)

   Compiler
    cycles                : 15424817 (total max_cycles)
    ideal cycles          : 7485802 (total compute_cycles)
    ops                   : 510646150
    ops/cycle             : 33
    inf/s                 : 64.8 (based on estimated total max_cycles)
    ideal ops/cycle       : 68
    ideal inf/s           : 133.6

   Measured
    total duration        : 13.649ms, npu_core:13.159ms (96.41%), mcu_cycles=[1.17, 96.31, 2.53]%
    mcu cycles            : 10919340 (core only = 10515928)
    npu cycles            : 13692714 (core only = 13159399)
    ops/cycle             : 37.3 GOPS / including SW epochs (core only = 38.8)
    inf/s                 : 73.3 (13.649ms)
    compute cycles ratio  : 54.67% (core only: 56.89%)

   Memory bandwidth / inference
    npuRAM5               : r=20572262   w=11045878   -> 2.40 GB/s (peak), 2.32 GB/s (average)
    octoFlash             : r=1758190    w=0          -> 133.60 MB/s (peak), 128.80 MB/s (average)
    cpuRAM2               : r=802816     w=602112     -> 106.75 MB/s (peak), 102.92 MB/s (average)
    npuRAM4               : r=1806336    w=301056     -> 160.13 MB/s (peak), 154.38 MB/s (average)
    total r/w             : r=24939604   w=11949046   -> 2.80 GB/s (peak), 2.70 GB/s (average)

Compiler Section

Parameter        Description
cycles           Total number of estimated cycles (including memory accesses)
ideal cycles     Total number of computed cycles (ideal memory)
ops              Total number of mathematical operations required to execute the model
ops/cycle        Estimated number of operations per cycle (including memory accesses)
inf/s            Estimated inferences per second (including memory accesses)
ideal ops/cycle  Estimated number of operations per cycle (ideal memory)
ideal inf/s      Estimated inferences per second (ideal memory)
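
The inf/s values follow directly from the cycle counts and the npu_freq=1000MHz reported in the device attributes: 1e9 / 15424817 ≈ 64.8 inf/s and 1e9 / 7485802 ≈ 133.6 ideal inf/s; similarly, 510646150 ops / 15424817 cycles ≈ 33 ops/cycle.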

Measured Section

Parameter             Description
total duration        Time required to execute the model (that is, latency), including the pre/post phases
mcu cycles            Number of CPU/MCU clock cycles
npu cycles            Number of clock cycles required by the hardware accelerator
ops/cycle             Real number of operations per cycle (including SOFTWARE/HYBRID epochs)
inf/s                 Inferences per second, including the pre/post phases
compute cycles ratio  Ratio between the “compute/ideal” cycles and the measured cycles (including pre/post and SW/HYBRID epochs)

Memory Bandwidth Section

For each used memory pool, the peak memory bandwidth (based on cycles where the NPU is active during the core phase) and the average memory bandwidth (for the entire model) are reported, along with the associated number of read/write accesses (in bytes).
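
For example, for npuRAM5 the reported accesses sum to 20572262 + 11045878 ≈ 31.6 MB per inference; divided by the 13.159 ms NPU-active (core) time this gives the 2.40 GB/s peak value, and divided by the 13.649 ms total duration it gives the 2.32 GB/s average value.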

Guideline to Use the LL ATON runtime Services

The LL ATON runtime provides low-level services (ll_aton_dbgtrc.h file) that allow the counters of the Debug and Trace Unit of the ST Neural-ART IP to be configured and used. This hardware unit collects various internal signals and routes them to specific event counters. There are 16 counters, each 32 bits wide. The counters can be configured to count levels or active edges.

User callbacks

Depending on the expected granularity and the observed events, it is recommended to use the user callback services to configure, enable, and dump the counters.

void LL_ATON_RT_SetRuntimeCallback(TraceRuntime_FuncPtr_t rt_callback);
void LL_ATON_RT_SetEpochCallback(TraceEpochBlock_FuncPtr_t epoch_block_callback, NN_Instance_TypeDef *nn_instance);

The first is used to enable and clock the Debug and Trace Unit when the stack is initialized or deinitialized (calls to LL_ATON_RT_RuntimeInit()/LL_ATON_RT_RuntimeDeInit()).

void _rt_callback(LL_ATON_RT_Callbacktype_t ctype)
{
  if(ctype == LL_ATON_RT_Callbacktype_RT_Init){
    LL_Dbgtrc_EnableClock();
    LL_Dbgtrc_Init(0);
  }
  else {
    LL_Dbgtrc_Deinit(0);
    LL_Dbgtrc_DisableClock();
  }
}

The second is used to configure and enable/disable the counters in order to collect the expected events during the execution of a given epoch.

static void _epoch_callback(LL_ATON_RT_Callbacktype_t ctype,
                            const NN_Instance_TypeDef *nn_instance,
                            const LL_ATON_RT_EpochBlockItem_t *epoch_block)
{
    const uint32_t ts = _get_free_running_counters();
    if (ctype == LL_ATON_RT_Callbacktype_PRE_START)
    {
        /* Initialize/enable the counters */
        _configure_counters(epoch_block);
    }
    else if (ctype == LL_ATON_RT_Callbacktype_POST_START)
    {
        /* compute/store duration of the start phase */
    }
    else if (ctype == LL_ATON_RT_Callbacktype_PRE_END)
    {
        /* compute/store duration of the core phase */
    }
    else if (ctype == LL_ATON_RT_Callbacktype_POST_END)
    {
        /* compute/store duration of the post phase */
        /* Stop/dump the counters */
        _dump_counters(epoch_block);
    } 
}
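
For completeness, a minimal registration sketch is shown below; nn_instance is assumed to be the NN_Instance_TypeDef of the deployed network (its actual name is application specific):

/* Register the profiling callbacks once the runtime is set up.
   'nn_instance' is an assumption: it must be the instance created
   for the deployed network by the application. */
extern NN_Instance_TypeDef nn_instance;

void _register_profiling_callbacks(void)
{
  LL_ATON_RT_SetRuntimeCallback(_rt_callback);
  LL_ATON_RT_SetEpochCallback(_epoch_callback, &nn_instance);
}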

Measured events

The LL ATON files (ll_aton_dbgtrc.h/.c) provide helper functions to monitor the typical hardware events described in the following subsections: epoch duration, stream engine activity, and the number of r/w bursts.

Warning

The Debug and Trace Unit of the ST Neural-ART IP provides 16 counters, which is not enough to monitor all the different measures simultaneously. Consequently, multiple inferences may be required to evaluate the various metrics.

Compute the duration of an epoch

The LL_Dbgtrc_Count_Epoch_Len() function calculates the duration of an epoch. It requires the ID of the first input stream engine, which triggers the start of the epoch, and the ID of the last output stream engine, which triggers the end of the epoch. These two events activate a free-running counter based on the NPU clock to measure the effective duration of the specified epoch. However, the NPU compiler does not provide these IDs: the in_streng_mask and out_streng_mask fields only indicate the full set of stream engine IDs used.

const EpochBlock_ItemTypeDef *LL_ATON_EpochBlockItems_Default(void) {
 static const EpochBlock_ItemTypeDef ll_atonn_rt_epoch_block_array[] = {
...
    {
      .start_epoch_block = LL_ATON_Start_EpochBlock_115,
      .end_epoch_block = LL_ATON_End_EpochBlock_115,
      .wait_mask = 0x00000001,
      .flags = EpochBlock_Flags_epoch_start | EpochBlock_Flags_epoch_end | EpochBlock_Flags_pure_hw,
#ifdef LL_ATON_EB_DBG_INFO
      .epoch_num = 115,
      .last_epoch_num = 115,
      .in_streng_mask = 0x00000280,
      .out_streng_mask = 0x00000001,
      .estimated_npu_cycles = 0,
      .estimated_tot_cycles = 0,
#endif // LL_ATON_EB_DBG_INFO
    },
...
  };
  return ll_atonn_rt_epoch_block_array;
}

To approximate the duration of a specific epoch, a free-running counter is used instead. It is configured during the PRE_START callback, and its value is used as a time-stamp to evaluate the duration of each epoch phase (pre, core, and post).

  • Configure and start a free running counter
LL_Dbgtrc_Counter_InitTypdef counter_init;
counter_init.signal = DBGTRC_VDD;
counter_init.evt_type = DBGTRC_EVT_HI;
counter_init.wrap = 0;
counter_init.countdown = 0;
counter_init.int_disable = 1;
counter_init.counter = 0;
LL_Dbgtrc_Counter_Init(0, _COUNTER_ID, &counter_init);
LL_Dbgtrc_Counter_Start(0, _COUNTER_ID); 
  • Read the free running counter
LL_Dbgtrc_Counter_Read(0, _COUNTER_ID);
  • Reset the counter
volatile uint32_t *reg = (volatile uint32_t *)(ATON_DEBUG_TRACE_EVENT_CNT_ADDR(0, _NPU_CLK_COUNTER));
*reg = 0;

Stream engine activity

To measure the number of cycles during which a given set of stream engines is active (that is, not stalled) during an epoch execution, the following helper functions are used (one counter is required per stream engine).

int LL_Dbgtrc_Count_StrengActive_Config(uint32_t istreng, uint32_t ostreng, unsigned int counter);
int LL_Dbgtrc_Count_StrengActive_Start(uint32_t istreng, uint32_t ostreng, unsigned int counter);
int LL_Dbgtrc_Count_StrengActive_Stop(uint32_t istreng, uint32_t ostreng, unsigned int counter);
  • Configure and start the counters

Since the maximum number of used stream engines (<10 for the STM32N6 NPU configuration) is less than the number of available counters, the in_streng_mask and out_streng_mask fields are used to configure and start the counters during the PRE_START callback.

void _configure_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
    LL_Dbgtrc_Count_StrengActive_Config(
        epoch_block->in_streng_mask,
        epoch_block->out_streng_mask,
        _COUNTER_BASE_IDX);
    LL_Dbgtrc_Count_StrengActive_Start(
        epoch_block->in_streng_mask,
        epoch_block->out_streng_mask,
        _COUNTER_BASE_IDX);
}
  • Read the counters
void _dump_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
    int i, r_idx;
    uint32_t counters[16];
    int id_base = _COUNTER_BASE_IDX;
    LL_Dbgtrc_Count_StrengActive_Stop(
        epoch_block->in_streng_mask,
        epoch_block->out_streng_mask,
        _COUNTER_BASE_IDX);
    /* read the counters */
    r_idx = 0;
    for (i = 0; i < ATON_STRENG_NUM; i++)
    {
        if ((epoch_block->in_streng_mask | epoch_block->out_streng_mask) & (1 << i))
        {
            counters[r_idx] = LL_Dbgtrc_Counter_Read(0, id_base++);
            r_idx++;
        }
    }
}

Note

As the increment of the associated counters is based on real hardware events, the reported values are precise.

Number of r/w bursts

To evaluate the memory traffic (number of r/w memory accesses) during an epoch execution, the following helper functions are used.

int LL_Dbgtrc_Count_BurstsLen(unsigned int counter, unsigned char busif, unsigned char readwrite);
int LL_Dbgtrc_BurstLenBenchStart(unsigned int counter_id);
int LL_Dbgtrc_BurstLenGet(unsigned int counter_id, unsigned int *counters);

For each bus interface (0 or 1), four counters are required to count the bursts of lengths [1, 2, 4, 8] in read or write mode. Each bus interface is 8 bytes wide, meaning that each burst beat transfers 8 bytes.
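
For example, in the per-epoch listing above, busif 0 reports rburst[1,2,4,8]x8=(8,12,16,208), that is, (8×1 + 12×2 + 16×4 + 208×8) × 8 bytes = 14080 bytes, which matches the reported r=14080 value.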

  • Configure and start the counters

To evaluate all the burst lengths during an epoch execution, all 16 counters are used (4 burst lengths × 2 bus interfaces × 2 directions).

void _configure_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
    /* equivalent to LL_Dbgtrc_BurstLenBenchStart(0); */
    const int counter_id = 0;
    LL_Dbgtrc_Count_BurstsLen(counter_id, 0, 0);      /* Busif 0 writes */
    LL_Dbgtrc_Count_BurstsLen(counter_id + 4, 0, 1);  /* Busif 0 reads */
    LL_Dbgtrc_Count_BurstsLen(counter_id + 8, 1, 0);  /* Busif 1 writes */
    LL_Dbgtrc_Count_BurstsLen(counter_id + 12, 1, 1); /* Busif 1 reads */
}
  • Read the counters
void _dump_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
    int i;
    const int counter_id = 0;
    uint32_t counters[16];
    for (i = 0; i < 16; i++)
        counters[i] = LL_Dbgtrc_Counter_Read(0, counter_id + i);
}
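
As an illustration, the raw counters can then be converted into a byte count. The following sketch assumes the counter layout of _configure_counters() above and the [1, 2, 4, 8] burst-length ordering; the helper name is hypothetical:

#include <stdint.h>

/* Convert one group of 4 consecutive burst-length counters
   (burst lengths 1/2/4/8) into a number of bytes, assuming 8-byte beats. */
static uint32_t _bursts_to_bytes(const uint32_t *counters)
{
  static const uint32_t burst_len[4] = {1, 2, 4, 8};
  uint32_t beats = 0;
  for (int i = 0; i < 4; i++)
    beats += counters[i] * burst_len[i];
  return beats * 8; /* 8-byte wide bus interface */
}

/* With the layout of _configure_counters():
   busif 0 write bytes = _bursts_to_bytes(&counters[0]);
   busif 0 read bytes  = _bursts_to_bytes(&counters[4]);
   busif 1 write bytes = _bursts_to_bytes(&counters[8]);
   busif 1 read bytes  = _bursts_to_bytes(&counters[12]); */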

Note

As the increment of the associated counters is based on real hardware events, the reported values are precise.