How to profile a deployed model
for STM32 target, based on ST Edge AI Core Technology 2.2.0
r1.0
Overview
The npu_profiler.py file is a utility script used to profile a deployed model through the aiValidation stack. Based on the ai_runner Python module and the information provided by the NPU compiler, the following information is reported per epoch:
- The total epoch duration (including the pre/core/post phases) is expressed in milliseconds (ms) and/or MCU cycles.
- The epoch length is expressed in NPU cycles.
- The stream engines activities are measured in NPU cycles.
- The number of read/write burst requests by interface (busif0 & 1) is recorded.
- The NPU cache activities are monitored.
- The average bandwidth by interface and by used memory region
(memory pool) is calculated.
- The used processing units and associated types of operations are
identified.
- The number of operations per cycle (ops/cycle) is reported as ideal, core part only, and measured values.
Setting up a work environment
The stm_ai_runner Python package includes the npu_profiler.py script, located at $STEDGEAI_CORE_DIR/scripts/ai_runner/examples/npu_profiler.py.
To set up the work environment, ensure that all dependencies listed
in the “How to use the
AiRunner package” article are installed.
How-to
Model deployment
The model is deployed on the STM32N6-DK development board with an instrumented version of the aiValidation stack that collects the data during the inferences (see the “Getting started - How to evaluate a model on STM32N6 board” article).
Requested NPU compiler option
Any memory pool configuration or option can be used (no limitations). Only the epoch controller must not be used, in order to keep the epoch-level granularity.
To report the NPU compiler information, the following compiler options are required (the generation of the “c_info.json” file is added by the front end, that is, the stedgeai generate command):
--csv-file network.csv --output-info-file "c_info.json"
Note
To obtain relevant information from the NPU compiler, the attributes in the memory pool descriptors file should be correctly defined and aligned with the hardware settings, particularly the freqRatio and byteWidth attributes.
Note
To evaluate the additional memory traffic required to fetch the bit-stream per epoch, the --enable-epoch-controller --ec-single-epoch options can be used. These options allow the epoch controller to be used while maintaining epoch granularity. However, in this mode, some information generated by the NPU compiler is not available.
Step-by-step
Generate/compile the model
$ stedgeai generate -m <model_file> --target stm32n6 --st-neural-art <profile>@neural_art.json
Deploy the generated model on the STM32N6-DK development board with the aiValidation stack
$ python $STEDGEAI_CORE_DIR/scripts/N6_scripts/n6_loader.py ..
Launch the npu_profiler utility
$ python $STEDGEAI_CORE_DIR/scripts/ai_runner/examples/npu_profiler.py --cfile ./st_ai_ws/neural_art__network/
The '--cfile' argument indicates the location of the generated network.c file. The associated folder should contain the c_info.json file.
Runtime instrumentation
The user callbacks from the LL_ATON stack are used to enable/set and to dump the different hardware counters (mainly from the NPU debug and trace unit).
ts indicates the time-stamp based on an MCU or NPU free-running counter, which is used to measure the execution time of the different steps. pre/core/post indicate the different phases required to execute a given epoch.
Precision of the measures
The counters are started and stopped during the execution of the user callbacks and are based on the NPU and/or MCU clocks without a divider, which introduces some jitter in the measurements. This jitter can be mitigated by performing multiple inferences, and the remaining error is generally not significant compared to the epoch length.
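As an illustration of how multiple inferences can smooth out the jitter, the following minimal sketch averages the cycle counts collected for the same epoch over several runs and reports their spread. The function names and the sample values in the usage note are hypothetical and are not part of the npu_profiler.py script or of the LL ATON runtime.

```c
#include <stddef.h>
#include <stdint.h>

/* Mean of the cycle counts measured for one epoch over n inferences.
 * Returns 0.0 when no sample is available. */
double mean_cycles(const uint32_t *samples, size_t n)
{
    uint64_t sum = 0;
    if (n == 0) {
        return 0.0;
    }
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    return (double)sum / (double)n;
}

/* Spread (max - min) of the samples, used as a rough jitter indicator.
 * Returns 0 when no sample is available. */
uint32_t spread_cycles(const uint32_t *samples, size_t n)
{
    uint32_t lo, hi;
    if (n == 0) {
        return 0;
    }
    lo = hi = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    return hi - lo;
}
```

For example, four hypothetical measurements of 13300, 13310, 13295, and 13303 NPU cycles give a mean of 13302 cycles with a spread of 15 cycles, which is small compared to the epoch length.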
Reported Profiling Metrics
After executing different inferences with various counter configurations, the npu_profiler utility returns results for the entire model as well as for individual epochs of the model.
Initial part (runtime information)
After connecting to the STM32N6-DK development board, the main runtime information is reported. This includes the main characteristics of the deployed model (I/O format/size), the number of epochs, the tools version, and other relevant details.
NPU Utility - AiRunner Profiler (version 0.2)
Creating date : Tue Mar 4 22:27:25 2025
Creating AiRunner session with `serial:921600` descriptor
Proto-buffer driver v2.0 (msg v3.1) (Serial driver v1.0 - COM40:921600) ['network']
Summary 'network' - ['network']
------------------------------------------------------------------------------------------------------------
I[1/1] 'Input_11_out_0' : uint8[1,224,224,3], 150528 Bytes, QLinear(0.003921569,0,uint8), activations
O[1/1] 'Transpose_597_out_0' : uint8[1,3087,6], 18522 Bytes, QLinear(0.007704938,3,uint8), activations
n_nodes : 199
activations : 1254400
compile_datetime : Mar 3 2025 20:53:13
------------------------------------------------------------------------------------------------------------
protocol : Proto-buffer driver v2.0 (msg v3.1) (Serial driver v1.0 - COM40:921600)
tools : ST Neural ART (LL_ATON api) v1.0.0
runtime lib : atonn-v1.0.0-rc0-16-g0307e413 (optimized SW lib v10.0.0-fd22b7f9 GCC)
capabilities : IO_ONLY, PER_LAYER, PER_LAYER_WITH_DATA
device.desc : stm32 family - 0x486 - STM32N6xx @800/400MHz
device.attrs : fpu,core_icache,core_dcache,npu_cache=1,mcu_freq=800MHz,
noc_freq=400MHz,npu_freq=1000MHz,nic_freq=900MHz
------------------------------------------------------------------------------------------------------------
Per epoch
------------------------------------------------------------
EpochBlock_186 (c_idx=192, num=186:186, c_type=HW, extra=0)
------------------------------------------------------------
[compiler]
epoch type : EpochType.HW
operations : {'hw.conv': 4, 'hw.mul': 1, 'hw.add': 2, 'hw.sigmoid': 1, 'hw.dmain': 6, 'hw.dmaout': 2}
processor units : {'CONV_ACC': 4, 'ARITH_ACC': 3, 'ACTIV_ACC': 1, 'STR_ENG': 8}
ops : 453348
compute cycles : 4608 (max_cycles=27648)
ops/cycle : 16.4 GOPS (ideal=98.4)
mem accesses : [octoFlash: r=4608, w=0] [npuRAM5: r=19600, w=8820]
[target]
duration : 0.043ms (0.3%) total=16.431ms
mcu cycles : [18859, 10599, 5253] -> 0.024,0.013,0.007 ms (mcu_freq=800MHz), 0.043ms
core npu cycles : 13300 -> 0.013 ms (npu_freq=1000MHz) diff. vs max_cycles from compiler: -14348 (-107.88%)
compute cycles ratio : 10.59% (core only: 34.65%)
busif 0 : r=14080 w=0 rburst[1,2,4,8]x8=(8,12,16,208) wburst[1,2,4,8]x8=(0,0,0,0) -> 1.06 GB/s
busif 1 : r=10208 w=7960 rburst[1,2,4,8]x8=(4,12,8,152) wburst[1,2,4,8]x8=(5,9,9,117) -> 1.37 GB/s
busif (total) : r=24288 w=7960 -> 2.42 GB/s
ops/cycle : 10.4 GOPS (core only: 34.1)
strg engines : 114.21% (id=4)
active : STRENG.1.i: busif=0, 5845 cycles, [npuRAM5:12544], 44%
active : STRENG.2.i: busif=0, 5679 cycles, [npuRAM5:12544], 43%
active : STRENG.3.i: busif=0, 5618 cycles, [npuRAM5:12544], 42%
active : STRENG.4.i: busif=0, 15190 cycles, [octoFlash:4608], 114%
active : STRENG.5.i: busif=1, 6788 cycles, [npuRAM5:1764], 51%
active : STRENG.6.i: busif=1, 6270 cycles, [npuRAM5:12544], 47%
active : STRENG.7.o: busif=1, 13935 cycles, [npuRAM5:882], 105%
active : STRENG.8.o: busif=1, 13785 cycles, [npuRAM5:1764], 104%
NPU cache cnts : R[hit=0, miss=80, alloc-miss=0, evict=0], W[hit=0, miss=0, alloc-miss=0, through=0]
octoFlash : r=4608 w=0 -> 346.47 MB/s
npuRAM5 : r=19600 w=8820 -> 2.14 GB/s
total : r=24208 w=8820 -> 2.48 GB/s
Compiler Section
Parameter | Description |
---|---|
epoch type | Type of epoch |
operations | Kind and number of operations mapped on the associated processing units |
processor units | Used processing units |
ops | Total number of mathematical operations required to execute the epoch |
compute cycles | Total number of computed cycles (ideal memory) |
ops/cycle | Estimated number of operations per cycle (including memory accesses) |
mem accesses | Number of read/write memory accesses per memory pool |
Target Section
Parameter | Description |
---|---|
duration | Duration in milliseconds of the epoch |
operations | Kind and number of operations mapped on the associated processing units |
total duration | Time required to execute the epoch, including the pre/post phases |
mcu cycles | Number of CPU/MCU clock cycles per phase |
core NPU cycles | Number of NPU clock cycles to execute the epoch (core part only) |
compute cycles ratio | The ratio between the “compute/ideal” cycles and the measured cycles |
busif X | Number of read/write burst requests by interface, with 8-Byte granularity for the number of r/w memory accesses |
strg engines | For each used stream engine, active time is reported including the associated buffer size |
NPU cache cnts | Indicates the NPU cache operations, read/write type |
<mempool> | Peak memory bandwidth (based on cycles where the NPU is active, core phase) and associated number of r/w accesses |
For the strg engines, the peak bandwidth is not computed because the real number of r/w accesses is not known. The percentage is relative to the execution time of the core part.
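The relations between the values reported in the [target] section can be sketched as follows. These formulas are inferred from the figures of the sample report for EpochBlock_186 and are an assumption, not a description of the actual npu_profiler.py implementation. The function names are hypothetical.

```c
#include <stdint.h>

/* Convert a cycle count to milliseconds for a clock frequency given in MHz. */
double cycles_to_ms(uint64_t cycles, double freq_mhz)
{
    return (double)cycles / (freq_mhz * 1000.0);
}

/* Express 'part' as a percentage of 'whole' (both in cycles). */
double percent_of(uint64_t part, uint64_t whole)
{
    return 100.0 * (double)part / (double)whole;
}

/* Number of operations per cycle. */
double ops_per_cycle(uint64_t ops, uint64_t cycles)
{
    return (double)ops / (double)cycles;
}
```

With the values of the sample report, cycles_to_ms(13300, 1000.0) gives about 0.013 ms for the core part, percent_of(15190, 13300) gives about 114.21% for STRENG.4.i, percent_of(4608, 13300) gives about 34.65% for the core-only compute cycles ratio, and ops_per_cycle(453348, 13300) gives about 34.1 for the core-only ops/cycle.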
Final part (entire model)
Compiler
cycles : 15424817 (total max_cycles)
ideal cycles : 7485802 (total compute_cycles)
ops : 510646150
ops/cycle : 33
inf/s : 64.8 (based on estimated total max_cycles)
ideal ops/cycle : 68
ideal inf/s : 133.6
Measured
total duration : 13.649ms, npu_core:13.159ms (96.41%), mcu_cycles=[1.17, 96.31, 2.53]%
mcu cycles : 10919340 (core only = 10515928)
npu cycles : 13692714 (core only = 13159399)
ops/cycle : 37.3 GOPS / including SW epochs (core only = 38.8)
inf/s : 73.3 (13.649ms)
compute cycles ratio : 54.67% (core only: 56.89%)
Memory bandwidth / inference
npuRAM5 : r=20572262 w=11045878 -> 2.40 GB/s (peak), 2.32 GB/s (average)
octoFlash : r=1758190 w=0 -> 133.60 MB/s (peak), 128.80 MB/s (average)
cpuRAM2 : r=802816 w=602112 -> 106.75 MB/s (peak), 102.92 MB/s (average)
npuRAM4 : r=1806336 w=301056 -> 160.13 MB/s (peak), 154.38 MB/s (average)
total r/w : r=24939604 w=11949046 -> 2.80 GB/s (peak), 2.70 GB/s (average)
Compiler Section
Parameter | Description |
---|---|
cycles | Total number of estimated cycles (including memory accesses) |
ideal cycles | Total number of computed cycles (ideal memory) |
ops | Total number of mathematical operations required to execute the model |
ops/cycle | Estimated number of operations per cycle (including memory accesses) |
inf/s | Estimated inferences per second (including memory accesses) |
ideal ops/cycle | Estimated number of operations per cycle (ideal memory) |
ideal inf/s | Estimated inferences per second (ideal memory) |
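The compiler estimates for the entire model can be cross-checked with a few lines of arithmetic. The sketch below assumes that inf/s is the NPU frequency divided by the estimated cycles and that ops/cycle is the number of operations divided by the same cycles. This assumption reproduces the figures of the sample report (npu_freq=1000MHz) but is not taken from the tool sources. The function names are hypothetical.

```c
#include <stdint.h>

/* Estimated inferences per second for a cycle count and an NPU clock in MHz. */
double estimated_inf_per_s(uint64_t cycles, double npu_freq_mhz)
{
    return (npu_freq_mhz * 1.0e6) / (double)cycles;
}

/* Estimated number of operations per cycle. */
double estimated_ops_per_cycle(uint64_t ops, uint64_t cycles)
{
    return (double)ops / (double)cycles;
}
```

With the sample values, estimated_inf_per_s(15424817, 1000.0) gives about 64.8 and estimated_inf_per_s(7485802, 1000.0) gives about 133.6. Similarly, estimated_ops_per_cycle(510646150, 15424817) gives about 33 and estimated_ops_per_cycle(510646150, 7485802) gives about 68.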
Measured Section
Parameter | Description |
---|---|
total duration | Time required to execute the model (that is, latency), including the pre/post phases |
mcu cycles | Number of CPU/MCU clock cycles |
npu cycles | Number of clock cycles required by the hardware accelerator |
ops/cycle | Real number of operations per cycle (including SOFTWARE/HYBRID epochs) |
inf/s | Inferences per second, including the pre/post phases |
compute cycles ratio | Ratio between the “compute/ideal” cycles and the measured cycles (including pre/post and SW/HYBRID epochs) |
Memory Bandwidth Section
For each used memory pool, the peak memory bandwidth (based on cycles where the NPU is active during the core phase) and the average memory bandwidth (for the entire model) are reported, along with the associated number of read/write accesses (in bytes).
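The difference between the peak and the average bandwidth comes only from the reference duration. The following minimal sketch assumes that the peak value divides the transferred bytes by the NPU core time, that the average value divides them by the total duration, and that 1 GB/s means 10^9 bytes per second. These assumptions approximately reproduce the npuRAM5 line of the sample report but are not confirmed by the tool sources. The function name is hypothetical.

```c
#include <stdint.h>

/* Bandwidth in bytes per second for r_bytes read and w_bytes written
 * during duration_ms milliseconds. */
double bandwidth_bytes_per_s(uint64_t r_bytes, uint64_t w_bytes, double duration_ms)
{
    return (double)(r_bytes + w_bytes) / (duration_ms / 1000.0);
}
```

For npuRAM5 (r=20572262, w=11045878), a reference duration of 13.159399 ms (NPU core part) gives about 2.40 GB/s, while the total duration of 13.649 ms gives about 2.32 GB/s.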
Guideline to Use the LL ATON runtime Services
The LL ATON runtime provides low-level services (ll_aton_dbgtrc.h file) to configure and use the counters of the Debug and Trace Unit of the ST Neural-ART IP. This hardware unit enables the collection of various internal signals and routes them to specific event counters. There are 16 counters, each 32 bits wide. The counters can be configured to count levels or active edges.
User callbacks
According to the expected granularity and the observed events, it is recommended to use the user callback services to configure, enable, and dump the counters.
void LL_ATON_RT_SetRuntimeCallback(TraceRuntime_FuncPtr_t rt_callback);
void LL_ATON_RT_SetEpochCallback(TraceEpochBlock_FuncPtr_t epoch_block_callback, NN_Instance_TypeDef *nn_instance);
The first callback is used to enable the clock of the Debug and Trace Unit and to initialize it when the stack is initialized, and to release it when the stack is de-initialized (calls to LL_ATON_RT_RuntimeInit()/LL_ATON_RT_RuntimeDeInit()).
void _rt_callback(LL_ATON_RT_Callbacktype_t ctype)
{
  if (ctype == LL_ATON_RT_Callbacktype_RT_Init) {
    LL_Dbgtrc_EnableClock();
    LL_Dbgtrc_Init(0);
  }
  else {
    LL_Dbgtrc_Deinit(0);
    LL_Dbgtrc_DisableClock();
  }
}
The second is used to configure and to enable/disable the counters to collect the expected events during the execution of a given epoch.
static void _epoch_callback(LL_ATON_RT_Callbacktype_t ctype,
                            const NN_Instance_TypeDef *nn_instance,
                            const LL_ATON_RT_EpochBlockItem_t *epoch_block)
{
  const uint32_t ts = _get_free_running_counters();
  if (ctype == LL_ATON_RT_Callbacktype_PRE_START)
  {
    /* Initialize/enable the counters */
    _configure_counters(epoch_block);
  }
  else if (ctype == LL_ATON_RT_Callbacktype_POST_START)
  {
    /* compute/store duration of the start phase */
  }
  else if (ctype == LL_ATON_RT_Callbacktype_PRE_END)
  {
    /* compute/store duration of the core phase */
  }
  else if (ctype == LL_ATON_RT_Callbacktype_POST_END)
  {
    /* compute/store duration of the post phase */
    /* Stop/dump the counters */
    _dump_counters(epoch_block);
  }
}
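The "compute/store duration" steps of the epoch callback can be sketched as follows, assuming that one time-stamp is saved at each of the four callback types. The epoch_ts_t and epoch_phases_t types and the function names are hypothetical helpers introduced for this illustration. They are not part of the LL ATON runtime. Because the counters are 32 bits wide, an unsigned subtraction keeps the result correct when the free-running counter wraps once between two time-stamps.

```c
#include <stdint.h>

/* Hypothetical container for the time-stamps taken at each callback type. */
typedef struct {
    uint32_t pre_start;   /* taken at PRE_START  */
    uint32_t post_start;  /* taken at POST_START */
    uint32_t pre_end;     /* taken at PRE_END    */
    uint32_t post_end;    /* taken at POST_END   */
} epoch_ts_t;

/* Hypothetical container for the duration of each phase, in counter ticks. */
typedef struct {
    uint32_t pre;
    uint32_t core;
    uint32_t post;
} epoch_phases_t;

/* Modulo-2^32 difference: still correct if the counter wrapped once. */
static uint32_t elapsed_ticks(uint32_t from, uint32_t to)
{
    return (uint32_t)(to - from);
}

/* Derive the pre/core/post durations from the four time-stamps. */
epoch_phases_t epoch_phases(const epoch_ts_t *ts)
{
    epoch_phases_t p;
    p.pre = elapsed_ticks(ts->pre_start, ts->post_start);
    p.core = elapsed_ticks(ts->post_start, ts->pre_end);
    p.post = elapsed_ticks(ts->pre_end, ts->post_end);
    return p;
}
```

For example, with time-stamps 0xFFFFFF00, 0x00000100, 0x00003500, and 0x00004000, the counter wraps during the pre phase and the computed durations are still 512, 13312, and 2816 ticks.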
Measured events
The LL ATON files (ll_aton_dbgtrc.h/c) provide helper functions to monitor the typical hardware events. They allow you to:
- Compute the duration of an epoch.
- Configure a group of counters to measure the number of cycles during which a given set of stream engines is active (that is, not stalled) during an epoch execution.
- Count the burst lengths on a given bus interface.
Warning
The Debug and Trace Unit of the ST Neural-ART IP provides 16 counters, which is not enough to monitor all the measures simultaneously. Consequently, multiple inferences may be required to evaluate the various metrics.
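To illustrate why several inferences may be needed, the following sketch packs groups of counters into successive passes of at most 16 counters. The grouping policy (groups are kept whole and assigned in order) and all names are assumptions made for this example. The npu_profiler.py script may schedule its counter configurations differently.

```c
#include <stddef.h>

/* Number of counters in the Debug and Trace Unit, as stated in the text. */
#define PROFILER_NUM_COUNTERS 16u

/* Return the number of inferences (passes) needed to measure all groups,
 * where group_sizes[i] is the number of counters required by group i.
 * Returns 0 if a single group does not fit in the available counters. */
unsigned passes_needed(const unsigned *group_sizes, size_t n_groups)
{
    unsigned passes = 0;
    unsigned used = PROFILER_NUM_COUNTERS; /* forces a new pass for the first group */

    for (size_t i = 0; i < n_groups; i++) {
        if (group_sizes[i] > PROFILER_NUM_COUNTERS) {
            return 0;
        }
        if (group_sizes[i] == 0) {
            continue;
        }
        if (used + group_sizes[i] > PROFILER_NUM_COUNTERS) {
            passes++;   /* start a new inference with free counters */
            used = 0;
        }
        used += group_sizes[i];
    }
    return passes;
}
```

For example, an epoch using three stream engines (3 counters) cannot be measured together with all burst lengths (16 counters), so passes_needed() returns 2 for the sizes {3, 16}.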
Computes the duration of an epoch
The LL_Dbgtrc_Count_Epoch_Len() function measures the duration of an epoch. It requires the ID of the first input stream engine, which triggers the start of the epoch, and the ID of the last output stream engine, which triggers the end of the epoch. These two events activate a free-running counter based on the NPU clock to measure the effective duration of the specified epoch. However, the NPU compiler does not provide these IDs. The in_streng_mask and out_streng_mask fields only indicate the complete set of stream engine IDs used.
const EpochBlock_ItemTypeDef *LL_ATON_EpochBlockItems_Default(void) {
  static const EpochBlock_ItemTypeDef ll_atonn_rt_epoch_block_array[] = {
    ...
    {
      .start_epoch_block = LL_ATON_Start_EpochBlock_115,
      .end_epoch_block = LL_ATON_End_EpochBlock_115,
      .wait_mask = 0x00000001,
      .flags = EpochBlock_Flags_epoch_start | EpochBlock_Flags_epoch_end | EpochBlock_Flags_pure_hw,
#ifdef LL_ATON_EB_DBG_INFO
      .epoch_num = 115,
      .last_epoch_num = 115,
      .in_streng_mask = 0x00000280,
      .out_streng_mask = 0x00000001,
      .estimated_npu_cycles = 0,
      .estimated_tot_cycles = 0,
#endif // LL_ATON_EB_DBG_INFO
    },
    ...
  };
  return ll_atonn_rt_epoch_block_array;
}
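The stream engine masks can be decoded into a list of IDs with a simple bit scan. The sketch below assumes that bit i of a mask corresponds to stream engine i, which is consistent with the (1 << i) test used when the counters are read. The function name is hypothetical. The decoding only lists the engines used and does not tell which one starts or ends the epoch.

```c
#include <stdint.h>

/* Store in ids[] the position of each bit set in mask, in increasing order.
 * Returns the number of IDs written (at most 32). */
unsigned streng_ids_from_mask(uint32_t mask, unsigned ids[32])
{
    unsigned n = 0;
    for (unsigned bit = 0; bit < 32u; bit++) {
        if (mask & (UINT32_C(1) << bit)) {
            ids[n++] = bit;
        }
    }
    return n;
}
```

With the masks of the example above, in_streng_mask = 0x00000280 decodes to the IDs 7 and 9, and out_streng_mask = 0x00000001 decodes to the ID 0.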
To approximate the duration of a specific epoch, a free-running counter is used. It is configured during the PRE_START callback, and its value is used as a time-stamp to evaluate the duration of each epoch phase (pre, core, and post).
- Configure and start a free running counter
LL_Dbgtrc_Counter_InitTypdef counter_init;
counter_init.signal = DBGTRC_VDD;
counter_init.evt_type = DBGTRC_EVT_HI;
counter_init.wrap = 0;
counter_init.countdown = 0;
counter_init.int_disable = 1;
counter_init.counter = 0;
LL_Dbgtrc_Counter_Init(0, _COUNTER_ID, &counter_init);
LL_Dbgtrc_Counter_Start(0, _COUNTER_ID);
- Read the free running counter
LL_Dbgtrc_Counter_Read(0, _COUNTER_ID);
- Reset the counter
volatile uint32_t *reg = (volatile uint32_t *)(ATON_DEBUG_TRACE_EVENT_CNT_ADDR(0, _NPU_CLK_COUNTER));
*reg = 0;
Stream engine activity
To measure the number of cycles during which a given set of stream engines is active (that is, not stalled) during an epoch execution, the following helper functions are used (one counter is required per stream engine).
int LL_Dbgtrc_Count_StrengActive_Config(uint32_t istreng, uint32_t ostreng, unsigned int counter);
int LL_Dbgtrc_Count_StrengActive_Start(uint32_t istreng, uint32_t ostreng, unsigned int counter);
int LL_Dbgtrc_Count_StrengActive_Stop(uint32_t istreng, uint32_t ostreng, unsigned int counter);
- Configure and start the counters
Since the maximum number of used stream engines (<10 for STM32N6 NPU configuration) is less than the number of available counters, the in_streng_mask and out_streng_mask fields are used to configure and start the counters during the PRE_START callback.
void _configure_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
  LL_Dbgtrc_Count_StrengActive_Config(
      epoch_block->in_streng_mask,
      epoch_block->out_streng_mask,
      _COUNTER_BASE_IDX);
  LL_Dbgtrc_Count_StrengActive_Start(
      epoch_block->in_streng_mask,
      epoch_block->out_streng_mask,
      _COUNTER_BASE_IDX);
}
- Read the counters
void _dump_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
  int i, r_idx;
  uint32_t counters[16];
  int id_base = _COUNTER_BASE_IDX;
  LL_Dbgtrc_Count_StrengActive_Stop(
      epoch_block->in_streng_mask,
      epoch_block->out_streng_mask,
      _COUNTER_BASE_IDX);
  /* read the counters */
  r_idx = 0;
  for (i = 0; i < ATON_STRENG_NUM; i++)
  {
    if ((epoch_block->in_streng_mask | epoch_block->out_streng_mask) & (1 << i))
      counters[r_idx] = LL_Dbgtrc_Counter_Read(0, id_base++);
    r_idx++;
  }
}
Note
As the increment of the associated counters is based on real hardware events, the reported values are precise.
Number of r/w bursts
To evaluate the memory traffic (number of r/w memory accesses) during an epoch execution, the following helper functions are used.
int LL_Dbgtrc_Count_BurstsLen(unsigned int counter, unsigned char busif, unsigned char readwrite);
int LL_Dbgtrc_BurstLenBenchStart(unsigned int counter_id);
int LL_Dbgtrc_BurstLenGet(unsigned int counter_id, unsigned int *counters);
For each bus interface (0 or 1), four counters are required to monitor the count of burst lengths [1, 2, 4, 8] in a given mode (read or write). Each bus interface has a width of 8 bytes, meaning that the granularity of the reported read and write traffic is 8 bytes.
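The conversion from the four burst counters to a number of bytes can be sketched as follows: each burst of length L transfers L beats of 8 bytes. This relation is derived from the bus width given above and matches the busif lines of the sample report. The function name is hypothetical.

```c
#include <stdint.h>

/* Width of a bus interface in bytes, as stated in the text. */
#define BUSIF_WIDTH_BYTES 8u

/* Convert the counts of bursts of length 1, 2, 4, and 8 (in this order)
 * into the number of bytes transferred on the bus interface. */
uint64_t burst_bytes(const uint32_t counts[4])
{
    static const uint32_t burst_len[4] = {1u, 2u, 4u, 8u};
    uint64_t beats = 0;

    for (int i = 0; i < 4; i++) {
        beats += (uint64_t)counts[i] * burst_len[i];
    }
    return beats * BUSIF_WIDTH_BYTES;
}
```

For example, the read bursts (8, 12, 16, 208) reported for busif 0 in EpochBlock_186 give (8 + 24 + 64 + 1664) x 8 = 14080 bytes, which is the reported r=14080 value. The write bursts (5, 9, 9, 117) of busif 1 give 7960 bytes.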
- Configure and start the counters
To evaluate all the burst lengths during an epoch execution, all counters (4x4) are used.
void _configure_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
  /* equivalent to LL_Dbgtrc_BurstLenBenchStart(0); */
  const int counter_id = 0;
  LL_Dbgtrc_Count_BurstsLen(counter_id, 0, 0);      /* Busif 0 writes */
  LL_Dbgtrc_Count_BurstsLen(counter_id + 4, 0, 1);  /* Busif 0 reads */
  LL_Dbgtrc_Count_BurstsLen(counter_id + 8, 1, 0);  /* Busif 1 writes */
  LL_Dbgtrc_Count_BurstsLen(counter_id + 12, 1, 1); /* Busif 1 reads */
}
- Read the counters
void _dump_counters(const LL_ATON_RT_EpochBlockItem_t *epoch_block) {
  int i;
  const int counter_id = 0;
  uint32_t counters[16];
  for (i = 0; i < 16; i++)
    counters[i] = LL_Dbgtrc_Counter_Read(0, counter_id + i);
}
Note
As the increment of the associated counters is based on real hardware events, the reported values are precise.