ST Neural-ART compiler primer
for STM32 target, based on ST Edge AI Core Technology 2.2.0
r1.1
Overview
The article describes the main options and required input files of the ST Neural-ART compiler (or NPU compiler) for deploying AI/DL models on ST Neural-ART NPU-based devices.
Note that the NPU compiler is integrated as a specific back end in the ST Edge AI Core CLI; the CLI acts as a driver or front end. It is, however, also possible to use the NPU compiler directly with the generated intermediate files (ONNX + JSON). Only these input files, which represent the original model, are supported: an ONNX QDQ or quantized TFLite file cannot be passed directly. In this case, the options are passed directly to the compiler as arguments.
The main outputs are:
- A '.c' file representing the specialized or configuration file needed to execute the model against the associated NPU runtime software stack. It is added to a classical embedded C project, which includes the source files of the NPU runtime code (also called 'll_ATON' files) and the associated optimized network runtime library for the delegated or non-hardware-assisted software operators. If the epoch controller (EC) is used, an additional 'network_ecblobs.h' file is created; it contains the “blob” (command stream) for the epoch controller unit.
- The memory initializers, used to program the weights/parameters of the network on the board.
The NPU compiler is configurable and offers numerous
customizations to alter the way a given network is compiled. This
article presents only the main relevant options. To see the complete
list of options with short descriptions, use the --help
option.
$ atonn --help
NPU compiler location
‘atonn’ designates the name of the executable.
%STEDGEAI_CORE_DIR%\Utilities\<host_os>\atonn
%STEDGEAI_CORE_DIR%
indicates the location where the ST Edge AI Core components are installed.
Options
For an STM32N6 device embedding an ST Neural-ART NPU, the minimal options needed to operate properly are the following:
$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
--cache-maintenance --Ocache-opt --mvei --native-float --enable-virtual-mem-pools --load-mdesc stm32n6
Recommended default options:
$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
--cache-maintenance --Ocache-opt --mvei --native-float --enable-virtual-mem-pools --load-mdesc stm32n6
--enable-epoch-controller -Os
Mandatory options
Model/Platform configuration options
Option | Description
---|---
-i, --onnx-input | Input ONNX file path
--json-quant-file | Input JSON file path (quantization information)
-g, --c-output | Output .c file path
-t, --load-mdesc, --target | Specifies the target machine to use; machine description (.mdesc) file path
--load-mpool | Specifies the memory-pool descriptors to use; memory pool (.mpool) file path
- The NPU supports only quantized models that have been “translated” to the ONNX+JSON format by the ST Edge AI Core CLI.
- To enable code generation by the compiler, the '-g' option is mandatory; it sets the name of the '.c' output file.
- The NPU compiler is platform-agnostic. The --load-mdesc/--target option provides the configuration of the Neural-ART accelerator™ that is implemented (also called the machine descriptor file; the 'stm32n6.mdesc' file should not be modified).
- The --load-mpool option provides the description of the memory pools that can be used to implement the provided model. This file is user-specific.
Specific STM32N6 device options
An additional set of mandatory options should be used when using the STM32N6 as the target:
Option | Description
---|---
--native-float | Consider all floats without quantization info as native floats
--mvei | Enable the scratch buffer generation for the Arm® Cortex®-M55 core with the MVEI extension
--cache-maintenance | Enable generation of cache-maintenance code for both the MCU and NPU caches
--Ocache-opt | Enables usage of the NPU cache path for memory pools that have the 'cacheable' attribute set in the memory-pool description file; requires the '--cache-maintenance' option to be passed as well. When requested, the NPU compiler emits instructions that make use of the NPU cache path for the streaming-engine configurations.
- The --native-float option is mandatory to bypass the check that everything in the provided network is quantized. Unquantized parts of the network result in slow floating-point operations, but they do not prevent the compiler from completing its work.
- The --mvei option is needed to use the optimized AI network runtime library for the Arm® Cortex®-M55 core with the MVEI extension.
Warning
For performance reasons, it is recommended to enable the NPU
cache (using the --Ocache-opt
option) for the NPU
memory regions stored in external memories (flash/RAM). The shared
memory regions between the NPU and MCU domains, used for
input/output buffers and by the software epochs, must be cacheable
MCU regions. To guarantee memory
coherence during inference, the --cache-maintenance
option is mandatory. However, it is the application’s responsibility
to use cache-maintenance functions when filling or using the
input/output or activation memory regions before performing the
inference.
Recommended options
Some options are recommended to allow for greater performance, or a better user experience, in most cases:
Option | Description
---|---
--enable-virtual-mem-pools | Enable automatic use of an adjacent memory pool with the same properties for large-buffer allocation
--enable-epoch-controller | Enables code generation using the epoch controller (EC)
-M, --Os | Optimize for size: attempts to reduce the size of the memory buffers
- The --enable-virtual-mem-pools option is important to allow the compiler to “fuse”/“merge” multiple adjacent memory pools. The default behavior is not to allocate buffers that spread over multiple pools; this option removes that constraint. The resulting buffer allocation is usually more space-efficient (and allows a greater use of internal memories to handle activation buffers).
- The -Os option usually provides good results without reducing inference speed.
- The --enable-epoch-controller option is required to enable the epoch controller unit and to generate the “command stream” as C-array tables (part of the 'network_ecblobs.h' file).
Warning
The NPU compiler does not fully integrate support for the epoch controller, so a postprocessing script is needed to build the additional file. This feature is integrated into the ST Edge AI Core; therefore, users should not call the NPU compiler directly.
Option | Description
---|---
--mapping-recap | Useful option to enable the mapping recap during the code-generation phase
Useful options
Option | Category | Description
---|---|---
--out-dir-prefix | global | The output prefix used to specify the output directory for generated files.
--network-name | global | Changes the generated network c-name (default: Default). This results in different symbol names in the .c file.
--options-file | global | Exports the full command line to a .ini file for future import. This option is useful to reproduce a compilation step.
--load-options | global | Imports an exported .ini file
--onnx-output | extra outputs | Output ONNX filename. The NPU compiler performs manipulations/optimizations on the input ONNX file; the resulting ONNX file is the one that is actually transformed into a .c file. This intermediate ONNX file can be exported using this option.
--save-mdesc-file | extra outputs | Dumps the machine description in use to a file
--save-mpool-file | extra outputs | Dumps the memory pools in use to a file
--dot-file | extra outputs | Debug purpose: dumps a .dot file showing the internal behavior of the generated program, that is, a graphical view of the operations done in the .c file (details of the epochs).
--csv-file | extra outputs | Dumps a .csv report file showing details of the memory transfers and estimations of inference time/power consumption per epoch. (The values in the CSV are mainly estimations and may diverge from real-life results.)
-A, --Oauto | optimization | Enables iterative automatic selection of all the optimization options below (increases compilation time)
--Omax-ca-pipe n | optimization | Where n is an int > 0. Sets the maximum number of CONV_ACC units used when pipelining (default is 0, meaning unlimited; the max is defined in the machine descriptor (mdesc) file)
--Oalt-sched | optimization | Enables an alternate scheduler algorithm
--Oconv-split-cw | optimization | Enables optimization to split convolutions channel-wise
--Oconv-split-kw | optimization | Enables optimization to split convolutions kernel-wise
--Oconv-split-stripe | optimization | Enables optimization to split convolutions stripe-wise for 1xN kernels
-O, --optimization | optimization | Optimization level, one of [0-3] (speed)
- Each optimization can be applied one by one. Using -A/--Oauto explores all possible combinations and assesses which one seems to provide the “best”, that is, fastest results.
- Adding --d-auto n (with 0<n<3) provides debug information about the automatic optimization done by -A/--Oauto.
- When using -A/--Oauto, --Omax-ca-pipe 0 should be added to ensure that the exploration of the configurations is optimal.
Memory-pools
Introduction
Memory-pool descriptor files give the NPU compiler a view of the memory characteristics of the embedded system and of the mapping it is free to use.
When the NPU compiler can use two memory locations to fit, for example, an activation buffer, it decides which location is ‘the best’ based on the estimated data access time. For this allocation algorithm to perform well, it is therefore required to describe the memory as accurately as possible.
The example mpool files are written for the STM32N6 and contain descriptions for each internal memory cut, plus the external memories accessed through XSPI:
- The AXISRAM is slower than the NPU-RAMs. The NPU compiler prefers using the NPU-RAMs to allocate buffers and uses the AXISRAM when the NPU-RAMs are full.
- External memories are even slower than the AXISRAM. The NPU compiler allocates data in external memories as a last resort.
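The preference order above can be pictured with a toy cost model. This is an illustrative sketch only, not the compiler's actual allocation algorithm: it ranks pools using the relative throughput/latency labels and the freqRatio values from the example mpool file later in this article.

```python
# Toy illustration (NOT the compiler's real algorithm): rank memory pools by
# their relative speed attributes, mimicking the preference NPU-RAM > AXISRAM
# > external memory described above.
RANK = {"LOW": 0, "MID": 1, "HIGH": 2}

def access_cost(prop):
    """Lower is better: penalize low throughput, high latency, slow clocks."""
    return (RANK["HIGH"] - RANK[prop["throughput"]]) \
         + RANK[prop["latency"]] \
         + prop["freqRatio"]

pools = {
    "npuRAM3":   {"throughput": "HIGH", "latency": "LOW",  "freqRatio": 1.25},
    "cpuRAM2":   {"throughput": "MID",  "latency": "MID",  "freqRatio": 2.50},
    "octoFlash": {"throughput": "MID",  "latency": "HIGH", "freqRatio": 6.00},
}

preference = sorted(pools, key=lambda name: access_cost(pools[name]))
print(preference)  # ['npuRAM3', 'cpuRAM2', 'octoFlash']
```

The point is only that the relative attributes, not absolute numbers, drive the ordering: the NPU-RAM comes first and the external flash last.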
Additional characteristics can be specified for each memory region defined in the memory pool file:
- rights: Read/write access rights must be specified for each defined region. Read-only regions contain weights, while writable regions contain activations and, as a last resort, weights if the read-only pools are full.
- constants_preferred: A cue can be added for each region to advise the NPU compiler to preferentially use a given pool for storing constants (weights). This overrides the speed considerations addressed above.
- cacheable: Cacheability can be enabled for each region.
Finally, if applicable, the cache structure and characteristics should be defined in the mpool file; this gives the NPU compiler all the information it needs to optimize the inference time as much as possible.
Example
Below is an example of a memory-pool file used as a default configuration in the ST Edge AI tool. The syntax of the mpool file is strict JSON: no comments are allowed.
This example can be used as a base for variations, as it contains the mandatory section of such a file:
- memory: definition of the memory characteristics described above
- cacheinfo: cache characteristics
- mem_file_prefix: prefix for the memory-initializer files generated by the NPU compiler
- mempools: the memory-pools description
{
"memory": {
"cacheinfo": [
{
"nlines": 512,
"linesize": 64,
"associativity": 8,
"bypass_enable": 1,
"prop": { "rights": "ACC_WRITE", "throughput": "MID", "latency": "MID", "byteWidth": 8,
"freqRatio": 2.50, "read_power": 13.584, "write_power": 12.645 }
}
],
"mem_file_prefix": "atonbuf",
"mempools": [
{
"fname": "AXIFLEXMEM",
"name": "flexMEM",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "MID", "latency": "MID", "byteWidth": 8,
"freqRatio": 2.50, "read_power": 9.381, "write_power": 8.569 },
"offset": { "value": "0x34000000", "magnitude": "BYTES" },
"size": { "value": "0", "magnitude": "KBYTES" }
},
{
"fname": "AXISRAM1",
"name": "cpuRAM1",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "MID", "latency": "MID", "byteWidth": 8,
"freqRatio": 2.50, "read_power": 16.616, "write_power": 14.522 },
"offset": { "value": "0x34064000", "magnitude": "BYTES" },
"size": { "value": "0", "magnitude": "KBYTES" }
},
{
"fname": "AXISRAM2",
"name": "cpuRAM2",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "MID", "latency": "MID", "byteWidth": 8,
"freqRatio": 2.50, "read_power": 17.324, "write_power": 15.321 },
"offset": { "value": "0x34100000", "magnitude": "BYTES" },
"size": { "value": "1024", "magnitude": "KBYTES" }
},
{
"fname": "AXISRAM3",
"name": "npuRAM3",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
"freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
"offset": { "value": "0x34200000", "magnitude": "BYTES" },
"size": { "value": "448", "magnitude": "KBYTES" }
},
{
"fname": "AXISRAM4",
"name": "npuRAM4",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
"freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
"offset": { "value": "0x34270000", "magnitude": "BYTES" },
"size": { "value": "448", "magnitude": "KBYTES" }
},
{
"fname": "AXISRAM5",
"name": "npuRAM5",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
"freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
"offset": { "value": "0x342e0000", "magnitude": "BYTES" },
"size": { "value": "448", "magnitude": "KBYTES" }
},
{
"fname": "AXISRAM6",
"name": "npuRAM6",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
"freqRatio": 1.25, "read_power": 19.006, "write_power": 15.790 },
"offset": { "value": "0x34350000", "magnitude": "BYTES" },
"size": { "value": "448", "magnitude": "KBYTES" }
},
{
"fname": "xSPI1",
"name": "hyperRAM",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "MID", "latency": "HIGH", "byteWidth": 2,
"freqRatio": 5.00, "cacheable": "CACHEABLE_ON","read_power": 380, "write_power": 340.0,
"constants_preferred": "true" },
"offset": { "value": "0x90000000", "magnitude": "BYTES" },
"size": { "value": "32", "magnitude": "MBYTES" }
},
{
"fname": "xSPI2",
"name": "octoFlash",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_READ", "throughput": "MID", "latency": "HIGH", "byteWidth": 1,
"freqRatio": 6.00, "cacheable": "CACHEABLE_ON", "read_power": 110, "write_power": 400.0,
"constants_preferred": "true" },
"offset": { "value": "0x70000000", "magnitude": "BYTES" },
"size": { "value": "64", "magnitude": "MBYTES" }
}
]
}
}
Note
In the descriptions for “AXIFLEXMEM” and “AXISRAM1”, the sizes
for the memory regions mapped after 0x34000000
have
been set to 0 bytes. This is done to signal the Neural-ART compiler
not to use those memory pools when allocating
buffers. The reason for this is that, in the full example, the
embedded firmware is placed within this range (that is, the linker
script places everything in the range
0x34000000 - 0x34100000
)
Tip
It is mandatory that the memory ranges available for/used by the NPU compiler and the memory ranges available for the applicative firmware are disjoint. For example, if the NPU compiler uses a location where the firmware is stored to place activations, the firmware is overwritten during inference.
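A quick sanity check of this constraint can be scripted. The helper below is a generic sketch (not part of the ST tools); the pool names, offsets, and sizes are taken from the example mpool above, and zero-sized pools are skipped exactly as the Note describes.

```python
# Sketch: verify that the pools offered to the NPU compiler do not overlap the
# range used by the applicative firmware (names/ranges from the example mpool).
KB = 1024

pools = [  # (name, offset, size_bytes) -- size 0 disables the pool
    ("flexMEM", 0x34000000, 0),
    ("cpuRAM1", 0x34064000, 0),
    ("cpuRAM2", 0x34100000, 1024 * KB),
    ("npuRAM3", 0x34200000, 448 * KB),
]

firmware = (0x34000000, 0x34100000)  # linker script places everything here

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open ranges [start, end): touching ranges do not overlap
    return a_start < b_end and b_start < a_end

conflicts = [name for name, off, size in pools
             if size > 0 and overlaps(off, off + size, *firmware)]
print(conflicts)  # [] -- the zero-sized pools are ignored, cpuRAM2 starts at the boundary
```

With the example values, cpuRAM2 begins exactly where the firmware range ends, so no conflict is reported.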
Syntax description and main options
“mempools” key
This item contains a list that describes each of the memory regions.
Each memory region is described in an object. The keys of such objects are described below, with their main associated accepted values.
- fname (string): File name to use for the memory initializer (the final name is also prefixed by mem_file_prefix)
- name (string): Memory name (found in reports)
- fformat: File format used for the memory initializer. Values can be among:
  - FORMAT_RAW: raw binary format
  - FORMAT_HEX: same as raw, but easier for humans to read: hexadecimal representation of the contents. FORMAT_HEX16 and FORMAT_HEX32 can also be used.
  - FORMAT_IHEX: Intel-Hex file
- offset: Memory start address. The value is an object with the following keys/values:
  - value (string): start address, either in decimal format (for example, 805306368) or hexadecimal format (for example, 0x30000000)
  - magnitude (string): unit used to interpret the value field; one of BYTES, KBYTES, MBYTES.
- size: Memory size. The value is an object with the following keys/values:
  - value (string): size of the memory, either in decimal format (for example, 805306368) or hexadecimal format (for example, 0x30000000)
  - magnitude (string): unit used to interpret the value field; one of BYTES, KBYTES, MBYTES.
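The value/magnitude pair resolves to a byte count. A small helper sketch (the function name is illustrative, not part of the ST tools) shows the arithmetic, including the mixed decimal/hexadecimal value strings:

```python
# Helper sketch: convert an offset/size object from an mpool file into bytes.
MAGNITUDE = {"BYTES": 1, "KBYTES": 1024, "MBYTES": 1024 * 1024}

def to_bytes(field):
    # 'value' is a string and may be decimal ("448") or hexadecimal ("0x90000000");
    # int(..., 0) auto-detects the base from the prefix.
    return int(field["value"], 0) * MAGNITUDE[field["magnitude"]]

print(to_bytes({"value": "448", "magnitude": "KBYTES"}))              # 458752
print(hex(to_bytes({"value": "0x90000000", "magnitude": "BYTES"})))   # 0x90000000
```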
- mode: Defines whether the pool address is absolute or relative. For now, use USEMODE_ABSOLUTE; the relative mode is explained in other documents.
- prop: Defines multiple properties of the current pool, within an object. The keys/values available for this object are as follows:
  - rights: States whether the memory is read-only or read-write: ACC_READ, ACC_WRITE
  - throughput: Memory throughput description: LOW, MID, or HIGH. The description here is relative to the other pools (this is not an absolute value).
  - latency: Memory latency description: LOW, MID, or HIGH. The description here is relative to the other pools (this is not an absolute value).
  - byte_width (uint32): Memory bus access width
  - freq_ratio (float): Memory frequency ratio compared to the accelerators (that is, the NPU frequency); bigger values mean a slower frequency.
    - Absolute values are used for the estimations available in reports.
    - Relative ratios between pools are what matter for the NPU compiler's decisions (that is, if the reports are not used, there is no need to change the values when changing the NPU frequency).
  - cacheable: Is the pool cacheable? ON or OFF (default is OFF)
  - read_power (float): Read power for a single byte_width access in mW (power measured at nominal frequency and voltage); used for reports
  - write_power (float): Active write power for a single byte_width access in mW (power measured at nominal frequency and voltage); used for reports
  - constants_preferred (bool): Preferred memory pool for constant values (for example, weights). true or false (default: false)
“cacheinfo” key
This part describes the cache architecture.
The provided memory-pools files give a good representation of the cache used in STM32N6 (CACHEAXI), and should not be modified.
Virtual mem-pools considerations
The --enable-virtual-mem-pools option allows the compiler to merge contiguous memory pools into a single, virtual pool.
In the example above, the pools cpuRAM2 and npuRAM3-npuRAM6 are all contiguous (that is, the end of cpuRAM2 is the start of npuRAM3). When --enable-virtual-mem-pools is specified, the NPU compiler considers all those pools as a single one, and can thus, if needed, allocate buffers that span multiple “concrete” pools. This is, in a way, equivalent to creating only one large entry in the memory-pool file that merges all the first memory pools.
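The contiguity claim can be verified from the offsets and sizes in the example mpool. The snippet below is a generic check (not an ST tool) using those example values:

```python
# Sketch: check that the example pools are contiguous, that is, mergeable by
# --enable-virtual-mem-pools (offsets/sizes taken from the example mpool above).
KB = 1024
pools = [  # (name, offset, size_bytes), sorted by offset
    ("cpuRAM2", 0x34100000, 1024 * KB),
    ("npuRAM3", 0x34200000, 448 * KB),
    ("npuRAM4", 0x34270000, 448 * KB),
    ("npuRAM5", 0x342E0000, 448 * KB),
    ("npuRAM6", 0x34350000, 448 * KB),
]

# Each pool must end exactly where the next one starts.
contiguous = all(off + size == pools[i + 1][1]
                 for i, (_, off, size) in enumerate(pools[:-1]))
print(contiguous)  # True: the five pools can be fused into one virtual pool
```

For example, cpuRAM2 ends at 0x34100000 + 1024 KB = 0x34200000, which is exactly the npuRAM3 offset.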
Using --enable-virtual-mem-pools enables the user to conveniently use the same .mpool file to fit different models with different memory requirements.
This, however, comes at a cost: if the characteristics of the merged memories are very different, the resulting virtual pool inherits the worst characteristics among them, which may result in a suboptimal allocation of buffers. (When allocating buffers to the virtual memory pool, the characteristics of the underlying concrete pools are not considered.)
To overcome this issue, it might be useful to add holes (for example, 1 byte) between ranges of memory that have very different characteristics; this prevents the compiler from merging heterogeneous pools.
ST Edge AI core CLI as front end
The ST Edge AI Core CLI is required for using the NPU compiler. It facilitates the application of specific transformations and generates the intermediate representation (ONNX + JSON files). The '--st-neural-art' option allows specifying a compilation-profiles JSON file containing different sets of options, named profiles, that are directly passed to the compiler.
--st-neural-art STR
Argument syntax: empty or
<profile>[@<compilation-profiles-file-path>]
Set a profile from a compilation-profiles file. This option is mandatory to enable the passes that allow analyzing/generating a model.
Argument | Description
---|---
help | Print the NPU compiler help.
(empty) | Use a “default” profile.
profile-<tool_profile> | Use a built-in tool profile (see next section).
<profilename>@<compilation-profiles-file-path> | Use a custom profile.
Tool built-in profiles
The tool package provides a typical compilation profile file including different settings, which can be selected with the 'profile-' argument of the '--st-neural-art' option. It references the memory-pool descriptor files, allowing the major part of the internal memories and external memory devices to be dedicated to the NPU memory subsystem.
- Location of the compilation profile file: $STEDGEAI_CORE_DIR/Utilities/windows/targets/stm32/resources/neural_art.json
- Memory-pool descriptor file: $STEDGEAI_CORE_DIR/Utilities/windows/targets/stm32/resources/mpools/stm32n6.mpool
Profile name | Description
---|---
minimal | Mandatory options only. No optimization. All available memories can be used ('stm32n6.mpool' file)
default | Extends the minimal profile, enabling the optimization options to use the memory pools efficiently
allmems--O3 | Adds optimization level 3 for speed
allmems--O3-autosched | Adds the autosched option
allmems--auto | Enables the auto option
internal-memories-only--default | default profile, but only the internal memories are allowed ('stm32n6__internal_memories_only.mpool' file)
Note
The provided profiles are generic settings and should be customized according to the application constraints and the imported models.
Front-end options
The following options impact the generated files.
Option | Description
---|---
--name | Overwrites the name of the files generated by the NPU compiler (and the symbols within the .c file)
--no-inputs-allocation, --no-outputs-allocation | The input/output buffers are not allocated by the NPU compiler in the provided memory pools and must be allocated/“linked” at runtime by the embedded application.
--input-data-type, --output-data-type | Used to change the data type of the input/output tensors of the deployed model
--inputs-ch-position, --outputs-ch-position | Used to change the memory layout (NHWC vs NCHW) of the input/output tensors of the deployed model
Compilation Profiles JSON File
The following JSON snippet provides a typical compilation
profiles json file. This is not strict JSON format. The
comments are allowed and start with //
.
{
"Globals": {
// This section is required but can be empty
},
"Profiles": {
// Minimal options required for the generated file to work
"minimal" : {
"memory_pool": "./mpools/stm32n6.mpool",
"options": "--native-float --mvei --cache-maintenance --Ocache-opt"
},
// Advised minimal options
"default" : {
"memory_pool": "./mpools/stm32n6.mpool",
"options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os"
},
"test" : {
"memory_pool": "./mpools/stm32n6.mpool",
"options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --optimization 1"
},
"test2" : {
"memory_pool": "./my_memory_pools/stm32n6_test.mpool",
"options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools\
--optimization 1 --Oauto-sched --enable-epoch-controller"
}
}
}
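Because this file allows // comments while strict JSON parsers do not, a front end must strip the comments before parsing. A simplified sketch (it assumes '//' never appears inside a JSON string value, and the file content below is a shortened variant of the example):

```python
import json
import re

PROFILE_TEXT = '''
{
  "Globals": {
    // This section is required but can be empty
  },
  "Profiles": {
    "default": {
      "memory_pool": "./mpools/stm32n6.mpool",
      "options": "--native-float --mvei --cache-maintenance --Ocache-opt"
    }
  }
}
'''

# Strip '//' comments up to end-of-line, then parse as strict JSON.
stripped = re.sub(r"//[^\n]*", "", PROFILE_TEXT)
profiles = json.loads(stripped)["Profiles"]
print(sorted(profiles))                    # ['default']
print(profiles["default"]["memory_pool"])  # ./mpools/stm32n6.mpool
```

Note that json.loads would reject PROFILE_TEXT as-is; the comment stripping is what makes the file parseable, which is why the strict mpool files forbid comments entirely.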
This file has two sections: "Globals" and "Profiles".
- The "Globals" section is used for debug purposes. It is optional and can be removed from the profile file.
- The "Profiles" section allows defining multiple named profiles, selectable through the '--st-neural-art' option of the ST Edge AI CLI.
A profile is a combination of:
- A path to a memory-pool descriptor file (.mpool file).
- A list of options to be passed to the NPU compiler.
- Optional: A path to a machine description file (.mdesc file) - by default, the stm32n6 description file is used.
Tips, variations around the basic use case
Generic recommendations
- When using small-enough models, try to refrain from using external memories: any transaction with external memory is time-consuming!
- Remember that --enable-virtual-mem-pools, though convenient, may have a massive impact on inference time.
  - During optimization steps, try removing this option from the compiler’s parameters.
NPU compiler options
- Experiment with options: many options can be added and may (or may not) provide beneficial effects.
- After a successful --Oauto compilation for a model, note down the “best” options used and pass them as compiler arguments at the next compilation; this saves compilation time during development.
Memory-pools
Warning
Always remember to have disjoint “memory spaces” available for the NPU compiler (that will allocate buffers and constants of the network) and the firmware compiler (that will allocate space for program, variables, etc.) to prevent memory issues.
- Carving memory for the performance improvement of a model is a time-consuming activity.
  - Start with basic memory pools, for example with virtual memory pools enabled.
  - Check whether it seems possible to fit the whole model in internal memories.
  - If the firmware is small enough, it is possible to evolve from the example mpool by adding space before the AXISRAM2 pool (do not forget to remove this memory space from the linker script).
- When trying to optimize memory usage/allocation for speed, try to split the virtually-merged pools (by adding small holes between them).
  - Ultimately, try not to use virtual memory pools.
- For development purposes:
  - An external RAM can easily be used instead of an external flash (for example, by setting the size of the external flash to 0 and adding constants_preferred=true to the external RAM characteristics), if both types of memories are available.
  - This prevents wear of the flash, makes loading the weights onto the board quicker, and so on.
  - Be aware that this can make the inference faster (the external RAM on the Discovery kit board can use hexa-SPI vs. octo-SPI for the external flash).