ST Neural-ART compiler primer

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r1.1

Overview

This article describes the main options and required input files of the ST Neural-ART compiler (also referred to as the NPU compiler) for deploying AI/DL models on ST Neural-ART NPU-based devices.

Note that the NPU compiler is integrated as a specific back-end in the ST Edge AI Core CLI, which acts as a driver or front end. However, it is also possible to use the NPU compiler directly with the generated intermediate files (ONNX + JSON). Only these input files, which represent the original model, are supported: it is not possible to pass an ONNX QDQ or quantized TFLite file directly. When the compiler is invoked directly, the options are passed to it as command-line arguments.

The main outputs are:

  • a .c file implementing the inference of the network (see the -g, --c-output option),
  • the memory initializer files containing the weights and parameters (see the mem_file_prefix key of the memory-pool file).

The NPU compiler is configurable and offers numerous customizations to alter the way a given network is compiled. This article presents only the main relevant options. To see the complete list of options with short descriptions, use the --help option.

$ atonn --help

NPU compiler location

‘atonn’ designates the name of the executable.

%STEDGEAI_CORE_DIR%\Utilities\<host_os>\atonn

%STEDGEAI_CORE_DIR% indicates the location where the ST Edge AI Core components are installed.

Options

For an STM32N6 device embedding an ST Neural-ART NPU, the minimal options needed for proper operation are the following:

$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
        --cache-maintenance --Ocache-opt --mvei --native-float --load-mdesc stm32n6

Recommended default options:

$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
        --cache-maintenance --Ocache-opt --mvei --native-float --enable-virtual-mem-pools --load-mdesc stm32n6
        --enable-epoch-controller -Os

Mandatory options

Model/Platform configuration options

Option Description
-i, --onnx-input Input ONNX file path
--json-quant-file Input JSON file path (Quantization information)
-g, --c-output Output .c file path
-t, --load-mdesc, --target Specifies the target machine to use, machine description (.mdesc) file path
--load-mpool Specifies the memory pool descriptors to use, memory pool (.mpool) file path
  • The NPU supports only quantized models that have been “translated” into the ONNX+JSON format by the ST Edge AI Core CLI.
  • To enable the code generation of the compiler, the '-g' option is mandatory; it sets the name of the .c output file (see the sketch after this list).
  • The NPU compiler is platform-agnostic. The --load-mdesc/--target option provides the configuration of the implemented Neural-ART accelerator™ (also called the machine descriptor file; the ‘stm32n6.mdesc’ file should not be modified).
  • The --load-mpool option provides the description of the memory pools that can be used to implement the provided model. This file is user-specific.
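
For illustration, the minimal sketch below (the output file name is a placeholder) adds the mandatory -g option to produce the generated .c file:

$ atonn -i my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json -g my_model.c
        --load-mdesc stm32n6 --load-mpool my_mpool.mpool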

Specific STM32N6 device options

An additional set of mandatory options is required when targeting the STM32N6:

Option Description
--native-float Consider all floats without quantization info as native floats
--mvei Enable the generation of scratch buffers for the Arm® Cortex®-M55 core with MVEI extension
--cache-maintenance Enable generation of cache maintenance code for both MCU & NPU caches
--Ocache-opt Enables usage of the NPU cache path for memory pools that have the ‘cacheable’ attribute set in the memory-pool description file; requires the ‘--cache-maintenance’ option to be passed as well. When requested, the NPU compiler emits instructions that make use of the NPU cache path for the streaming engine configurations.
  • The --native-float option is mandatory to bypass the check that everything in the network given as argument is quantized. Leaving parts of the network unquantized results in lengthy floating-point operations, but it does not prevent the compiler from completing its work.
  • The --mvei option is needed to use the optimized AI network runtime library for the Arm® Cortex®-M55 core with MVEI extension.

Warning

For performance reasons, it is recommended to enable the NPU cache (using the --Ocache-opt option) for the NPU memory regions stored in external memories (flash/RAM). The memory regions shared between the NPU and MCU domains, used for the input/output buffers and by the software epochs, must be cacheable MCU regions. To guarantee memory coherence during inference, the --cache-maintenance option is mandatory. However, it is the application’s responsibility to call the cache maintenance functions when filling or using the input/output or activation memory regions before performing the inference.

Useful options

Option Category Description
--out-dir-prefix global The output prefix used to specify the output directory for generated files.
--network-name global Changes the generated network c-name (default: Default). This results in different symbol names in the .c file.
--options-file global Export the full command line to a .ini file for future import (see --load-options). This option is useful to reproduce a compilation step.
--load-options global Import an exported .ini file
--onnx-output extra outputs Output ONNX filename. The NPU compiler performs manipulations/optimizations on the input ONNX file; the resulting ONNX file is the one that is actually transformed into a .c file. This intermediate ONNX file can be exported using this option.
--save-mdesc-file extra outputs Dumps machine description in use to file
--save-mpool-file extra outputs Dumps memory pools in use to file
--dot-file extra outputs Debug purpose. Dumps a .dot file showing the internal behavior of the generated program: the operations done in the .c file (details of the epochs) are exported as a graphical view.
--csv-file extra outputs Dumps a .csv report showing details of the memory transfers and estimations of the inference time and power consumption per epoch. (The values in the .csv file are mainly estimations and may diverge from real-life results.)
-A,--Oauto optimization Enables iterative automatic optimization options selection for all options below (increases compilation time)
--Omax-ca-pipe n optimization Where n is an integer >=0. Sets the maximum number of CONV_ACC units used when pipelining (default is 0, meaning unlimited; the maximum is defined in the machine descriptor (.mdesc) file)
--Oalt-sched optimization Enables alternate scheduler algorithm
--Oconv-split-cw optimization Enables optimization to split convolutions channel-wise
--Oconv-split-kw optimization Enables optimization to split convolutions kernel-wise
--Oconv-split-stripe optimization Enables optimization to split convolutions stripe-wise for 1xN kernels
-O,--optimization optimization Optimization level, one of [0-3] (speed)
  • Each optimization can be enabled one by one. Using -A,--Oauto explores all possible combinations and assesses which one seems to provide the “best”, that is, fastest results (see the sketch after this list).
  • Adding --d-auto n (with 0<n<3) provides debug information about the automatic optimizations done by -A,--Oauto.
  • When using -A,--Oauto, --Omax-ca-pipe 0 should be added to ensure that the exploration of the configurations is optimal.
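
For example, the sketch below (file names are placeholders) runs the automatic exploration and exports the full command line to a .ini file, so that the exact compilation step can be reproduced later with --load-options:

$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
        --cache-maintenance --Ocache-opt --mvei --native-float --load-mdesc stm32n6
        -A --Omax-ca-pipe 0 --d-auto 2 --options-file my_model_oauto.ini

$ atonn --load-options my_model_oauto.ini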

Memory-pools

Introduction

Memory-pool descriptor files give the NPU compiler a view of the characteristics and the mapping of the memories that it is free to use on the embedded system.

When the NPU compiler can use two memory locations to fit, for example, an activation buffer, it decides which location is ‘the best’ based on the estimated data access time. For this allocation algorithm to perform well, it is therefore required to describe the memories as accurately as possible.
The example mpool files are written for the STM32N6 and contain descriptions of each internal memory cut plus the external memories accessed through XSPI:

  • The AXISRAMs are slower than the NPU RAMs. The NPU compiler prefers using the NPU RAMs to allocate buffers and falls back to the AXISRAM when the NPU RAMs are full.
  • External memories are even slower than the AXISRAM. The NPU compiler allocates data in external memories only as a last resort.

Additional characteristics can be specified for each memory region defined in the memory pool file:

  • rights: Read/write access rights are required for each defined region. Read-only regions contain weights, while writable regions contain activations and, as a last resort, weights if the read-only pools are full.
  • constants_preferred: A cue can be added for each region to advise the NPU compiler to preferentially use a given pool for storing constants (weights). This overrides the speed considerations addressed above.
  • cacheable: Cacheability can be enabled for each region.

Finally, if applicable, the cache structure and characteristics should be defined in the mpool file; this gives the NPU compiler all the information it needs to optimize the inference time as much as possible.

Example

Below is an example of a memory-pool file used as a default configuration by the ST Edge AI tool. The syntax of the mpool file is strict JSON: no comments are allowed.

This example can be used as a base for variations, as it contains the mandatory section of such a file:

  • memory: Definition of the memory characteristics described above
    • cacheinfo: cache characteristics
    • mem_file_prefix: prefix for the memory initializer files generated by the NPU compiler.
    • mempools: the memory-pools description
{
    "memory": {
        "cacheinfo": [
            {
                "nlines": 512,
                "linesize": 64,
                "associativity": 8,
                "bypass_enable": 1,
                "prop": { "rights": "ACC_WRITE",  "throughput": "MID",   "latency": "MID", "byteWidth": 8,
                          "freqRatio": 2.50, "read_power": 13.584, "write_power": 12.645 }
            }
        ],
        "mem_file_prefix": "atonbuf",
        "mempools": [
            {
                "fname": "AXIFLEXMEM",
                "name":  "flexMEM",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID",  "latency": "MID", "byteWidth": 8,
                            "freqRatio": 2.50, "read_power": 9.381,  "write_power": 8.569 },
                "offset": { "value": "0x34000000", "magnitude":  "BYTES" },
                "size":   { "value": "0",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM1",
                "name":  "cpuRAM1",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID",  "latency": "MID", "byteWidth": 8,
                            "freqRatio": 2.50, "read_power": 16.616, "write_power": 14.522 },
                "offset": { "value": "0x34064000", "magnitude":  "BYTES" },
                "size":   { "value": "0",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM2",
                "name":  "cpuRAM2",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID",  "latency": "MID", "byteWidth": 8,
                            "freqRatio": 2.50, "read_power": 17.324, "write_power": 15.321 },
                "offset": { "value": "0x34100000", "magnitude":  "BYTES" },
                "size":   { "value": "1024",       "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM3",
                "name":  "npuRAM3",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
                "offset": { "value": "0x34200000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM4",
                "name":  "npuRAM4",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
                "offset": { "value": "0x34270000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM5",
                "name":  "npuRAM5",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
                "offset": { "value": "0x342e0000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM6",
                "name":  "npuRAM6",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 19.006, "write_power": 15.790 },
                "offset": { "value": "0x34350000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "xSPI1",
                "name":  "hyperRAM",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID", "latency": "HIGH", "byteWidth": 2,
                            "freqRatio": 5.00, "cacheable": "CACHEABLE_ON","read_power": 380, "write_power": 340.0,
                            "constants_preferred": "true" },
                "offset": { "value": "0x90000000", "magnitude":  "BYTES" },
                "size":   { "value": "32",         "magnitude": "MBYTES" }
            },
            {
                "fname": "xSPI2",
                "name":  "octoFlash",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_READ",  "throughput": "MID", "latency": "HIGH", "byteWidth": 1,
                            "freqRatio": 6.00, "cacheable": "CACHEABLE_ON", "read_power": 110, "write_power": 400.0,
                            "constants_preferred": "true" },
                "offset": { "value": "0x70000000", "magnitude":  "BYTES" },
                "size":   { "value": "64",         "magnitude": "MBYTES" }
            }
        ]
    }
}

Note

In the descriptions for “AXIFLEXMEM” and “AXISRAM1”, the sizes of the memory regions mapped from 0x34000000 have been set to 0 bytes. This signals the Neural-ART compiler not to use those memory pools when allocating buffers. The reason is that, in the full example, the embedded firmware is placed within this range (that is, the linker script places everything in the range 0x34000000 - 0x34100000).

Tip

It is mandatory that the memory ranges available for/used by the NPU compiler and the memory ranges available for the application firmware are disjoint. For example, if the firmware is stored somewhere and the NPU uses this location for storing activations, the firmware is overwritten during inference.

Syntax description and main options

“mempools” key

This item contains a list that describes each of the memory regions.

Each memory region is described in an object. The keys of such objects are described below, with their main associated accepted values.

  • fname (string) File name to use for the memory initializer (the final name is also prefixed by mem_file_prefix)
  • name (string) Memory Name (found in reports)
  • fformat File format used for the memory initializer. Values can be among:
    • FORMAT_RAW Raw binary format
    • FORMAT_HEX Same as raw, but easier to read for humans: hexadecimal data representation of the contents. FORMAT_HEX16, FORMAT_HEX32 can also be used.
    • FORMAT_IHEX Intel-Hex file
  • offset Memory start address. The value is an object with the following key/values
    • value (string) Start address value. Either in decimal format (for example, 805306368) or hexadecimal format (for example, 0x30000000)
    • magnitude (string) Unit to interpret the value field. Can be a value among BYTES, KBYTES, MBYTES.
  • size Memory size. The value is an object with the following key/values
    • value (string) Size of the memory. Either in decimal format (for example, 805306368) or hexadecimal format (for example, 0x30000000)
    • magnitude (string) Unit to interpret the value field. Can be a value among BYTES, KBYTES, MBYTES.
  • mode Defines whether the pool address is absolute or relative. For now, use USEMODE_ABSOLUTE; the relative mode is explained in other documents.
  • prop Defines multiple properties of the current pool, within an object. The available keys/values of this object are as follows:
    • rights States if the memory is read-only or read-write: ACC_READ, ACC_WRITE
    • throughput Memory throughput description: LOW, MID or HIGH. The description here is relative to other pools (this is not an absolute value).
    • latency Memory latency description: LOW, MID or HIGH. The description here is relative to other pools (this is not an absolute value).
    • byteWidth (uint32) Memory bus access width
    • freqRatio (float) Memory frequency ratio compared to the accelerators (that is, the NPU frequency) – larger values mean a slower memory frequency.
      • Absolute values are used for the estimations available in the reports
      • Relative ratios between pools are important for the NPU compiler to make decisions (that is, if the reports are not used, there is no need to change the values when the NPU frequency changes)
    • cacheable Is the pool cacheable? CACHEABLE_ON or CACHEABLE_OFF (default is CACHEABLE_OFF)
    • read_power (float) Read power for a single byte_width access in mW (power measured at nominal frequency and voltage) – used for reports
    • write_power (float) Active write power for a single byte_width access in mW (power measured at nominal frequency and voltage) – used for reports
    • constants_preferred (bool) Preferred memory pool for constant values (for example, weights). true or false (default: false)
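
For reference, a minimal pool entry combining the keys above could look as follows (a sketch only: the address, size, name, and power figures are illustrative placeholders):

{
    "fname": "myRAM",
    "name":  "myRAM",
    "fformat": "FORMAT_RAW",
    "mode":   "USEMODE_ABSOLUTE",
    "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                "freqRatio": 1.25, "read_power": 18.5, "write_power": 16.2 },
    "offset": { "value": "0x34200000", "magnitude": "BYTES" },
    "size":   { "value": "448", "magnitude": "KBYTES" }
}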

“cacheinfo” key

This part describes the cache architecture.

The provided memory-pools files give a good representation of the cache used in STM32N6 (CACHEAXI), and should not be modified.

Virtual mem-pools considerations

The --enable-virtual-mem-pools option allows the compiler to merge contiguous memory pools into a single, virtual pool.

In the example above, the pools cpuRAM2 and npuRAM3-npuRAM6 are all contiguous (that is, the end of cpuRAM2 is the start of npuRAM3). When --enable-virtual-mem-pools is specified, the NPU compiler considers all those pools as a single one, and can thus, if needed, allocate buffers that span multiple “concrete” pools.
This is, in a way, equivalent to creating only one large entry in the memory-pool file that merges all these memory pools.

Using --enable-virtual-mem-pools enables the user to conveniently use the same .mpool file to fit different models with different memory requirements.
This convenience, however, comes at a cost: if the merged memories have very different characteristics, the resulting virtual pool inherits the worst of them, which may result in a suboptimal allocation of buffers. (When allocating buffers in the virtual memory pool, the characteristics of the underlying concrete pools are not considered.)

To overcome this issue, it might be useful to add holes (for example, 1 byte) between ranges of memory that have very different characteristics; this prevents the compiler from merging heterogeneous pools, as sketched below.
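
For instance, based on the example file above, shrinking cpuRAM2 by one byte leaves a 1-byte hole before npuRAM3 (which stays unchanged at 0x34200000), so the two pools are no longer merged. The prop key is omitted here for brevity:

{
    "fname": "AXISRAM2",
    "name":  "cpuRAM2",
    "fformat": "FORMAT_RAW",
    "offset": { "value": "0x34100000", "magnitude": "BYTES" },
    "size":   { "value": "0xFFFFF",    "magnitude": "BYTES" }
}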

ST Edge AI Core CLI as front end

The ST Edge AI Core CLI is required to use the NPU compiler. It applies the specific transformations and generates the intermediate representation (ONNX + JSON files). The ‘--st-neural-art’ option allows specifying a compilation-profiles JSON file containing different sets of options, named profiles, that are passed directly to the compiler.

NPU compiler as back-end

--st-neural-art STR

Argument syntax: empty or <profile>[@<compilation-profiles-file-path>]

Set a profile from a compilation-profiles file. This option is mandatory to enable the passes that analyze/generate a model for the NPU.

Argument Description
help Print the NPU compiler help.
(empty) Use the “default” profile.
profile-<tool_profile> Use a built-in tool profile (see next section).
<profilename>@<compilation-profiles-file-path> Use a custom profile.
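
For example, assuming that 'stedgeai' designates the ST Edge AI Core CLI executable and that my_profiles.json is a placeholder for a user compilation-profiles file, the default and a custom profile can be selected as follows:

$ stedgeai generate --model my_model.onnx --target stm32n6 --st-neural-art
$ stedgeai generate --model my_model.onnx --target stm32n6 --st-neural-art test2@./my_profiles.json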

Tool built-in profiles

The tool package provides a typical compilation-profiles file including different settings that can be selected with the 'profile-<tool_profile>' argument of the '--st-neural-art' option. It references memory-pool descriptor files that dedicate the major part of the internal memories and the external memory devices to the NPU memory subsystem.

  • Location of the compilation profile file: $STEDGEAI_CORE_DIR/Utilities/windows/targets/stm32/resources/neural_art.json.
  • Memory-pool descriptor file: $STEDGEAI_CORE_DIR/Utilities/windows/targets/stm32/resources/mpools/stm32n6.mpool.
Profile name Description
minimal Mandatory options only. No optimization. All available memories can be used ('stm32n6.mpool' file)
default Extends the minimal profile, enabling the optimization options to use the memory pools efficiently
allmems--O3 Adds optimization level 3 for speed
allmems--O3-autosched Adds the autosched option
allmems--auto Enables the auto option
internal-memories-only--default Default profile, but only the internal memories are allowed ('stm32n6__internal_memories_only.mpool' file)

Note

The provided profiles are generic settings and should be customized according to the application constraints and the imported models.

Front-end options

The following options impact the generated files.

Option Description
--name Overwrites the name of the files generated by the NPU compiler (and the symbols within the .c file)
--no-inputs-allocation, --no-outputs-allocation The input/output buffers are not allocated by the NPU compiler in the provided memory pools; they must be allocated/“linked” at runtime by the embedded application
--input-data-type, --output-data-type Used to change the data type of the input/output tensors of the deployed model
--inputs-ch-position, --outputs-ch-position Used to change the memory layout (NHWC vs. NCHW) of the input/output tensors of the deployed model
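
As an illustration, a generation run forcing integer inputs and application-managed I/O buffers could look as follows (a sketch only: the 'uint8' value is an assumption to be checked against the CLI documentation):

$ stedgeai generate --model my_model.onnx --target stm32n6 --st-neural-art default
           --input-data-type uint8 --no-inputs-allocation --no-outputs-allocation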

Compilation Profiles JSON File

The following JSON snippet shows a typical compilation-profiles file. This file is not in strict JSON format: comments are allowed and start with //.

{
    "Globals": {
    // This section is required but can be empty
    },
    "Profiles": {
        // Minimal options required for the generated file to work
        "minimal" : {
            "memory_pool": "./mpools/stm32n6.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt"
        },
        // Advised minimal options
        "default" : {
            "memory_pool": "./mpools/stm32n6.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os"
        },
        "test" : {
            "memory_pool": "./mpools/stm32n6.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --optimization 1"
        },
        "test2" : {
            "memory_pool": "./my_memory_pools/stm32n6_test.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools\
                        --optimization 1 --Oauto-sched --enable-epoch-controller"
        }
    }
}

This file has two sections: "Globals" and "Profiles".

  • The "Globals" section is used for debug purposes. It is optional, and can be removed from the profile file.
  • The "Profiles" section allows defining multiple named profiles that are selectable through the '--st-neural-art' option of the ST Edge AI CLI.

A profile is a combination of:

  • A path to a memory-pool descriptor file (.mpool file).
  • A list of options to be passed to the NPU compiler.
  • Optional: A path to a machine description file (.mdesc file) - by default, the stm32n6 description file is used.

Tips, variations around the basic use case

Generic recommendations

  • When using small-enough models, refrain from using the external memories: any transaction with an external memory is time-consuming!
  • Remember that --enable-virtual-mem-pools, though convenient, may have massive disruptive effects on the inference time.
    • During the optimization steps, try to remove this option from the compiler’s parameters.

NPU compiler options

  • Experiment with the options: many options can be added and may (or may not) provide beneficial effects.
  • After a successful --Oauto compilation of a model, note down the “best” options used and pass them as compiler arguments for the next compilations; this saves compilation time during development.

Memory-pools

Warning

Always remember to keep disjoint “memory spaces” available for the NPU compiler (which allocates the buffers and constants of the network) and for the firmware compiler (which allocates space for the program, variables, and so on) to prevent memory issues.

  • Carving up memory to improve the performance of a model is a time-consuming activity.
    • Start with basic memory pools, for example with virtual memory pools enabled.
    • Check whether it seems possible to fit the whole model in the internal memories.
      • If the firmware is small enough, it is possible to evolve from the example mpool by adding space before the AXISRAM2 pool (do not forget to remove this memory space from the linker script).
    • When trying to optimize the memory usage/allocation for speed, try to split the virtually merged pools (by adding small holes between them).
      • Ultimately, try not to use the virtual memory pools.
  • For development purposes:
    • An external RAM can easily be used instead of an external flash (for example, by setting the size of the external flash to 0 and adding constants_preferred=true to the external RAM characteristics) if both types of memories are available.
    • This prevents wear of the flash, makes loading the weights onto the board quicker, and so on.
      • Be aware that this can also make the inference faster (the external RAM on the Discovery kit board can use hexa-SPI vs. octo-SPI for the external flash).
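
As a sketch based on the example mpool above (where hyperRAM already has "constants_preferred": "true"), disabling the external flash amounts to setting its size to 0:

{
    "fname": "xSPI2",
    "name":  "octoFlash",
    "fformat": "FORMAT_RAW",
    "prop":   { "rights": "ACC_READ",  "throughput": "MID", "latency": "HIGH", "byteWidth": 1,
                "freqRatio": 6.00, "cacheable": "CACHEABLE_ON", "read_power": 110, "write_power": 400.0,
                "constants_preferred": "true" },
    "offset": { "value": "0x70000000", "magnitude":  "BYTES" },
    "size":   { "value": "0",          "magnitude": "MBYTES" }
}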