ST Neural-ART compiler primer

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r1.1

Overview

This article describes the main options and required input files of the ST Neural-ART compiler (also referred to as the NPU compiler) for deploying AI/DL models on ST Neural-ART NPU-based devices.

Note that the NPU compiler is integrated as a specific back-end in the ST Edge AI Core CLI, which acts as a driver or front end. However, it is also possible to use the NPU compiler directly with the generated intermediate files (ONNX + JSON). Only these input files, which represent the original model, are supported: it is not possible to pass an ONNX QDQ or quantized TFLite file directly. When the compiler is invoked directly, the options are passed to it as command-line arguments.

The main outputs are:

  • a .c file implementing the inference of the network (see the -g, --c-output option),
  • the memory initializer files containing the weights and parameters (see the mem_file_prefix key of the memory-pool file).

The NPU compiler is configurable and offers numerous customizations to alter the way a given network is compiled. This article presents only the main relevant options. To see the complete list of options with short descriptions, use the --help option.

$ atonn --help

NPU compiler location

‘atonn’ designates the name of the executable.

%STEDGEAI_CORE_DIR%\Utilities\<host_os>\atonn

%STEDGEAI_CORE_DIR% indicates the location where the ST Edge AI Core components are installed.

Options

For an STM32N6 device embedding an ST Neural-ART NPU, the minimal options needed for proper operation are the following:

$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
        --cache-maintenance --Ocache-opt --mvei --native-float --load-mdesc stm32n6

Recommended default options:

$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
        --cache-maintenance --Ocache-opt --mvei --native-float --enable-virtual-mem-pools --load-mdesc stm32n6
        --enable-epoch-controller -Os

Mandatory options

Model/Platform configuration options

Option Description
-i, --onnx-input Input ONNX file path
--json-quant-file Input JSON file path (Quantization information)
-g, --c-output Output .c file path
-t, --load-mdesc, --target Specifies the target machine to use, machine description (.mdesc) file path
--load-mpool Specifies the memory pool descriptors to use, memory pool (.mpool) file path
  • The NPU supports only quantized models that have been “translated” into the ONNX+JSON format by the ST Edge AI Core CLI.
  • To enable the code generation of the compiler, the '-g' option is mandatory; it sets the name of the .c output file (see the sketch after this list).
  • The NPU compiler is platform-agnostic. The --load-mdesc/--target option provides the configuration of the implemented Neural-ART accelerator™ (also called the machine descriptor file; the ‘stm32n6.mdesc’ file should not be modified).
  • The --load-mpool option provides the description of the memory pools that can be used to implement the provided model. This file is user-specific.
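
For illustration, the minimal sketch below (the output file name is a placeholder) adds the mandatory -g option to produce the generated .c file:

$ atonn -i my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json -g my_model.c
        --load-mdesc stm32n6 --load-mpool my_mpool.mpool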

Specific STM32N6 device options

An additional set of mandatory options is required when targeting the STM32N6:

Option Description
--native-float Consider all floats without quantization info as native floats
--mvei Enable the generation of scratch buffers for the Arm® Cortex®-M55 core with MVEI extension
--cache-maintenance Enable generation of cache maintenance code for both MCU & NPU caches
--Ocache-opt Enables usage of the NPU cache path for memory pools that have the ‘cacheable’ attribute set in the memory-pool description file; requires the ‘--cache-maintenance’ option to be passed as well. When requested, the NPU compiler emits instructions that make use of the NPU cache path for the streaming engine configurations.
  • The --native-float option is mandatory to bypass the check that everything in the network given as argument is quantized. Leaving parts of the network unquantized results in lengthy floating-point operations, but it does not prevent the compiler from completing its work.
  • The --mvei option is needed to use the optimized AI network runtime library for the Arm® Cortex®-M55 core with MVEI extension.

Warning

For performance reasons, it is recommended to enable the NPU cache (using the --Ocache-opt option) for the NPU memory regions stored in external memories (flash/RAM). The memory regions shared between the NPU and MCU domains, used for the input/output buffers and by the software epochs, must be cacheable MCU regions. To guarantee memory coherence during inference, the --cache-maintenance option is mandatory. However, it is the application’s responsibility to call the cache maintenance functions when filling or using the input/output or activation memory regions before performing the inference.

Useful options

Option Category Description
--out-dir-prefix global The output prefix used to specify the output directory for generated files.
--network-name global Changes the generated network c-name (default: Default). This results in different symbol names in the .c file.
--options-file global Export the full command line to a .ini file for future import (see --load-options). This option is useful to reproduce a compilation step.
--load-options global Import an exported .ini file
--onnx-output extra outputs Output ONNX filename. The NPU compiler performs manipulations/optimizations on the input ONNX file; the resulting ONNX file is the one that is actually transformed into a .c file. This intermediate ONNX file can be exported using this option.
--save-mdesc-file extra outputs Dumps machine description in use to file
--save-mpool-file extra outputs Dumps memory pools in use to file
--dot-file extra outputs Debug purpose. Dumps a .dot file showing the internal behavior of the generated program: the operations done in the .c file (details of the epochs) are exported as a graphical view.
--csv-file extra outputs Dumps a .csv report showing details of the memory transfers and estimations of the inference time and power consumption per epoch. (The values in the .csv file are mainly estimations and may diverge from real-life results.)
-A,--Oauto optimization Enables iterative automatic optimization options selection for all options below (increases compilation time)
--Omax-ca-pipe n optimization Where n is an integer >=0. Sets the maximum number of CONV_ACC units used when pipelining (default is 0, meaning unlimited; the maximum is defined in the machine descriptor (.mdesc) file)
--Oalt-sched optimization Enables alternate scheduler algorithm
--Oconv-split-cw optimization Enables optimization to split convolutions channel-wise
--Oconv-split-kw optimization Enables optimization to split convolutions kernel-wise
--Oconv-split-stripe optimization Enables optimization to split convolutions stripe-wise for 1xN kernels
-O,--optimization optimization Optimization level, one of [0-3] (speed)
  • Each optimization can be enabled one by one. Using -A,--Oauto explores all possible combinations and assesses which one seems to provide the “best”, that is, fastest results (see the sketch after this list).
  • Adding --d-auto n (with 0<n<3) provides debug information about the automatic optimizations done by -A,--Oauto.
  • When using -A,--Oauto, --Omax-ca-pipe 0 should be added to ensure that the exploration of the configurations is optimal.
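
For example, the sketch below (file names are placeholders) runs the automatic exploration and exports the full command line to a .ini file, so that the exact compilation step can be reproduced later with --load-options:

$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_3_1_0.onnx --json-quant-file my_model_OE_3_1_0_Q.json
        --cache-maintenance --Ocache-opt --mvei --native-float --load-mdesc stm32n6
        -A --Omax-ca-pipe 0 --d-auto 2 --options-file my_model_oauto.ini

$ atonn --load-options my_model_oauto.ini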

Memory-pools

Introduction

Memory-pool descriptor files give the NPU compiler a view of the characteristics and the mapping of the memories that it is free to use on the embedded system.

When the NPU compiler can use two memory locations to fit, for example, an activation buffer, it decides which location is ‘the best’ based on the estimated data access time. For this allocation algorithm to perform well, it is therefore required to describe the memories as accurately as possible.
The example mpool files are written for the STM32N6 and contain descriptions of each internal memory cut plus the external memories accessed through XSPI:

  • The AXISRAMs are slower than the NPU RAMs. The NPU compiler prefers using the NPU RAMs to allocate buffers and falls back to the AXISRAM when the NPU RAMs are full.
  • External memories are even slower than the AXISRAM. The NPU compiler allocates data in external memories only as a last resort.

Additional characteristics can be specified for each memory region defined in the memory pool file:

  • rights: Read/write access rights are required for each defined region. Read-only regions contain weights, while writable regions contain activations and, as a last resort, weights if the read-only pools are full.
  • constants_preferred: A cue can be added for each region to advise the NPU compiler to preferentially use a given pool for storing constants (weights). This overrides the speed considerations addressed above.
  • cacheable: Cacheability can be enabled for each region.

Finally, if applicable, the cache structure and characteristics should be defined in the mpool file; this gives the NPU compiler all the information it needs to optimize the inference time as much as possible.

Example

Below is an example of a memory-pool file used as a default configuration by the ST Edge AI tool. The syntax of the mpool file is strict JSON: no comments are allowed.

This example can be used as a base for variations, as it contains the mandatory section of such a file:

  • memory: Definition of the memory characteristics described above
    • cacheinfo: cache characteristics
    • mem_file_prefix: prefix for the memory initializer files generated by the NPU compiler.
    • mempools: the memory-pools description
{
    "memory": {
        "cacheinfo": [
            {
                "nlines": 512,
                "linesize": 64,
                "associativity": 8,
                "bypass_enable": 1,
                "prop": { "rights": "ACC_WRITE",  "throughput": "MID",   "latency": "MID", "byteWidth": 8,
                          "freqRatio": 2.50, "read_power": 13.584, "write_power": 12.645 }
            }
        ],
        "mem_file_prefix": "atonbuf",
        "mempools": [
            {
                "fname": "AXIFLEXMEM",
                "name":  "flexMEM",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID",  "latency": "MID", "byteWidth": 8,
                            "freqRatio": 2.50, "read_power": 9.381,  "write_power": 8.569 },
                "offset": { "value": "0x34000000", "magnitude":  "BYTES" },
                "size":   { "value": "0",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM1",
                "name":  "cpuRAM1",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID",  "latency": "MID", "byteWidth": 8,
                            "freqRatio": 2.50, "read_power": 16.616, "write_power": 14.522 },
                "offset": { "value": "0x34064000", "magnitude":  "BYTES" },
                "size":   { "value": "0",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM2",
                "name":  "cpuRAM2",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID",  "latency": "MID", "byteWidth": 8,
                            "freqRatio": 2.50, "read_power": 17.324, "write_power": 15.321 },
                "offset": { "value": "0x34100000", "magnitude":  "BYTES" },
                "size":   { "value": "1024",       "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM3",
                "name":  "npuRAM3",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
                "offset": { "value": "0x34200000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM4",
                "name":  "npuRAM4",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
                "offset": { "value": "0x34270000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM5",
                "name":  "npuRAM5",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
                "offset": { "value": "0x342e0000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "AXISRAM6",
                "name":  "npuRAM6",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                            "freqRatio": 1.25, "read_power": 19.006, "write_power": 15.790 },
                "offset": { "value": "0x34350000", "magnitude":  "BYTES" },
                "size":   { "value": "448",        "magnitude": "KBYTES" }
            },
            {
                "fname": "xSPI1",
                "name":  "hyperRAM",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_WRITE", "throughput": "MID", "latency": "HIGH", "byteWidth": 2,
                            "freqRatio": 5.00, "cacheable": "CACHEABLE_ON","read_power": 380, "write_power": 340.0,
                            "constants_preferred": "true" },
                "offset": { "value": "0x90000000", "magnitude":  "BYTES" },
                "size":   { "value": "32",         "magnitude": "MBYTES" }
            },
            {
                "fname": "xSPI2",
                "name":  "octoFlash",
                "fformat": "FORMAT_RAW",
                "prop":   { "rights": "ACC_READ",  "throughput": "MID", "latency": "HIGH", "byteWidth": 1,
                            "freqRatio": 6.00, "cacheable": "CACHEABLE_ON", "read_power": 110, "write_power": 400.0,
                            "constants_preferred": "true" },
                "offset": { "value": "0x70000000", "magnitude":  "BYTES" },
                "size":   { "value": "64",         "magnitude": "MBYTES" }
            }
        ]
    }
}

Note

In the descriptions for “AXIFLEXMEM” and “AXISRAM1”, the sizes of the memory regions mapped from 0x34000000 have been set to 0 bytes. This signals the Neural-ART compiler not to use those memory pools when allocating buffers. The reason is that, in the full example, the embedded firmware is placed within this range (that is, the linker script places everything in the range 0x34000000 - 0x34100000).

Tip

It is mandatory that the memory ranges available for/used by the NPU compiler and the memory ranges available for the application firmware are disjoint. For example, if the firmware is stored somewhere and the NPU uses this location for storing activations, the firmware is overwritten during inference.

Syntax description and main options

“mempools” key

This item contains a list that describes each of the memory regions.

Each memory region is described in an object. The keys of such objects are described below, with their main associated accepted values.

  • fname (string) File name to use for the memory initializer (the final name is also prefixed by mem_file_prefix)
  • name (string) Memory Name (found in reports)
  • fformat File format used for the memory initializer. Values can be among:
    • FORMAT_RAW Raw binary format
    • FORMAT_HEX Same as raw, but easier to read for humans: hexadecimal data representation of the contents. FORMAT_HEX16, FORMAT_HEX32 can also be used.
    • FORMAT_IHEX Intel-Hex file
  • offset Memory start address. The value is an object with the following key/values
    • value (string) Start address value. Either in decimal format (for example, 805306368) or hexadecimal format (for example, 0x30000000)
    • magnitude (string) Unit to interpret the value field. Can be a value among BYTES, KBYTES, MBYTES.
  • size Memory size. The value is an object with the following key/values
    • value (string) Size of the memory. Either in decimal format (for example, 805306368) or hexadecimal format (for example, 0x30000000)
    • magnitude (string) Unit to interpret the value field. Can be a value among BYTES, KBYTES, MBYTES.
  • mode Defines whether the pool address is absolute or relative. For now, use USEMODE_ABSOLUTE; the relative mode is explained in other documents.
  • prop Defines multiple properties of the current pool, within an object. The available keys/values of this object are as follows:
    • rights States if the memory is read-only or read-write: ACC_READ, ACC_WRITE
    • throughput Memory throughput description: LOW, MID or HIGH. The description here is relative to other pools (this is not an absolute value).
    • latency Memory latency description: LOW, MID or HIGH. The description here is relative to other pools (this is not an absolute value).
    • byteWidth (uint32) Memory bus access width
    • freqRatio (float) Memory frequency ratio compared to the accelerators (that is, the NPU frequency) – larger values mean a slower memory frequency.
      • Absolute values are used for the estimations available in the reports
      • Relative ratios between pools are important for the NPU compiler to make decisions (that is, if the reports are not used, there is no need to change the values when the NPU frequency changes)
    • cacheable Is the pool cacheable? CACHEABLE_ON or CACHEABLE_OFF (default is CACHEABLE_OFF)
    • read_power (float) Read power for a single byte_width access in mW (power measured at nominal frequency and voltage) – used for reports
    • write_power (float) Active write power for a single byte_width access in mW (power measured at nominal frequency and voltage) – used for reports
    • constants_preferred (bool) Preferred memory pool for constant values (for example, weights). true or false (default: false)
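
For reference, a minimal pool entry combining the keys above could look as follows (a sketch only: the address, size, name, and power figures are illustrative placeholders):

{
    "fname": "myRAM",
    "name":  "myRAM",
    "fformat": "FORMAT_RAW",
    "mode":   "USEMODE_ABSOLUTE",
    "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW", "byteWidth": 8,
                "freqRatio": 1.25, "read_power": 18.5, "write_power": 16.2 },
    "offset": { "value": "0x34200000", "magnitude": "BYTES" },
    "size":   { "value": "448", "magnitude": "KBYTES" }
}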

“cacheinfo” key

This part describes the cache architecture.

The provided memory-pools files give a good representation of the cache used in STM32N6 (CACHEAXI), and should not be modified.

Virtual mem-pools considerations

The --enable-virtual-mem-pools option allows the compiler to merge contiguous memory pools into a single, virtual pool.

In the example above, the pools cpuRAM2 and npuRAM3-npuRAM6 are all contiguous (that is, the end of cpuRAM2 is the start of npuRAM3). When --enable-virtual-mem-pools is specified, the NPU compiler considers all those pools as a single one, and can thus, if needed, allocate buffers that span multiple “concrete” pools.
This is, in a way, equivalent to creating only one large entry in the memory-pool file that merges all these memory pools.

Using --enable-virtual-mem-pools enables the user to conveniently use the same .mpool file to fit different models with different memory requirements.
This convenience, however, comes at a cost: if the merged memories have very different characteristics, the resulting virtual pool inherits the worst of them, which may result in a suboptimal allocation of buffers. (When allocating buffers in the virtual memory pool, the characteristics of the underlying concrete pools are not considered.)

To overcome this issue, it might be useful to add holes (for example, 1 byte) between ranges of memory that have very different characteristics; this prevents the compiler from merging heterogeneous pools, as sketched below.
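
For instance, based on the example file above, shrinking cpuRAM2 by one byte leaves a 1-byte hole before npuRAM3 (which stays unchanged at 0x34200000), so the two pools are no longer merged. The prop key is omitted here for brevity:

{
    "fname": "AXISRAM2",
    "name":  "cpuRAM2",
    "fformat": "FORMAT_RAW",
    "offset": { "value": "0x34100000", "magnitude": "BYTES" },
    "size":   { "value": "0xFFFFF",    "magnitude": "BYTES" }
}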

ST Edge AI Core CLI as front end

The ST Edge AI Core CLI is required to use the NPU compiler. It applies the specific transformations and generates the intermediate representation (ONNX + JSON files). The ‘--st-neural-art’ option allows specifying a compilation-profiles JSON file containing different sets of options, named profiles, that are passed directly to the compiler.

NPU compiler as back-end

--st-neural-art STR

Argument syntax: empty or <profile>[@<compilation-profiles-file-path>]

Set a profile from a compilation-profiles file. This option is mandatory to enable the passes that analyze/generate a model for the NPU.

Argument Description
help Print the NPU compiler help.
(empty) Use the “default” profile.
profile-<tool_profile> Use a built-in tool profile (see next section).
<profilename>@<compilation-profiles-file-path> Use a custom profile.
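
For example, assuming that 'stedgeai' designates the ST Edge AI Core CLI executable and that my_profiles.json is a placeholder for a user compilation-profiles file, the default and a custom profile can be selected as follows:

$ stedgeai generate --model my_model.onnx --target stm32n6 --st-neural-art
$ stedgeai generate --model my_model.onnx --target stm32n6 --st-neural-art test2@./my_profiles.json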

Tool built-in profiles

The tool package provides a typical compilation-profiles file including different settings that can be selected with the 'profile-<tool_profile>' argument of the '--st-neural-art' option. It references memory-pool descriptor files that dedicate the major part of the internal memories and the external memory devices to the NPU memory subsystem.

  • Location of the compilation profile file: $STEDGEAI_CORE_DIR/Utilities/windows/targets/stm32/resources/neural_art.json.
  • Memory-pool descriptor file: $STEDGEAI_CORE_DIR/Utilities/windows/targets/stm32/resources/mpools/stm32n6.mpool.
Profile name Description
minimal Mandatory options only. No optimization. All available memories can be used ('stm32n6.mpool' file)
default Extends the minimal profile, enabling the optimization options to use the memory pools efficiently
allmems--O3 Adds optimization level 3 for speed
allmems--O3-autosched Adds the autosched option
allmems--auto Enables the auto option
internal-memories-only--default Default profile, but only the internal memories are allowed ('stm32n6__internal_memories_only.mpool' file)

Note

The provided profiles are generic settings and should be customized according to the application constraints and the imported models.

Front-end options

The following options impact the generated files.

Option Description
--name Overwrites the name of the files generated by the NPU compiler (and the symbols within the .c file)
--no-inputs-allocation, --no-outputs-allocation The input/output buffers are not allocated by the NPU compiler in the provided memory pools; they must be allocated/“linked” at runtime by the embedded application
--input-data-type, --output-data-type Used to change the data type of the input/output tensors of the deployed model
--inputs-ch-position, --outputs-ch-position Used to change the memory layout (NHWC vs. NCHW) of the input/output tensors of the deployed model
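
As an illustration, a generation run forcing integer inputs and application-managed I/O buffers could look as follows (a sketch only: the 'uint8' value is an assumption to be checked against the CLI documentation):

$ stedgeai generate --model my_model.onnx --target stm32n6 --st-neural-art default
           --input-data-type uint8 --no-inputs-allocation --no-outputs-allocation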

Compilation Profiles JSON File

The following JSON snippet shows a typical compilation-profiles file. This file is not in strict JSON format: comments are allowed and start with //.

{
    "Globals": {
    // This section is required but can be empty
    },
    "Profiles": {
        // Minimal options required for the generated file to work
        "minimal" : {
            "memory_pool": "./mpools/stm32n6.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt"
        },
        // Advised minimal options
        "default" : {
            "memory_pool": "./mpools/stm32n6.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os"
        },
        "test" : {
            "memory_pool": "./mpools/stm32n6.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --optimization 1"
        },
        "test2" : {
            "memory_pool": "./my_memory_pools/stm32n6_test.mpool",
            "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools\
                        --optimization 1 --Oauto-sched --enable-epoch-controller"
        }
    }
}

This file has two sections: "Globals" and "Profiles".

  • The "Globals" section is used for debug purposes. It is optional, and can be removed from the profile file.
  • The "Profiles" section allows defining multiple named profiles that are selectable through the '--st-neural-art' option of the ST Edge AI CLI.

A profile is a combination of:

  • A path to a memory-pool descriptor file (.mpool file).
  • A list of options to be passed to the NPU compiler.
  • Optional: A path to a machine description file (.mdesc file) - by default, the stm32n6 description file is used.

Tips, variations around the basic use case

Generic recommendations

  • When using small-enough models, refrain from using the external memories: any transaction with an external memory is time-consuming!
  • Remember that --enable-virtual-mem-pools, though convenient, may have massive disruptive effects on the inference time.
    • During the optimization steps, try to remove this option from the compiler’s parameters.

NPU compiler options

  • Experiment with the options: many options can be added and may (or may not) provide beneficial effects.
  • After a successful --Oauto compilation of a model, note down the “best” options used and pass them as compiler arguments for the next compilations; this saves compilation time during development.

Memory-pools

Warning

Always remember to keep disjoint “memory spaces” available for the NPU compiler (which allocates the buffers and constants of the network) and for the firmware compiler (which allocates space for the program, variables, and so on) to prevent memory issues.

  • Carving up memory to improve the performance of a model is a time-consuming activity.
    • Start with basic memory pools, for example with virtual memory pools enabled.
    • Check whether it seems possible to fit the whole model in the internal memories.
      • If the firmware is small enough, it is possible to evolve from the example mpool by adding space before the AXISRAM2 pool (do not forget to remove this memory space from the linker script).
    • When trying to optimize the memory usage/allocation for speed, try to split the virtually merged pools (by adding small holes between them).
      • Ultimately, try not to use the virtual memory pools.
  • For development purposes:
    • An external RAM can easily be used instead of an external flash (for example, by setting the size of the external flash to 0 and adding constants_preferred=true to the external RAM characteristics) if both types of memories are available.
    • This prevents wear of the flash, makes loading the weights onto the board quicker, and so on.
      • Be aware that this can also make the inference faster (the external RAM on the Discovery kit board can use hexa-SPI vs. octo-SPI for the external flash).
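
As a sketch based on the example mpool above (where hyperRAM already has "constants_preferred": "true"), disabling the external flash amounts to setting its size to 0:

{
    "fname": "xSPI2",
    "name":  "octoFlash",
    "fformat": "FORMAT_RAW",
    "prop":   { "rights": "ACC_READ",  "throughput": "MID", "latency": "HIGH", "byteWidth": 1,
                "freqRatio": 6.00, "cacheable": "CACHEABLE_ON", "read_power": 110, "write_power": 400.0,
                "constants_preferred": "true" },
    "offset": { "value": "0x70000000", "magnitude":  "BYTES" },
    "size":   { "value": "0",          "magnitude": "MBYTES" }
}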