ST Edge AI Core

ST Neural-ART NPU - Runtime loadable model support

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r1.2

Introduction

This article is based on the Python scripts v1.3, located in the $STEDGEAI_CORE_DIR/scripts/N6_reloc directory.

What is a runtime loadable model?

A runtime loadable model, also known as a relocatable binary model, is a binary object that can be installed and executed at runtime within an STM32 memory subsystem. This model includes a compiled version of the generated neural network (NN) C-files, encompassing the necessary forward kernel functions and weights/parameters. Its primary purpose is to offer a flexible method for upgrading AI-based applications without the need to regenerate and flash the entire end-user firmware. This approach is particularly useful for technologies such as firmware over-the-air (FOTA).

The binary object can be seen as a lightweight plug-in, capable of running from any address (position-independent code) and storing its data anywhere in memory (position-independent data). An efficient and minimal dynamic loader enables the instantiation and usage of this model. Unlike traditional systems, the STM32 firmware does not embed a complex and resource-intensive dynamic linker for Arm Cortex®-M microcontrollers. The generated object is mainly self-contained, requiring limited and well-defined external symbols (NPU RT dependent) at runtime.

In comparison with the software-only STM32 solution (fully self-contained object), the runtime loadable model for the Neural-ART accelerator must be installed inside an NPU runtime software stack. This allows supporting multiple instances, including the static version. No specific synchronization mechanism is embedded in the relocatable part; instead, the scheduling and access to the hardware resources (NPU subsystem) are managed by the stack itself (static part).

NPU stack with relocatable models

The generated runtime loadable model is a container that includes relocatable code (text/data/rodata sections) to configure the various NPU epochs. It mainly comprises a compiled version (MCU-dependent) of the specialized network.c file, the LL ATON driver functions used, and the code for purely delegated software operations, which are part of the optimized network runtime library. For hybrid epochs (calls to LL_ATON_LIB_xxx functions), a specific callback-based mechanism is in place to directly invoke the services of the stack. These system callback functions are provided by the static part of the NPU stack and are registered during the install phase of the model (see the ll_aton_reloc_install() function).

Solution overview - Memory model

As illustrated by the following figure, to use runtime loadable models, a part of the internal RAMs (RW AI regions) is fixed/reserved and has absolute addresses. These regions should not be used by the application during inference. A non-fixed executable RAM region is required to install a given model; this RW region is used to resolve the relocatable references during the installation process at runtime.

Platform memory layout
  • executable RAM region: Designates the memory region used to install a model at runtime (txt/data sections). This MCU/NPU memory-mapped region, located anywhere in the memory subsystem, can be reserved at runtime by the application. The minimum requested size is returned by the ll_aton_reloc_get_info() function. Attributes: MCU RWX, NPU RO. For performance reasons, the region should be fully cached (MCU). (*)
  • RW AI regions: Designates the memory regions which are reserved for the activations/parameters. Base addresses are absolute and are defined in the mem-pool descriptor files. Attributes: MCU/NPU memory-mapped, MCU RW, NPU RW. To know/check the memory regions which are used, the ll_aton_reloc_get_mem_pool_desc() function can be used. (**)
  • RW AI region (external RAM): If requested, designates the memory regions which are allocated/reserved to place the requested part of the memory pool defined for the external RAM. The base address can be relative (placed anywhere in the external RAM) or absolute. The referenced addresses are resolved during the install process at runtime. It can be reserved at runtime by the application. The requested size is returned by the ll_aton_reloc_get_info() function. Attributes: MCU RWX, NPU RW. (**)
  • Model X (weights/params): Designates the memory regions where the relocatable binary models are stored. Base addresses (file_ptr) are relative and depend on the application managing the FOTA-like mechanism. Attributes: MCU/NPU memory-mapped, MCU RO, NPU RO.
  • App: Designates the memory regions which are reserved for the application (static part). It embeds a minimal part (NN load service) allowing to load/install and execute a relocatable model.

(*) This region must also be accessible by the NPU in the case where the epoch controller is enabled.
(**) These regions should also be accessible by the MCU in the case where an operator/epoch is delegated to the MCU.
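
As a reference, the following minimal sketch shows how an application could query these sizes at runtime before reserving the regions, using only the ll_aton_reloc_get_info() function and the ll_aton_reloc_info fields described later in this article. The reserve_exec_memory_region() and reserve_ext_memory_region() helpers are application-side allocators, as in the "Minimal code" section below.

#include "ll_aton_reloc_network.h"

/* Sketch: size and reserve the memory regions for a model flashed at 'file_ptr'.
   reserve_exec_memory_region()/reserve_ext_memory_region() are application-provided
   helpers (for example, wrappers around a static pool allocator). */
int user_prepare_regions(const uintptr_t file_ptr, uintptr_t *exec_addr, uintptr_t *ext_addr)
{
  ll_aton_reloc_info rt;

  if (ll_aton_reloc_get_info(file_ptr, &rt) != 0)
    return -1;                                      /* invalid or incompatible image */

  /* COPY mode is assumed here; use rt.rt_ram_xip for the XIP mode */
  *exec_addr = reserve_exec_memory_region(rt.rt_ram_copy);

  /* The external RW region is only needed when the model uses a relocatable external RAM pool */
  *ext_addr = (rt.ext_ram_sz > 0) ? reserve_ext_memory_region(rt.ext_ram_sz) : 0;

  return (*exec_addr != 0) ? 0 : -1;
}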

Limitations

  • Only the STM32N6-based series with the Neural-ART accelerator is supported.
  • Addresses of the internal/on-chip memory pools are fixed/absolute ('USEMODE_ABSOLUTE' attribute). Only the addresses relative to the off-chip memories (flash and RAM) are relocatable/relative ('USEMODE_RELATIVE' attribute).
  • Only two relocatable/relative memory pools are supported: one for the RO region handling the weights/params and another for an external RW memory region. Note that the external RAM region can also be absolute.

Setting up a work environment

The generation of a runtime loadable model is not integrated into the ST Edge AI Core CLI. Instead, a set of specific Python scripts is provided in the $STEDGEAI_CORE_DIR/scripts/N6_reloc directory. The npu_driver.py script serves as the entry point for generating the runtime loadable model.

$STEDGEAI_CORE_DIR (%STEDGEAI_CORE_DIR% on Windows) represents the root location where the ST Edge AI Core components are installed, typically a path like "<tools_dir>/STEdgeAI/<version>/".

Requirements

To use these scripts, you need Python 3.9+ and the following Python modules:

pyelftools==0.27  
tabulate  
colorama  

Additionally, ensure that the Make utility and a GNU Arm Embedded toolchain supporting the Arm® Cortex®-M55 (with the 'arm-none-eabi-' prefix) are available in your PATH.

Note

The Python interpreter available in the ST Edge AI Core pack can be used directly (all required Python modules are already installed).

Optional Tools

For validating a runtime loadable model on the STM32N6570-DK board, you may optionally install:

  • STM32CubeIDE (its tools are used to flash the binaries and to load and run the validation firmware).

The STM32_CUBE_IDE_DIR system environment variable can be set to indicate the installation folder.

Getting started

Generating a RT loadable model

The generation of a runtime loadable model involves two main steps. The first step generates the specialized c-files and the associated memory initializers. The second step generates the binary object (or relocatable module) representing the runtime loadable model.

Generating the specialized c-files

The first step involves generating the specialized files and the associated memory initializers. There are no restrictions on the options used for this step, except that the --all-buffers-info option must be used to obtain detailed information about the intermediate/activation buffers. Note that this extra generated information is removed during the generation of the runtime loadable model. For the memory-pool descriptor file, the following attributes are mandatory:

  • "mode": "USEMODE_RELATIVE“ for external memory pools only (flash and ram). Otherwise, use 'USEMODE_ABSOLUTE'. Note that since v1.3, the external RAM region can be also absolute.
  • "fformat": "FORMAT_RAW“ for all memory initializers.

Generate the specialized c-files:

$ stedgeai generate -m <quantize_model> --target stm32n6 --st-neural-art <profile>@<conf_file>
  • By default, the generated files are stored in the ./st_ai_output folder:

    network.c
    network_atonbuf.xSPI2.raw
    network_c_info.json
  • Note that the memory initializers for the memory regions which are only used for the activations are not required.

Generating the relocatable binary model

The second step involves postprocessing and compiling the output from the first step to generate a binary object (or relocatable module) representing the runtime loadable model. This step ensures that the final code/data size is accurate and ready for deployment.

The npu_driver.py script is the entry point for executing the steps required to generate the loadable model. The '--input/-i' option is used to specify the location of the generated "network.c" file and the associated memory initializers (*.raw files); the default value is ./st_ai_output. The '--output/-o' option is used to specify the output folder (default: ./build).

[Details] npu_driver.py script
usage: npu_driver.py [-h] [--input STR] [--output STR] [--name STR] [--no-secure] [--no-dbg-info]
                     [--ecblob-in-params] [--split] [--llvm] [--st-clang] [--compatible-mode]
                     [--custom [STR]] [--cross-compile STR]
                     [--gen-c-file] [--parse-only] [--no-clean]
                     [--log [STR]] [--verbosity [{0,1,2}]] [--debug] [--no-color]

NPU Utility - Relocatable model generator v1.3

optional arguments:
  -h, --help            show this help message and exit
  --input STR, -i STR   location of the generated c-files (or network.c file path)
  --output STR, -o STR  output directory
  --name STR, -n STR    basename of the generated c-files (default=<network-file-name>)
  --no-secure           generate binary model for non secure context
  --no-dbg-info         generate binary model without LL_ATON_EB_DBG_INFO
  --ecblob-in-params    place the EC blob in param section
  --split               generate a separate binary file for the params/weights
  --llvm                use LLVM compiler and libraries (default: GCC compiler is used)
  --st-clang            use ST CLANG compiler and libraries (default: GCC compiler is used)
  --compatible-mode     set the compatible option (target dependent)
  --custom [STR]        config file for custom build (default: custom.json)
  --cross-compile STR   prefix of the ARM tool-chain (CROSS_COMPILE env variable can be used)
  --gen-c-file          generate c-file image (DEBUG PURPOSE)
  --parse-only          parsing only the generated c-files
  --no-clean            Don't clean the intermediate files
  --log [STR]           log file
  --verbosity [{0,1,2}], -v [{0,1,2}]
                        set verbosity level
  --debug               Enable internal log (DEBUG PURPOSE)
  --no-color            Disable log color support
$ python $STEDGEAI_CORE_DIR/scripts/N6_reloc/npu_driver.py -i st_ai_output/network.c -o build
   ...
   XIP size      = 29,008    (0x7150) data+got+bss sections
   COPY size     = 173,512   (0x2a5c8) +ro sections
   PARAMS offset = 176,064   (0x2afc0)
   PARAMS size   = 1,793,272 (0x1b5cf8)

   ┌────────────────────────┬────────────────────────────────┬────────┬──────────┬─────────┐
   │ name (addr)            │ flags                          │ foff   │ dst      │ size    │
   ├────────────────────────┼────────────────────────────────┼────────┼──────────┼─────────┤
   │ xSPI2 (20009f84)       │ 01010500 RELOC.PARAM.0.RCACHED │ 0      │ 00000000 │ 1793265 │
   │ AXISRAM5 (20009f8a)    │ 03020200 RESET.ACTIV.WRITE     │ 0      │ 342e0000 │ 352800  │
   │ <undefined> (00000000) │ 00000000 UNUSED                │ 0      │ 00000000 │ 0       │
   └────────────────────────┴────────────────────────────────┴────────┴──────────┴─────────┘
    Table: mempool c-descriptors (off=40005800, 3 entries, from RAM)

   rt_ctx: c_name="network", acts_sz=352,800, params_sz=1,793,265, ext_ram_sz=0
   rt_ctx: rt_version_desc="atonn-v1.1.1-13-g0b0b607eb (RELOC.GCC)"

   Generating files...
    creating "build\network_rel.bin" (size=1,969,336)

Final information returned by the npu_driver.py:

item description
XIP size Indicates the size (in bytes) of the requested executable RAM memory region (XIP mode)
COPY size Indicates the size (in bytes) of the requested executable RAM memory region (COPY mode)
PARAMS size Indicates the size (in bytes) of the weights/params sections
rt_ctx Indicates the information extracted from the binary header
Layout of the runtime loadable module

The “mempool c-descriptors” table indicates the memory regions (part of the user memory-pools) and the flags which are considered by the ll_aton_reloc_install() function at runtime.

flag description
RELOC Indicates a relocatable region; the address (dst=0) is resolved at runtime during the installation process.
PARAM/ACTIV/MIXED Indicates the type of contents: PARAM: params/weights only, ACTIV: activations only, MIXED: mixed
RCACHED/WCACHED Indicates that a part of the memory region can be accessed through the NPU cache. RCACHED is associated with a RELOC.PARAM/read-only region.
WRITE Indicates that the memory region is a read-write memory region.
RESET Indicates that the memory region can be cleared if the AI_RELOC_RT_LOAD_MODE_CLEAR option is used.
COPY Indicates that the region is initialized/copied during the installation process.
UNUSED Last entry in the mempool c-descriptors.
  • The number '0' or '1' indicates the ID of the relocatable memory regions. Currently only two regions are supported: '0' for a params/weights-only region in the external flash and '1' for a read/write memory region in the external RAM.
  • 'foff' indicates the offset in the 'params/weights' section where the associated memory initializer is found, when requested.
  • 'dst' indicates the destination address. If not equal to zero, the address is an absolute address; otherwise, the region is a relocatable region.
  • 'size' indicates the size in bytes.
Example with a tiny model using only the internal NPU RAM for the activations and weights/params.

At runtime, during the installation process, AXISRAM2, 3, and 4 (absolute addresses) are initialized with the contents of the params/weights section. AXISRAM5 is only used for the activations.

   XIP size      = 14,192    (0x3770) data+got+bss sections
   COPY size     = 173,576   (0x2a608) +ro sections
   PARAMS offset = 174,888   (0x2ab28)
   PARAMS size   = 1,966,080 (0x1e0000)

   ┌────────────────────────┬────────────────────────────┬────────┬──────────┬─────────┐
   │ name (addr)            │ flags                      │ foff   │ dst      │ size    │
   ├────────────────────────┼────────────────────────────┼────────┼──────────┼─────────┤
   │ AXISRAM5 (20009f84)    │ 03020200 RESET.ACTIV.WRITE │ 0      │ 342e0000 │ 180000  │
   │ AXISRAM4 (20009f8d)    │ 02030200 COPY.MIXED.WRITE  │ 0      │ 34270000 │ 458752  │
   │ AXISRAM3 (20009f96)    │ 02030200 COPY.MIXED.WRITE  │ 458752 │ 34200000 │ 458752  │
   │ AXISRAM2 (20009f9f)    │ 02010100 COPY.PARAM.READ   │ 917504 │ 34100000 │ 1048572 │
   │ <undefined> (00000000) │ 00000000 UNUSED            │ 0      │ 00000000 │ 0       │
   └────────────────────────┴────────────────────────────┴────────┴──────────┴─────────┘
    Table: mempool c-descriptors (off=40001df8, 5 entries, from RAM)

   rt_ctx: c_name="network", acts_sz=352,815, params_sz=1,793,261, ext_ram_sz=0
   rt_ctx: rt_version_desc="atonn-v1.1.1-13-g0b0b607eb (RELOC.GCC)"

   Generating files...
    creating "build\network_rel.bin" (size=2,140,968)
Example with a model using only the external RAM/FLASH (internal RAMs are not used)

At runtime during the installation process, the references relative to the xSPI1/xSPI2 (relative address) are respectively resolved to the external RAM address (reserved by the application) and the params/weights section (part of the installed relocatable module).

   XIP size      = 70,984    (0x11548) data+got+bss sections
   COPY size     = 174,816   (0x2aae0) +ro sections
   PARAMS offset = 179,784   (0x2be48)
   PARAMS size   = 1,793,272 (0x1b5cf8)

   ┌────────────────────────┬────────────────────────────────┬────────┬──────────┬─────────┐
   │ name (addr)            │ flags                          │ foff   │ dst      │ size    │
   ├────────────────────────┼────────────────────────────────┼────────┼──────────┼─────────┤
   │ xSPI1 (2000a4a0)       │ 01020601 RELOC.ACTIV.1.WCACHED │ 0      │ 00000000 │ 352800  │
   │ xSPI2 (2000a4a6)       │ 01010500 RELOC.PARAM.0.RCACHED │ 0      │ 00000000 │ 1793265 │
   │ <undefined> (00000000) │ 00000000 UNUSED                │ 0      │ 00000000 │ 0       │
   └────────────────────────┴────────────────────────────────┴────────┴──────────┴─────────┘
    Table: mempool c-descriptors (off=4000fbf8, 3 entries, from RAM)

   rt_ctx: c_name="network", acts_sz=352,800, params_sz=1,793,265, ext_ram_sz=352,800
   rt_ctx: rt_version_desc="atonn-v1.1.1-13-g0b0b607eb (RELOC.GCC)"

Epoch controller consideration

If the epoch controller is enabled, the generated command streams (also called ecblobs) are stored as const data in the rodata section by default. The npu_driver.py script reports an extra table giving the size of each blob (_ec_blob_X). To be able to update them at runtime with the references of the relocatable memory pools, extra bss size is reserved (reloc=r:binary). Consequently, to install a model, the size of the requested executable RAM (XIP or COPY mode) is significantly larger. If a specific blob references only absolute addresses, no extra bss is reserved and the blob is used directly from its rodata location (for example, _ec_blob_Default_116/118 in the table below).

With epoch controller support:

   XIP size      = 247,448   (0x3c698) data+got+bss sections
   COPY size     = 504,192   (0x7b180) +ro sections
   PARAMS offset = 260,048   (0x3f7d0)
   PARAMS size   = 1,758,216 (0x1ad408)

  ...
   ┌──────────────────────┬─────────┬───────────┬──────────┐
   │ Name                 │ bss     │ ro data   │ reloc    │
   ├──────────────────────┼─────────┼───────────┼──────────┤
   │ _ec_blob_Default_1   │ 123,896 │ 124,248   │ r:binary │
   │ _ec_blob_Default_63  │ 22,176  │ 22,272    │ r:binary │
   │ _ec_blob_Default_74  │ 21,064  │ 21,168    │ r:binary │
   │ _ec_blob_Default_85  │ 70,912  │ 71,112    │ r:binary │
   │ _ec_blob_Default_110 │ 6,576   │ 6,648     │ r:binary │
   │ _ec_blob_Default_116 │         │ 368       │          │
   │ _ec_blob_Default_118 │         │ 832       │          │
   │                      │         │           │          │
   │ total                │ 244,624 │ 246,648   │          │
   └──────────────────────┴─────────┴───────────┴──────────┘
    Table: EC blob objects (7)

Same model, with epoch controller support and the --ecblob-in-params option. The 'COPY size' is decreased because the const data no longer includes the ecblobs. The non-patched ecblobs are fetched from the flash by the epoch controller.

   XIP size      = 247,464   (0x3c6a8) data+got+bss sections
   COPY size     = 257,504   (0x3ede0) +ro sections
   PARAMS offset = 13,384    (0x3448)
   PARAMS size   = 2,004,864 (0x1e9780)

Same model, without epoch controller support

   XIP size      = 29,008    (0x7150) data+got+bss sections
   COPY size     = 173,512   (0x2a5c8) +ro sections
   PARAMS offset = 176,064   (0x2afc0)
   PARAMS size   = 1,793,272 (0x1b5cf8)

Weights/params encryption consideration

If the model is generated with the '--encrypt-weights' option to support encrypted weights/params, the weights/params (network_atonbuf.xSPI2.raw file) should be encrypted before generating the relocatable binary model, as for the nonrelocatable model. It is recommended to use the '--split' option to be able to fix the address of the weights/params in the flash, because the encryption depends on the location.

To use a model with encrypted weights/params, the index/keys for the different bus interfaces must be set before executing the model:

...
  LL_ATON_RT_Reset_Network(&nn_instance);

  // Set bus interface keys -- used for encrypted inference only
  LL_Busif_SetKeys(0, 0, BUSIF_LSB_KEY, BUSIF_MSB_KEY);
  LL_Busif_SetKeys(0, 1, BUSIF_LSB_KEY, BUSIF_MSB_KEY);
  LL_Busif_SetKeys(1, 0, BUSIF_LSB_KEY, BUSIF_MSB_KEY);
  LL_Busif_SetKeys(1, 1, BUSIF_LSB_KEY, BUSIF_MSB_KEY);

  do {
    /* Execute first/next step of Cube.AI/ATON runtime */
    ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&nn_instance);
    ...

Extra options

The following additional options can be specified:

  • --no-secure: Indicates that the generated runtime loadable model will be executed in a nonsecure context. The setting of the NPU processing units then uses the nonsecure aliased address (NPU_BASE_NS); otherwise, the secure aliased address (NPU_BASE_S) is used.
  • --split: Specifies that the parameters/weights sections should be generated in a separate binary file.
  • --pack-dir: If the STEDGEAI_CORE_DIR system environment variable is not defined, this option is used to define the root location of the ST Edge AI Core installation folder.

“--split” option

This option allows generating two separate files. In this case, both files must be deployed on the target, and the address of the "network_rel_params.bin" file must be passed during the installation process at runtime (see the ll_aton_reloc_install() function and the sketch after the example below).

$ python $STEDGEAI_CORE_DIR/scripts/N6_reloc/npu_driver.py -i st_ai_output/network.c -o build --split
...
XIP size      = 28,472    (0x6f38) data+got+bss sections
COPY size     = 169,488   (0x29610) +ro sections
PARAMS offset = 0         (0x0)
PARAMS size   = 0         (0x0)

┌────────────────────────┬────────────────────────────────┬────────┬──────────┬─────────┐
│ name (addr)            │ flags                          │ foff   │ dst      │ size    │
├────────────────────────┼────────────────────────────────┼────────┼──────────┼─────────┤
│ xSPI2 (2000a3bf)       │ 01010500 RELOC.PARAM.0.RCACHED │ 0      │ 00000000 │ 1793265 │
│ AXISRAM5 (2000a3c5)    │ 03020200 RESET.ACTIV.WRITE     │ 0      │ 342e0000 │ 352800  │
│ <undefined> (00000000) │ 00000000 UNUSED                │ 0      │ 00000000 │ 0       │
└────────────────────────┴────────────────────────────────┴────────┴──────────┴─────────┘
Table: mempool c-descriptors (off=400055b4, 3 entries, from RAM)

Generating files...
  creating "build\network_rel.bin" (size=172,032)
  creating "build\network_rel_params.bin" (size=1,793,272)

“--ecblob-in-params” option

The --ecblob-in-params option indicates that the ecblobs (const part) are placed in the params/weights memory section. This feature reduces the executable memory region requested to execute the model in COPY mode. The ecblobs are fetched from the external flash if they are not copied into the executable RAM for update. Ecblobs are placed at the beginning of the params/weights memory section. The --split option is still supported; in this case, the params file includes the ecblobs.

“--name/-n” option

The --name/-n option allows specifying/overriding the expected c-name/file-name of the loadable runtime model. By default, the name of the generated network files is used.

Default behavior.

$ python npu_driver.py -i <gen-dir>/network.c 
...
Generating files...
    creating "build\network_rel.bin" (size=..)

$ python npu_driver.py -i <gen-dir>/my_model.c
...
Generating files...
    creating "build\network_rel.bin" (size=..)
$ python npu_driver.py -i <gen-dir>/network.c -n toto
...
Generating files...
    creating "build\toto_rel.bin" (size=..)

Format of the relocatable binary model

The following figure illustrates the layout of the generated relocatable binary model ("network_rel.bin" file). By default, the memory initializers (params/weights sections) are included in the image. If the --split option is used, the params/weights sections are generated in a separate binary file ("network_rel_params.bin" file).

  • If the epoch controller is enabled, the generated bitstreams are included in the rodata/data sections by default. The --ecblob-in-params option can be used to store the ecblobs with the params section.
  • The description of the params/weights sections is defined in the rodata section.
  • When the '--split' option is used, the address ('ext_params_add') is passed and defined by the application when the ll_aton_reloc_install() function is called.

Evaluating the RT loadable model (STM32N6570-DK board)

A ready-to-use environment for the STM32N6570-DK board (DEV mode) is delivered in the ST Edge AI Core pack. It allows performing a classical validation, either through the 'validate' command or through the stm_ai_runner Python package module. Note that an STM32CubeIDE environment must be available.

Warning

Set the boot mode to development mode (BOOT1 switch in position 1-3; the BOOT0 switch position does not matter). After the loading phase, the board must NOT be switched off or disconnected in order to perform the validation.

After the generation of the relocatable binary model, the st_load_and_run.py Python script is used to flash the binary files at fixed addresses, and to load and run the aiValidation firmware. After these steps, a single inference is executed and the performance is reported.

[Details] st_load_and_run.py script
usage: st_load_and_run.py [-h] [--input STR] [--board STR] [--address STR] [--mode STR] [--cube-ide-dir STR]
       [--log [STR]] [--verbosity [{0,1,2}]] [--debug] [--no-color]

NPU Utility - ST Load and run (dev environment) v1.2

optional arguments:
  -h, --help            show this help message and exit
  --input STR, -i STR   location of the binary files (default: build/network_rel.bin)
  --board STR           ST development board (default: stm32n6570-dk)
  --address STR         destination address - model(,params) (default: 0x71000000,0x71800000)
  --mode STR            fw variants: copy,xip[no-flash,no-run,usbc,ext,max-speed]
  --cube-ide-dir STR    installation directory of STM32CubeIDE tools (ex. ~/ST/STM32CubeIDE_1.18.0/STM32CubeIDE)
  --log [STR]           log file
  --verbosity [{0,1,2}], -v [{0,1,2}]
                        set verbosity level
  --debug               enable internal log (DEBUG PURPOSE)
  --no-color            disable log color support
$ python $STEDGEAI_CORE_DIR/scripts/N6_reloc/st_load_and_run.py
NPU Utility - ST Load and run (dev environment) (version 1.2)
Creating date : Mon Apr 14 23:14:20 2025

Entry point    : 'build\network_rel.bin'
Board          : 'stm32n6570-dk'
mode           : ['copy']
split model    : 'build\network_rel_params.bin'
clang mode     : False
exec size      : XIP=247,464, COPY=257,504

board size     : int=655,360, ext=4,194,304
install mode   : 'copy'

Resetting the board.
Flashing 'build\network_rel.bin' at address 0x71000000 (size=2136936)..
Loading & start the validation application 'stm32n6570-dk-validation-reloc-copy'..
Deployed model is started and ready to be used.
Executing the deployed model (desc=serial:921600)..
...
  125     -      epoch (HYBRID)   0.071    0.6%    99.5%  [        81         41     56,540 ]   EpochBlock_117
  126     -      epoch            0.025    0.2%    99.7%  [     2,880     15,560      1,323 ]   EpochBlock_118
  127     -      epoch            0.032    0.3%   100.0%  [     7,456     17,432        747 ]   EpochBlock_119
  -------------------------------------------------------------------------------------------------------------
  total                          11.808                   [ 1,948,894  7,004,597    493,125 ]
                            84.69 inf/s                   [     20.6%      74.1%       5.2% ]
  -------------------------------------------------------------------------------------------------------------

Evaluating the performance

$ stedgeai validate -m <quantize_model> --target stm32n6 --mode target -d serial:921600
...
Evaluation report (summary)
 -------------------------------------------------------------------------------------------------------------
 Output       acc    rmse         ...  std        nse        cos        tensor
 -------------------------------------------------------------------------------------------------------------
 X-cross #1   n.a.   0.007084151  ...  0.007081   0.999671   0.999910   10 x uint8(1x3087x6)
 -------------------------------------------------------------------------------------------------------------

Deploy and use a relocatable model

There is no specific service for deploying the model on a target; FOTA-like mechanisms and other stacks to manage the firmware or models are application-specific. However, when the relocatable model is flashed on the target at a given memory-mapped address (file_ptr), the ll_aton_reloc_install() function must be called to install the model.

LL_ATON stack configuration

LL_ATON_XX C-defines comment
LL_ATON_PLATFORM 'LL_ATON_PLAT_STM32N6' mandatory
LL_ATON_EB_DBG_INFO mandatory
LL_ATON_RT_RELOC mandatory - Enables the code paths and functionalities required to manage the relocatable mode.
LL_ATON_RT_MODE LL_ATON_RT_ASYNC is recommended but LL_ATON_RT_POLLING can be used.
LL_ATON_OSAL no restriction
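
For illustration, a minimal sketch of these C-defines, assuming they are grouped in a project configuration header (they can equally be passed as compiler -D flags; the exact mechanism is a build-system choice):

/* Sketch: LL ATON build configuration enabling the relocatable-model support */
#define LL_ATON_PLATFORM     LL_ATON_PLAT_STM32N6   /* mandatory */
#define LL_ATON_EB_DBG_INFO                         /* mandatory */
#define LL_ATON_RT_RELOC                            /* mandatory: relocatable mode support */
#define LL_ATON_RT_MODE      LL_ATON_RT_ASYNC       /* recommended (LL_ATON_RT_POLLING is also possible) */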

Minimal code

The following code snippet illustrates how to install and use a runtime loadable model within a bare-metal environment: a single network with a single input tensor and a single output tensor (with or without the epoch controller). The user_model_mgr() function is used to retrieve the address where the module has been flashed.

#include "ll_aton_reloc_network.h"

static NN_Instance_TypeDef nn_instance;  /* LL ATON handle */

uint8_t *input_0, *prediction_0;
uint32_t input_size_0, prediction_size_0;

int ai_init(const uintptr_t file_ptr, const uintptr_t file_params_ptr)
{
  const LL_Buffer_InfoTypeDef *ll_buffer;
  ll_aton_reloc_info rt;
  int res;

  /* Retrieve the requested RAM size to install the model */
  res = ll_aton_reloc_get_info(file_ptr, &rt);
  /* Reserve executable memory region to install the model */
  uintptr_t exec_ram_addr = reserve_exec_memory_region(rt.rt_ram_copy);
  /* Reserve external read/write memory region for external RAM region */
  uintptr_t ext_ram_addr = 0;
  if (rt.ext_ram_sz > 0)
    ext_ram_addr = reserve_ext_memory_region(rt.ext_ram_sz);

  /* Create and install an instance of the relocatable model */
  ll_aton_reloc_config config;
  config.exec_ram_addr = exec_ram_addr;
  config.exec_ram_size = rt.rt_ram_copy;
  config.ext_ram_addr = ext_ram_addr;
  config.ext_ram_size = rt.ext_ram_sz;
  config.ext_param_addr = 0;     /* or file_params_ptr (@ of the weights/params) if the --split mode is used */
  config.mode = AI_RELOC_RT_LOAD_MODE_COPY; // | AI_RELOC_RT_LOAD_MODE_CLEAR;

  res = ll_aton_reloc_install(file_ptr, &config, &nn_instance);

  if (res == 0)
  {
    /* Retrieve the addresses of the input/output buffers */
    ll_buffer = ll_aton_reloc_get_input_buffers_info(&nn_instance, 0);
    input_0 = LL_Buffer_addr_start(ll_buffer);
    input_size_0 = LL_Buffer_len(ll_buffer);
    ll_buffer = ll_aton_reloc_get_output_buffers_info(&nn_instance, 0);
    prediction_0 = LL_Buffer_addr_start(ll_buffer);
    prediction_size_0 = LL_Buffer_len(ll_buffer);

    /* Init the LL ATON stack and the instantiated model */
    LL_ATON_RT_RuntimeInit();
    LL_ATON_RT_Init_Network(&nn_instance);
  }
  return res;
}

void ai_deinit(void)
{
  LL_ATON_RT_DeInit_Network(&nn_instance);
  LL_ATON_RT_RuntimeDeInit();
}

void ai_run(void) {
  LL_ATON_RT_RetValues_t ll_aton_ret;
  LL_ATON_RT_Reset_Network(&nn_instance);
  do {
    ll_aton_ret = LL_ATON_RT_RunEpochBlock(&nn_instance);
    if (ll_aton_ret == LL_ATON_RT_WFE) {
      LL_ATON_OSAL_WFE();
    }
  } while (ll_aton_ret != LL_ATON_RT_DONE);
}

void main(void)
{
  uintptr_t file_ptr, file_params_ptr;

  user_system_init();  /* HAL, clocks, NPU sub-system... */

  user_model_mgr(&file_ptr, &file_params_ptr);
  if (ai_init(file_ptr, file_params_ptr)) {
    /*... installation issue ..*/
  }

  while (user_app_not_finished()) {
    /* Fill input buffers */
    user_fill_inputs(input_0);
    /* If requested, perform the NPU/MCU cache operations to guarantee the coherency of the memory. */
    //  LL_ATON_Cache_MCU_Clean_Invalidate_Range(input_0, input_size_0);
    //  LL_ATON_Cache_MCU_Invalidate_Range(prediction_0, prediction_size_0);
    /* Perform a complete inference */
    ai_run();
    /* Post-process the predictions */
    user_post_process(prediction_0);
  }
  ai_deinit();
  //...
}

XIP/COPY execution modes

XIP/COPY mode support

XIP execution mode

This mode allows the code to be executed directly from the memory where it is stored, without copying it to another location. This mode is efficient in terms of memory usage: only the executable RAM region to store the data/bss/got sections is required.

COPY execution mode

This mode involves copying the code to a different memory location before execution. This can be useful for optimizing performance or managing memory access speeds. The required size for the executable RAM region is larger: in addition to the data/bss/got sections, it also includes the txt/rodata sections.
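
As a short illustration, here is a sketch of how an application could select the execution mode and the matching executable RAM budget from the rt_ram_xip/rt_ram_copy values returned by ll_aton_reloc_get_info(); the use_xip flag is a hypothetical application-level choice.

#include "ll_aton_reloc_network.h"

/* Sketch: configure the execution mode and the matching executable RAM size. */
static void user_select_exec_mode(const ll_aton_reloc_info *rt, int use_xip,
                                  ll_aton_reloc_config *config)
{
  if (use_xip) {
    config->mode          = AI_RELOC_RT_LOAD_MODE_XIP;   /* code is executed in place */
    config->exec_ram_size = rt->rt_ram_xip;              /* data/bss/got sections only */
  } else {
    config->mode          = AI_RELOC_RT_LOAD_MODE_COPY;  /* code is copied before execution */
    config->exec_ram_size = rt->rt_ram_copy;             /* data/bss/got + txt/rodata sections */
  }
}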

Example of NPU compiler configuration files

Memory-pool descriptor files

The "%STEDGEAI_CORE_DIR%/scripts/N6_reloc/test" contains a set of configuration and memory-pool descriptor files which can be used. Here are two main points to consider:

  • If a nonsecure context is used to execute the deployed model, the base @ of the different memory pools should be set with a nonsecure address (for example, 0x24350000 instead of 0x34350000 for the AXISRAM6 memory).
  • For the memory-pools representing an off-chip device, the "USEMODE_RELATIVE" attribute should be used.

For test purposes with the STM32N6570-DK board, different ready-to-use configurations are provided.

memory-pool desc. file description
stm32n6_reloc.mpool Full memories. All NPU RAMs (AXIRAM3..6), AXIRAM2 and the external RAM/flash are defined. The NPU cache can be used for the external memories.
stm32n6_int_reloc.mpool NPU memories only. Only the NPU RAMs (AXIRAM3..6) are defined.
stm32n6_int2_reloc.mpool Internal memories only. The NPU RAMs (AXIRAM3..6) and AXIRAM2 are defined.
stm32n6_ext.mpool External memories only. Only the external RAM/flash are defined. NPU cache can be used for the external memories.

For information and test purposes, 'non-reloc' memory-pool descriptor files are also provided; they can be used with a normal deployment flow.

Part of the ./test/mpool/stm32n6_reloc.mpool file:

    ...
    {
        "fname": "AXISRAM6",
        "name":  "npuRAM6",
        "fformat": "FORMAT_RAW",
        "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW",
                    "byteWidth": 8, "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
        "offset": { "value": "0x34350000", "magnitude":  "BYTES" },
        "size":   { "value": "448",        "magnitude": "KBYTES" }
    },
    {
        "fname": "xSPI1",
        "name":  "hyperRAM",
        "fformat": "FORMAT_RAW",
        "prop":   { "rights": "ACC_WRITE", "throughput": "MID", "latency": "HIGH",
                    "byteWidth": 2, "freqRatio": 5.00, "cacheable": "CACHEABLE_ON",
                    "read_power": 380, "write_power": 340.0, "constants_preferred": "true" },
        "offset": { "value": "0x90000000", "magnitude":  "BYTES" },
        "size":   { "value": "32",         "magnitude": "MBYTES" },
        "mode":   "USEMODE_RELATIVE"
    },
    {
        "fname": "xSPI2",
        "name":  "octoFlash",
        "fformat": "FORMAT_RAW",
        "prop":   { "rights": "ACC_READ",  "throughput": "MID", "latency": "HIGH", 
                    "byteWidth": 1, "freqRatio": 6.00, "cacheable": "CACHEABLE_ON",
                    "read_power": 110, "write_power": 400.0, "constants_preferred": "true" },
        "offset": { "value": "0x70400000", "magnitude":  "BYTES" },
        "size":   { "value": "64",         "magnitude": "MBYTES" },
        "mode":   "USEMODE_RELATIVE"
    }
    ...

“neural_art.json” file

The "%STEDGEAI_CORE_DIR%/scripts/N6_reloc/test" contains two examples of configuration file using the requested memory-pool descriptor files. They provide different profiles using different memory configurations which are aligned with the generic AI test validation.

profile description
test Default profile. Full memories configuration; the epoch controller is not enabled
test-ec Default profile + support of the epoch controller
test-int Internal profile. NPU memories only configuration; the epoch controller is not enabled
test-int-ec Internal profile + support of the epoch controller
test-ext External profile. External memories only configuration; the epoch controller is not enabled
test-ext-ec External profile + support of the epoch controller

Part of the ./test/neural_art_reloc.json file:

    ...
    "test" : {
        "memory_pool": "./mpools/stm32n6_reloc.mpool",
        "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os
                    --optimization 3 --Oauto-sched --all-buffers-info --csv-file network.csv"
        },
    "test-ec" : {
        "memory_pool": "./mpools/stm32n6_reloc.mpool",
        "options": "--native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os
                    --optimization 3 --Oauto-sched --all-buffers-info --csv-file network.csv --enable-epoch-controller"
        },
    ...

Performance impacts

Accuracy

No difference with the nonrelocatable or static implementation.

Set up and inference time

For the inference time (after the install/init steps), no significant difference is expected versus a nonrelocatable or static implementation. Only the set-up time is impacted to install/create an instance of a given model. The installation time is directly proportional to the number of relocations, and the size of the code/data sections according to the used COPY/XIP mode. Note that if the memory initializers need to be copied into the internal RAMs, this extra time is equivalent to the static implementation.

Install/init time overhead
  • STM32N6570-DK board (DEV mode), overdrive clock setting (NPU 1 GHz, MCU 800 MHz).
  • By default, the external flash is used to store the weights/params.
  • Internal RAMs (AXIRAM3,..6) are used to store the activations.
  • The executable RAM region is located in the internal AXIRAM1 (this is equivalent to the static case where the application is also executed from the AXIRAM1).
yolov5 224 (nano)               | static mode (absolute @ only) | reloc mode (copy)   | reloc mode (xip) (*)
inference time (w/ ec)          | 10.4 ms, 95.45 inf/s          | 10.5 ms, 94.9 inf/s | 10.6 ms, 94.6 inf/s
inference time (w/o ec)         | 13.2 ms, 75.35 inf/s          | 13.0 ms, 77.1 inf/s | 14.3 ms, 69.6 inf/s
install/init time (ms) (w/ ec)  | 0.0 / 0.03                    | 11.7 / 0.73         | 10.9 / 0.93
install/init time (ms) (w/o ec) | 0.0 / 0.03                    | 11.2 / 0.03         | 10.8 / 0.03

(*) With the epoch controller, as the blobs must be updated, they are fetched from the executable RAM region (AXIRAM1). An impact is only observed in the case where the configuration code is fetched from the external flash, that is, without epoch controller support.

For the reloc mode with epoch controller support, the 'install/init' time is mainly due to the copy of the blobs into AXIRAM1 and the relocations required to resolve the weights/params addresses in the different blobs.

Case where only the internal AXIRAMx memories are used for the activations and the weights/params (~2 Mbytes):

yolov5 224 (nano)              | static mode (absolute @ only) | reloc mode (copy)    | reloc mode (xip)
inference time (w/ ec)         | 9.5 ms, 105.3 inf/s           | 9.5 ms, 105.8 inf/s  | 9.6 ms, 104.6 inf/s
install/init time (ms) (w/ ec) | 17+ / 0.03                    | 18.1 / 0.03          | 18.1 / 0.03

The 'install/init' time is similar in all cases. It is mainly represented by the copy of the memory initializers from the flash location to the internal RAMs. No extra relocation for the weights/activations is requested (all weights/activations addresses are absolute).

Memory layout overhead

In comparison with a static implementation, the relocation mode involves two additional sections, GOT/REL, which are used to support the position-independent code/data. The size overhead is directly proportional to the number of relocated references.

LL ATON runtime API extension

To enable the support of a runtime loadable model, the LL_ATON files should be compiled with the following C-define:

LL_ATON_RT_RELOC

The LL_ATON_RT_RELOC C-define activates the code paths and functionalities required to manage and install runtime loadable models. Ensuring that this macro is defined during compilation is crucial for the successful deployment and execution of runtime loadable models.

ll_aton_reloc_install()

int ll_aton_reloc_install(const uintptr_t file_ptr, const ll_aton_reloc_config *config,
                          NN_Instance_TypeDef *nn_instance);

Description

The ll_aton_reloc_install() function acts as a runtime dynamic loader. It is used to install and to create an instance of a memory-mapped runtime loadable module. By providing the model image pointer (file_ptr), configuration details, and neural network instance, users can set up the model for execution. The function performs compatibility checks, initializes memory pools, and installs/relocates code and data sections as needed.

Parameters

  • file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image to be installed.
  • config: A pointer to an ll_aton_reloc_config structure. This parameter provides the configuration details for how the model should be installed, including memory addresses and sizes.
  • nn_instance: A pointer to an NN_Instance_TypeDef structure. This parameter is updated to handle the installed model, creating an instance of the neural network.

Return Value

  • The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the installation process.

Steps Executed

  1. Checking step
    • This step checks the compatibility of the binary object against the runtime environment (static part of the firmware). The main points checked include:
      • The version and content of the binary file header.
      • MCU type and whether the FPU (floating-point unit) is enabled (context/setting of the caller is used).
      • Secure or nonsecure context.
      • Whether the binary module has been compiled with the LL_ATON_EB_DBG_INFO or LL_ATON_RT_ASYNC C-defines.
      • Version of the used LL_ATON files.
  2. Memory-Pool initialization step
    • If requested, this step initializes the used memory regions for the given model. Specifically:
      • If read/write memory pools handle the params/weights section, the associated memory region is initialized with values from the params/weights section.
      • Optionally, if the AI_RELOC_RT_LOAD_MODE_CLEAR flag is set, the read/write memory region handling the activations is cleared.
  3. Code/Data installation and relocation step
    • According to the AI_RELOC_RT_LOAD_MODE_COPY or AI_RELOC_RT_LOAD_MODE_XIP flag:
      • The code/data sections are copied into the executable RAM region.
      • The relocation process is performed to update references.
      • The callbacks are registered.

ll_aton_reloc_config C-struct

The purpose of the ll_aton_reloc_config C structure is to provide the parameters required to install a runtime loadable model.

typedef struct _ll_aton_reloc_config {
    uintptr_t exec_ram_addr;  /* base@ of the exec memory region to place the relocatable code/data (8-Bytes aligned) */
    uint32_t exec_ram_size;   /* max size in byte of the exec memory region */
    uintptr_t ext_ram_addr;   /* base@ of the external memory region to place the external pool (if requested) */
    size_t ext_ram_size;      /* max size in byte of the external memory region */
    uintptr_t ext_param_addr; /* base@ of the param memory region (if requested) */
    uint32_t mode;
  } ll_aton_reloc_config;
  • 'exec_ram_addr'/'exec_ram_size': These members indicate the base address (8-byte aligned) and the maximum size of the read/write executable RAM memory region. These parameters are mandatory. To determine the required size at runtime, the ll_aton_reloc_get_info function can be used.
  • 'ext_ram_addr'/'ext_ram_size': These members indicate the base address (8-byte aligned) and the maximum size of the read/write external RAM memory region, if requested. To determine the required size at runtime, the ll_aton_reloc_get_info function can be used.
  • 'ext_param_addr': This member indicates the base address (8-byte aligned) of the memory region containing the parameters/weights of the deployed model. This option is required when the --split option is used; otherwise, it must be set to 0 (NULL).
  • 'mode': This member indicates the expected execution mode. Or-ed flags can be used. The AI_RELOC_RT_LOAD_MODE_XIP or AI_RELOC_RT_LOAD_MODE_COPY flag is mandatory. The AI_RELOC_RT_LOAD_MODE_CLEAR flag is optional.
mode description
AI_RELOC_RT_LOAD_MODE_XIP XIP (Execute In Place) execution mode
AI_RELOC_RT_LOAD_MODE_COPY COPY execution mode
AI_RELOC_RT_LOAD_MODE_CLEAR Reset the used activation memory regions

ll_aton_reloc_set_callbacks()

int ll_aton_reloc_set_callbacks(const NN_Instance_TypeDef *nn_instance, const struct ll_aton_reloc_callback *cbs);

Description

The ll_aton_reloc_set_callbacks function is used to overwrite the default registration of the callbacks done in the ll_aton_reloc_install function. This function is optional.

Callback services description
assert/lib error to implement the management of the errors generated by the embedded LL ATON functions
NPU/MCU cache maintenance operations to implement the NPU/MCU cache maintenance operations
LL_ATON_LIB_xxx to implement the LL ATON LIB services to support the hybrid epochs

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef).
  • cbs: A pointer to a ll_aton_reloc_callback structure (see ll_aton_reloc_network.h file).

Return Value

  • The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the operation.

ll_aton_reloc_get_info()

int ll_aton_reloc_get_info(const uintptr_t file_ptr, ll_aton_reloc_info *rt);

Description

The ll_aton_reloc_get_info function is used to obtain the main dimensioning information from the image of a runtime loadable model. This information can include details such as the size, memory requirements, and other relevant attributes of the model. By providing a pointer to the model image and a reference to an ll_aton_reloc_info structure, users can retrieve and store the necessary information to properly configure and manage the runtime loadable model.

This function is particularly useful for setting up the memory regions and ensuring that the model can be correctly loaded and executed within the available resources.

Parameters

  • file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image from which the information will be retrieved.
  • rt: A pointer to an ll_aton_reloc_info structure. This parameter is used to store the retrieved dimensioning information of the runtime loadable model. The function fills this structure with the relevant details.

Return Value

  • The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the operation.

ll_aton_reloc_info C-struct

  typedef struct _ll_aton_reloc_info
  {
    const char *c_name;          /* c-name of the model */
    uint32_t variant;            /* 32-b word to handle the reloc rt version,
                                    the used ARM Embedded compiler,
                                    Cortex-Mx (CPUID) and if the FPU is requested */
    uint32_t code_sz;            /* size of the code (header + txt + rodata + data + got + rel sections) */
    uint32_t params_sz;          /* size (in bytes) of the weights */
    uint32_t acts_sz;            /* minimum requested RAM size (in bytes) for the activations buffer */
    uint32_t ext_ram_sz;         /* requested external ram size for the activations (and params) */
    uint32_t rt_ram_xip;         /* minimum requested RAM size to install it, XIP mode */
    uint32_t rt_ram_copy;        /* minimum requested RAM size to install it, COPY mode */
    const char *rt_version_desc; /* rt description */
    uint32_t rt_version;         /* rt version */
    uint32_t rt_version_extra;   /* rt version extra */
  } ll_aton_reloc_info;
member description
c_name indicates the name of the model.
variant or-ed 32-bit value indicating the reloc runtime version, the used Arm Embedded compiler, the CPUID of the Cortex®-M and whether the FPU is requested (see the ll_aton_reloc_network.h file)
code_sz size in bytes of all code/data sections representing the model: header+txt+rodata+data+got+rel sections
params_sz total size (in bytes) of the params/weights section
acts_sz total size (in bytes) of the activations
ext_ram_sz requested size (in bytes) of the external RAM memory
rt_ram_xip requested size (in bytes) of read/write execution memory region (XIP mode)
rt_ram_copy requested size (in bytes) of read/write execution memory region (COPY mode)
rt_version_desc (debug info) string describing the used LL runtime version
rt_version LL runtime version: major << 24 | minor << 16 | sub << 8
rt_version_extra (debug info) extra dev version value
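
For illustration, a small sketch (assuming a printf-style console is available on the target) that dumps the main fields and decodes the packed rt_version value according to the 'major << 24 | minor << 16 | sub << 8' layout described above:

#include <stdio.h>
#include "ll_aton_reloc_network.h"

/* Sketch: dump the header information of a flashed relocatable model. */
static void user_dump_reloc_info(const uintptr_t file_ptr)
{
  ll_aton_reloc_info rt;

  if (ll_aton_reloc_get_info(file_ptr, &rt) != 0)
    return;

  printf("model '%s' (%s)\n", rt.c_name, rt.rt_version_desc);
  printf(" rt version : %lu.%lu.%lu\n",
         (unsigned long)(rt.rt_version >> 24) & 0xFFul,
         (unsigned long)(rt.rt_version >> 16) & 0xFFul,
         (unsigned long)(rt.rt_version >> 8) & 0xFFul);
  printf(" params=%lu acts=%lu ext_ram=%lu bytes\n",
         (unsigned long)rt.params_sz, (unsigned long)rt.acts_sz, (unsigned long)rt.ext_ram_sz);
  printf(" exec RAM   : XIP=%lu COPY=%lu bytes\n",
         (unsigned long)rt.rt_ram_xip, (unsigned long)rt.rt_ram_copy);
}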

ll_aton_reloc_get_mem_pool_desc()

ll_aton_reloc_mem_pool_desc *ll_aton_reloc_get_mem_pool_desc(const uintptr_t file_ptr, int index);

Description

The ll_aton_reloc_get_mem_pool_desc function allows users to obtain information about the parts of the memory pools used for a given model. By providing a pointer to the model image and an index, users can retrieve the necessary information through the returned ll_aton_reloc_mem_pool_desc structure.

Parameters

  • file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image from which the information will be retrieved.
  • index: Index of the requested descriptor.

Return Value

  • The function returns a pointer to a ll_aton_reloc_mem_pool_desc object. If the specified index is out of range, the function may return NULL.

Example

A typical code snippet to display the memory-pool C-descriptors ('bin' being the memory-mapped address where the relocatable model is stored):

  ll_aton_reloc_mem_pool_desc *mem_c_desc;
  int index = 0;

  while ((mem_c_desc = ll_aton_reloc_get_mem_pool_desc((uintptr_t)bin, index)))
  {
    printf(" %d: flags=%x foff=%d dst=%x s=%d\n", index, mem_c_desc->flags,
           mem_c_desc->foff, mem_c_desc->dst, mem_c_desc->size);
    index++;
  }

ll_aton_reloc_mem_pool_desc C-struct

typedef struct _ll_aton_reloc_mem_pool_desc
{
  const char *name; /* name */
  uint32_t flags;   /* type definition: 32b:4x8b <type><data_type><reserved><id> */
  uint32_t foff;    /* offset in the binary file */
  uint32_t dst;     /* dst @ */
  uint32_t size;    /* real size */
} ll_aton_reloc_mem_pool_desc;

The AI_RELOC_MPOOL_GET_XXX(flags) macros (see the ll_aton_reloc_network.h file) can be used to retrieve the attributes of the memory pool.

ll_aton_reloc_get_input/output_buffers_info()

const LL_Buffer_InfoTypeDef *ll_aton_reloc_get_input_buffers_info(const NN_Instance_TypeDef *nn_instance,
                                                                  int32_t num);
const LL_Buffer_InfoTypeDef *ll_aton_reloc_get_output_buffers_info(const NN_Instance_TypeDef *nn_instance,
                                                                   int32_t num);

Description

The ll_aton_reloc_get_input_buffers_info and ll_aton_reloc_get_output_buffers_info functions are used to obtain information about a specific input/output buffer of a neural network instance. This can be useful for understanding the structure and requirements of the input/output data of the neural network. By providing the neural network instance and the index of the desired buffer, users can retrieve detailed information about it, such as its size, type, and memory location.

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer information is to be retrieved.
  • num: An integer specifying the index of the input/output buffer whose description is to be retrieved. The index is zero-based: num = 0 refers to the first buffer, num = 1 to the second buffer, and so on.

Return Value

  • The function returns a pointer to a LL_Buffer_InfoTypeDef structure, which contains the description of the specified input/output buffer. If the specified buffer index is out of range, the function may return NULL.

ll_aton_reloc_set_input/output()

LL_ATON_User_IO_Result_t ll_aton_reloc_set_input(const NN_Instance_TypeDef *nn_instance, uint32_t num, void *buffer,
                                                 uint32_t size);
LL_ATON_User_IO_Result_t ll_aton_reloc_set_output(const NN_Instance_TypeDef *nn_instance, uint32_t num, void *buffer,
                                                  uint32_t size);

Description

Both ll_aton_reloc_set_input and ll_aton_reloc_set_output functions are used to configure the address of the input and output buffers for a neural network instance, respectively. By providing the neural network instance, buffer index, buffer pointer, and buffer size, users can set up the necessary memory regions for input and output data.

Warning

These functions should only be used when the deployed model is generated with the '--no-inputs-allocation' and/or '--no-outputs-allocation' options, respectively.

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer is to be set.
  • num: An unsigned integer specifying the index of the input/output buffer to be set. The index is zero-based.
  • buffer: A pointer to the buffer that will hold the input/output data. This parameter specifies the memory location where the data is stored.
  • size: An unsigned integer specifying the size of the input/output buffer in bytes.

Return Value

  • The function returns a value of type LL_ATON_User_IO_Result_t. This return value indicates the result of the operation, such as success or an error code.
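
A minimal usage sketch, assuming the deployed model was generated with the '--no-inputs-allocation'/'--no-outputs-allocation' options and that the buffer sizes and buffers are application-defined (the values used here are hypothetical):

#include <stdint.h>
#include "ll_aton_reloc_network.h"

#define MY_INPUT_SIZE   (224 * 224 * 3)   /* hypothetical, model dependent */
#define MY_OUTPUT_SIZE  (1000)            /* hypothetical, model dependent */

static uint8_t my_input[MY_INPUT_SIZE];   /* application-allocated I/O buffers */
static uint8_t my_output[MY_OUTPUT_SIZE];

void user_bind_io(const NN_Instance_TypeDef *instance)
{
  LL_ATON_User_IO_Result_t res_in, res_out;

  res_in  = ll_aton_reloc_set_input(instance, 0, my_input, sizeof(my_input));
  res_out = ll_aton_reloc_set_output(instance, 0, my_output, sizeof(my_output));

  /* res_in/res_out report the result of each operation (success or error code) */
  (void)res_in;
  (void)res_out;
}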

ll_aton_reloc_get_input/output()

void *ll_aton_reloc_get_input(const NN_Instance_TypeDef *nn_instance, uint32_t num);
void *ll_aton_reloc_get_output(const NN_Instance_TypeDef *nn_instance, uint32_t num);

Description

Both ll_aton_reloc_get_input and ll_aton_reloc_get_output functions are used to retrieve pointers to the input and output buffers for a neural network instance, respectively. By providing the neural network instance and buffer index, users can obtain direct access to the memory regions used for input and output data.

Warning

These functions should only be used when the deployed model is generated with the '--no-inputs-allocation' and/or '--no-outputs-allocation' options, respectively.

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer pointer is to be retrieved.
  • num: An unsigned integer specifying the index of the input/output buffer to be retrieved. The index is zero-based.

Return Value

  • The function returns a pointer to the requested input/output buffer. If the specified buffer index is out of range or an error occurs, the function may return NULL.
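
A short usage sketch, assuming a single input and a single output tensor and the nn_instance handle installed as in the "Minimal code" section:

/* Sketch: retrieve direct pointers to the I/O buffers of the installed instance. */
uint8_t *in  = (uint8_t *)ll_aton_reloc_get_input(&nn_instance, 0);
uint8_t *out = (uint8_t *)ll_aton_reloc_get_output(&nn_instance, 0);

if ((in != NULL) && (out != NULL)) {
  /* fill 'in' with the pre-processed data, run the inference (see ai_run()),
     then read the predictions from 'out' */
}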