ST Edge AI Core

Relocatable binary model support

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r2.5

Introduction

What is a relocatable binary model?

A relocatable binary model designates a binary object that can be installed and executed anywhere in the STM32 memory sub-system. It contains a compiled version of the generated NN C-files, including the requested forward kernel functions and the weights. The principal objective is to provide a flexible way to upgrade an AI-based application without regenerating and flashing the whole end-user firmware. This is a key enabler for technologies such as FOTA (Firmware Over-The-Air) updates.

The generated binary object is a lightweight plug-in. It can run from any address (position-independent code) and have its data placed anywhere in memory (position-independent data). A simple and efficient AI relocatable run-time allows it to be instantiated and used. No complex, resource-consuming dynamic linker for ARM Cortex-M MCUs is embedded in the STM32 firmware: the generated object is a self-contained entity, and no external symbols or functions are required at run-time.

Relocatable binary object

In this article, the “static” approach designates the case where the generated NN C-files are compiled and linked with the end-user application stack.

Limitations

  • No support for managing the state of Keras stateful LSTM/GRU layers
  • No support for the STM32 series based on the Cortex-M0 or Cortex-M0+ core (STM32L0, STM32F0, STM32G0)
  • Initial support for custom layers: only self-contained C-files are supported. Lambda layers are supported.

Comparison with TF Lite for micro-controllers solution

The TF Lite for micro-controllers framework also provides a way to upgrade an AI-based application. The TFLite converter utility allows a network and its associated parameters to be deployed through a simple container: the TFLite file. Based on the flat-buffer technology, it is interpreted at run-time to create an executable instance. The main difference is that the code of the forward kernel functions and the associated interpreter must already be available in the initial firmware image. With the X-CUBE-AI relocatable solution, the code of the kernels is also embedded in the container.

Getting started

Generating a relocatable binary model

To build a relocatable binary file for a given STM32 series, the --relocatable/-r option is used with the generate command. Note that the specific options to compress the weights or to place the IO buffers in the activations buffer should be applied exactly as for the “standard” approach.

Note

A GNU ARM Embedded tool-chain (arm-none-eabi- prefix) should be available in the PATH before launching the command. The optional --lib argument indicates the root location of the relocatable network runtime libraries from the installed pack. Default value: $X_CUBE_AI_DIR/Middlewares/ST/AI.

$ stedgeai generate -m <model_file_path> <gen_options> --relocatable --target stm32f4

ST Edge AI Core v1.0.0

Generating files for relocatable binary model..
...

 Runtime memory layout (series="stm32f4")
 ------------------------------------------------------------------
 section      size (bytes)
 ------------------------------------------------------------------
 header                100*
 txt                22,952                    network+kernel
 rodata                224                    network+kernel
 data                1,812                    network+kernel
 bss                   116                    network+kernel
 got                   116*
 rel                   516*
 weights            16,688                    network
 ------------------------------------------------------------------
 FLASH size         41,676 + 732* (+1.76%)
 RAM size**          1,928 + 116* (+6.02%)
 ------------------------------------------------------------------
 bin size           42,412                    binary image
 act. size          12,004                    activations buffer
 ------------------------------------------------------------------
 (*)  extra bytes for relocatable support
 (**) Full RAM = RAM + act. + IO if not allocated in activations buffer

Generated files (12)
--------------------------------------------------------------------------------
<output-directory-path>\network_rel.bin
<output-directory-path>\network_img_rel.c
<output-directory-path>\network_img_rel.h
<output-directory-path>\network_config.h
<output-directory-path>\network.h
<output-directory-path>\network.c
<output-directory-path>\network_data_params.h
<output-directory-path>\network_data_params.c
<output-directory-path>\network_data.h
<output-directory-path>\network_data.c
<output-directory-path>\ai_reloc_network.h
<output-directory-path>\ai_reloc_network.c

Creating report file <output-directory-path>\network_generate_report.txt

Supported STM32 series

supported series description
stm32f4/stm32f3/stm32g4/stm32wb default series. All STM32F4xx/STM32F3xx/STM32G4xx devices with an ARM Cortex-M4 core and FPU support enabled (single precision).
stm32l4/stm32l4r all STM32L4xx/STM32L4Rxx devices with an ARM Cortex-M4 core and FPU support enabled (single precision).
stm32l5/stm32u5 all STM32L5xx/STM32U5xx devices with an ARM Cortex-M33 core and FPU support enabled (single precision).
stm32f7 all STM32F7xx devices with an ARM Cortex-M7 core and FPU support enabled (single precision).
stm32h7 all STM32H7xx devices with an ARM Cortex-M7 core and FPU support enabled (double precision).

To indicate the root location of the STM32 libraries:

$ stedgeai generate -m <model_file_path> <gen_options> --relocatable --target stm32h7 \
                    --lib $X_CUBE_AI_DIR/Middlewares/ST/AI

Generated files

file description
<network>_rel.bin main binary file (i.e. the “relocatable binary model”). It contains the compiled version of the model, including the used forward kernel functions and, by default, the weights. It also embeds the additional sections (.header/.got/.rel) needed to install the model. Note that if the --binary option flag is used, the weights are generated in a separate file: <network>_data.bin.
ai_reloc_network.c/.h AI relocatable runtime API files. They are required to use the relocatable binary model.
<network>.c/.h DEBUG/TEST purpose - generated network C-files which are used to generate the relocatable binary model
<network>_data.c/.h DEBUG/TEST purpose - generated network data C-files which are used to generate the relocatable binary model. <network>_data.c is an empty file.
<network>_img_rel.c/.h DEBUG/TEST purpose - they facilitate the deployment of the relocatable binary model in a test framework like the aiSystemPerformance/aiValidation applications. They contain additional macros and a C byte array (an image of the binary file) which can be used by the AI relocatable runtime API to install and use the model. The --no-c-files option flag can be used to avoid generating these additional files.

All files are generated in the <output-directory-path> directory.

Memory layout information

The reported memory layout information complements the ROM and RAM memory size metrics (refer to the “Memory-related metrics” section) with the AI memory resources, including the specific AI code and data sections that are required to run the AI stack. Apart from the additional sections used to manage the relocatable binary model, the sizes of the other sections are the same as for the “static” code generation approach, where the NN C-files are compiled and linked with the code of the application. Only the size required for the IO buffers is not reported here.

MCU AI memory layout

The following table summarizes the differences in terms of memory layout (in bytes) between the static and relocatable approaches. The sizes of the network/kernel sections depend on the topology complexity (number of nodes) and on the different forward kernel functions that are used. Activations and weights are always the same.

AI object static reloc typically placed in
activations 192 192 RAM type (rw), .bss section
weights 15,560 15,560 FLASH type (ro), .rodata section
network/kernels (FLASH) 25,308 26,024 FLASH type (rx), .txt\.rodata\(.data) section
network/kernels (RAM) 1,888 1,996 RAM type (rw), .data\.bss section

Upgrading the firmware image

This step is out of the scope of this article; the underlying process is fully application-dependent. For the next steps, the code snippets expect that the image (<network>_rel.bin) has been flashed in a memory-mapped region.

To use a relocatable binary model, a specific AI relocatable runtime API (a simple C-file) is required to install and use it. It is available in the X-CUBE-AI pack and should be integrated during the generation of the firmware. Note that only the AI runtime header files are required; the network_runtime.a library is not necessary.

CFLAGS += -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16  -mfloat-abi=hard

C_SOURCES += $X_CUBE_AI_DIR/Middlewares/ST/AI/Reloc/Src/ai_reloc_network.c

CFLAGS += -I$X_CUBE_AI_DIR/Middlewares/ST/AI/Inc
CFLAGS += -I$X_CUBE_AI_DIR/Middlewares/ST/AI/Reloc/Inc

Creating an instance

After the model (binary file) has been updated inside the STM32 device at the address BIN_ADDRESS, the following code sequence can be used to create and install an instance of the generated model.

#include <ai_reloc_network.h>

ai_error err;
ai_rel_network_info rt_info;

err = ai_rel_network_rt_get_info(BIN_ADDRESS, &rt_info);

This retrieves part of the meta-information embedded in the header of the binary.

...
printf("Load a relocatable binary model, located at the address 0x%08x\r\n",
       (int)BIN_ADDRESS);
printf(" model name                : %s\r\n", rt_info.c_name);
printf(" weights size              : %d bytes\r\n", (int)rt_info.weights_sz);
printf(" activations size          : %d bytes (minimum)\r\n", (int)rt_info.acts_sz);
printf(" compiled for a Cortex-Mx  : 0x%03X\r\n",
         (int)AI_RELOC_RT_GET_CPUID(rt_info.variant));
printf(" FPU should be enabled     : %s\r\n",
         AI_RELOC_RT_FPU_USED(rt_info.variant)?"yes":"no");
printf(" RT RAM minimum size       : %d bytes (%d bytes in COPY mode)\r\n",
        (int)rt_info.rt_ram_xip,
        (int)rt_info.rt_ram_copy);
...

To create an executable instance of the C-model, a dedicated memory buffer (also called AI RT RAM) should be provided. The minimum required size is model- and execution-mode-dependent. For the XIP execution mode (AI_RELOC_RT_LOAD_MODE_XIP), only a buffer (a RW memory-mapped region) for the data sections is required (minimum size = rt_info.rt_ram_xip). Note that the allocated buffer should be 4-byte aligned. For the COPY execution mode (AI_RELOC_RT_LOAD_MODE_COPY), a minimum size of rt_info.rt_ram_copy is required so that the code sections can also be copied. In this last case, the provided memory region should be executable.

ai_error err;
ai_handle net = AI_HANDLE_NULL;

uint8_t *rt_ai_ram = malloc(rt_info.rt_ram_xip);

err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ai_ram, rt_info.rt_ram_xip,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net);

Before installing and setting up the instance, the compatibility between the STM32 platform and the provided binary is verified: the Cortex-Mx ID is confirmed, as well as the FPU being enabled (if requested by the binary). If all is OK, an instance of the model is ready to be initialized and a handle is returned (net parameter).

As for the “static” approach, the next step is to complete the internal data structure with the activations buffer and the weights buffer. Only the addresses of the associated buffers should be provided. If the weights are loaded as a separate file (--binary option flag), WEIGHTS_ADDRESS indicates the location where the weights have been placed.

ai_handle weights_addr;
ai_bool res;

uint8_t *act_addr = malloc(rt_info.acts_sz);

if (rt_info.weights)
  weights_addr = rt_info.weights;
else
  weights_addr = WEIGHTS_ADDRESS;

res = ai_rel_network_init(net, &weights_addr, (const ai_handle *)&act_addr);

At this stage, the instance is fully ready to be used. To retrieve all the attributes of the instantiated model, the ai_rel_network_get_info() function can be used.

ai_bool res;
ai_network_report net_info;

res = ai_rel_network_get_info(net, &net_info);

To avoid allocating the model-dependent memory regions through the system heap, a pre-allocated memory region can be used (at the AI_RT_ADDR address).

rt_ai_ram = (uint8_t *)AI_RT_ADDR;
act_addr = rt_ai_ram + AI_RELOC_ROUND_UP(rt_info.rt_ram_xip);

Note

Whereas the “static” approach allows only one instance at a time, there is no limitation here on the number of instances created from the same generated model. Each instance can be created with its own AI RT RAM area and is initialized with its own activations buffer, so concurrent use cases can be implemented without specific synchronization.

Running an inference

The function to run an inference is fully equivalent to the “static” case. The following code snippet illustrates the case where the generated model is defined with a single input and a single output tensor.

static int ai_run(void *data_in, void *data_out)
{
  ai_i32 batch;

  ai_buffer *ai_input = net_info.inputs;
  ai_buffer *ai_output = net_info.outputs;

  ai_input[0].data = AI_HANDLE_PTR(data_in);
  ai_output[0].data = AI_HANDLE_PTR(data_out);

  batch = ai_rel_network_run(net, ai_input, ai_output);
  if (batch != 1) {
    ai_log_err(ai_rel_network_get_error(net),
        "ai_rel_network_run");
    return -1;
  }

  return 0;
}

Tip

Properties of the input or output tensors are fully accessible through the ai_network_report struct, as for the “static” approach (refer to the “IO tensor description” [API] section). Payloads can be allocated in the activations buffer without restriction.

Generation flow

The following figure illustrates the flow used to generate a relocatable binary model. The first step, importing and generating the NN C-files, is the same as in the “static” approach; only the <network>_data.c/.h files are not fully generated. The second step compiles and links the generated NN C-files against a specific AI runtime library. This library is simply compiled with the relocatable options and embeds the required mathematical and memcpy/memset functions. The last post-processing step generates the binary file by appending a specific section (.rel section) and various pieces of information which will be used by the AI relocatable run-time API. The weights are appended as a .weights binary section at the end of the file.

Generation of the relocatable binary model

Note

The code can only be compiled with a GCC ARM Embedded tool-chain. It is compiled with the -fpic and -msingle-pic-base options; the ARM Cortex-M r9 register is used as the platform register for the Global Offset Table (GOT). The AI relocatable run-time is in charge of updating the r9 register before calling the code. The generated relocatable binary object is independent of the ARM embedded tool-chain used to build the end-user application. Consequently, for the same memory placement and HW settings, the inference time will be the same.

AI run time execution modes

XIP execution mode

This execution mode is the default use case, where the code and the weights sections are stored in the STM32 embedded/internal flash. In terms of memory placement, this is similar to the “static” approach.

XIP execution mode

COPY execution mode

This alternative execution mode should be considered when the weights must be placed in an external memory device because they do not fit in the internal/embedded STM32 flash. Copying the code from a non-efficient executable memory region to a low-latency executable region can significantly improve the inference time. Note that the required AI RT RAM size is larger, and that the associated memory region should be executable. Another drawback, for Cortex-M4 based architectures (no core I/D cache available), is the contention due to code and data memory accesses, which can degrade performance. To avoid this, the next use case should be considered.

COPY execution mode

XIP execution mode and separated weight binary file

This mode is an optimal case where the weights must be stored in an external memory device (<network>_data.bin file). It implies having a second internal/embedded flash region to store the code (<network>_rel.bin file). In this case, the critical code is executed in place. The drawback is having to manage two binary files for the upgrade.

XIP execution mode with a separated weight binary file

AI relocatable run-time API

The proposed API (called the AI relocatable run-time API) to manage the relocatable binary model is comparable to the Embedded inference client API (refer to [[API]][X_CUBE_AI_API]) for the “standard” approach. Only the create and initialize functions have been enhanced to take the specificities into account. All functions are prefixed by ai_rel_network_ and they do not depend on the C-name of the model. They are defined and implemented in the ai_reloc_network.c/.h files ($X_CUBE_AI_DIR/Middlewares/ST/AI/Reloc/ folder).

ai_rel_network_rt_get_info()

ai_error ai_rel_network_rt_get_info(const void* obj, ai_rel_network_info* rt);

Retrieves the dimensioning information needed to instantiate a relocatable binary model.

  • An {AI_ERROR_INVALID_HANDLE, AI_ERROR_CODE_INVALID_PTR} error is returned if the referenced object is not valid (i.e. invalid signature, or the address is not aligned on 4 bytes).

The following table describes the different fields available in the returned ai_rel_network_info C-struct.

field description
c_name pointer to the user C-name of the generated model (debug purpose)
variant 32-bit word. Handles the AI RT version, requested Cortex-M ID,…
code_sz code size in bytes (without the weights section)
weights\weights_sz address/size (in bytes) of the weights section, if available
acts_sz required activations size (RAM metric) to run the model
rt_ram_xip required RAM size (in bytes) to install the model in XIP mode
rt_ram_copy required RAM size (in bytes) to install the model in COPY mode

ai_rel_network_load_and_create()

ai_error ai_rel_network_load_and_create(const void* obj, ai_handle ram_addr,
    ai_size ram_size, uint32_t mode, ai_handle* hdl);
ai_handle ai_rel_network_destroy(ai_handle hdl);

Creates and installs an instance of the relocatable binary model (referenced by obj). A RW memory buffer (ram_addr/ram_size) should be provided to create the data sections (.data/.bss/.got) and to fix/resolve the internal references during the relocation process. The mode argument indicates the expected execution mode. The expected size of the AI RT RAM buffer can be retrieved with the ai_rel_network_rt_get_info() function.

mode description
AI_RELOC_RT_LOAD_MODE_XIP XIP execution mode is requested
AI_RELOC_RT_LOAD_MODE_COPY COPY execution mode is requested
  • ai_handle references a run-time context (an opaque object) which must be used with the other functions.
  • before creating the instance, the Cortex-M ID is verified. If requested by the binary, the function also checks that the FPU is enabled.

Note

If ram_addr and/or ram_size is NULL, a default allocation is done through the system heap. This behavior can be overridden in the ai_reloc_network.c file; see the AI_RELOC_MALLOC macro definition.

ai_rel_network_init()

ai_bool ai_rel_network_init(ai_handle hdl, const ai_handle *weights,
    const ai_handle *act);

Finalizes the initialization of the instance with the addresses of the weights and the activations buffer.

  • if the weights are stored in the relocatable binary object, ai_rel_network_rt_get_info() should be used to retrieve their address.
  • as for the “static” approach, one (or multiple) activations buffer(s) should also be provided.
ai_handle weights_addr;
ai_handle activations;
...
const ai_handle acts[] = { activations };
res = ai_rel_network_init(net, &weights_addr, acts);
...

ai_rel_network_get_info()

ai_bool ai_rel_network_get_info(ai_handle hdl, ai_network_report* report);

Retrieves the run-time data attributes of an instantiated model. Refer to the ai_platform.h file for the details of the returned ai_network_report C-struct. It should be called after ai_rel_network_init().

ai_rel_network_get_error()

ai_error ai_rel_network_get_error(ai_handle hdl);

Returns the first error reported during the execution of an ai_rel_network_xxx() function.

  • see the ai_platform.h file for the list of the returned error types (ai_error_type) and associated codes (ai_error_code).

ai_rel_network_run()

ai_i32 ai_rel_network_run(ai_handle hdl, const ai_buffer* input, ai_buffer* output);

Performs one (or more) inferences. The input and output buffer parameters (ai_buffer type) allow the input tensors to be provided and the predicted output tensors to be stored, respectively (refer to the [“IO tensor description” [API]][API_io_tensor] section).

ai_rel_platform_observer_register()

ai_bool ai_rel_platform_observer_register(ai_handle hdl,
    ai_observer_node_cb cb, ai_handle cookie, ai_u32 flags);
ai_bool ai_rel_platform_observer_unregister(ai_handle hdl,
    ai_observer_node_cb cb, ai_handle cookie);
ai_bool ai_rel_platform_observer_node_info(ai_handle hdl,
    ai_observer_node *node_info);

As for the “static” approach, these functions allow registering a user callback to be notified before and/or after the execution of a C-node. There is no restriction on the usage of the Platform Observer API with a relocatable binary model.