Relocatable binary model support
for STM32 target, based on ST Edge AI Core Technology 2.2.0
r2.5
Introduction
What is a relocatable binary model?
A relocatable binary model designates a binary object which can be installed and executed anywhere in the STM32 memory sub-system. It contains a compiled version of the generated NN C-files, including the requested forward kernel functions and the weights. The principal objective is to provide a flexible way to upgrade an AI-based application without regenerating and flashing the whole end-user firmware. This is, for example, the primary building block for the FOTA (Firmware Over-The-Air) technology.
The generated binary object is a lightweight plug-in. It can run from any address (position-independent code) and have its data anywhere in memory (position-independent data). A simple and efficient AI relocatable run-time allows to instantiate and use it. No complex and resource-consuming dynamic linker for ARM Cortex-M MCUs is embedded in the STM32 firmware: the generated object is a self-contained entity, and no external symbols/functions are requested at run-time.
In this article, the “static” approach designates the case where the generated NN C-files are compiled and linked with the end-user application stack.
Limitations
- No support to manage the states of Keras stateful LSTM/GRU layers
- No support for the STM32 series with a Cortex-M0 or Cortex-M0+ core (STM32L0, STM32F0, STM32G0)
- Initial support for custom layers: only self-contained C-files are supported. Lambda layers are supported.
Comparison with TF Lite for micro-controllers solution
The TF Lite for micro-controllers framework also provides a way to upgrade an AI-based application. The TFLite converter utility allows to deploy a network and its associated parameters through a simple container: the TFLite file. Based on the flat buffer technology, it is interpreted at run-time to create an executable instance. The main difference is that the code of the forward kernel functions and the associated interpreter must already be available in the initial firmware image. With the X-CUBE-AI relocatable solution, the code of the kernels is also embedded in the container.
Getting started
Generating a relocatable binary model
To build a relocatable binary file for a given STM32 series, the --relocatable/-r option is used with the generate command. Pay attention that the specific options, to compress the weights or to place the IO buffers in the activations buffer, should always be applied as for the “standard” approach.
Note
A GNU ARM Embedded tool-chain (arm-none-eabi- prefix) should be available in the PATH before launching the command. The optional --lib argument indicates the root location of the relocatable network runtime libraries from the installed pack. Default value: $X_CUBE_AI_DIR/Middlewares/ST/AI.
$ stedgeai generate -m <model_file_path> <gen_options> --relocatable --target stm32f4

ST Edge AI Core v1.0.0

Generating files for relocatable binary model..
...
Runtime memory layout (series="stm32f4")
------------------------------------------------------------------
section        size (bytes)
------------------------------------------------------------------
header            100*
txt            22,952     network+kernel
rodata            224     network+kernel
data            1,812     network+kernel
bss               116     network+kernel
got               116*
rel               516*
weights        16,688     network
------------------------------------------------------------------
FLASH size     41,676 + 732*  (+1.76%)
RAM size **     1,928 + 116*  (+6.02%)
------------------------------------------------------------------
bin size       42,412     binary image
act. size      12,004     activations buffer
------------------------------------------------------------------
(*) extra bytes for relocatable support
(**) Full RAM = RAM + act. + IO if not allocated in activations buffer
Generated files (12)
--------------------------------------------------------------------------------
<output-directory-path>\network_rel.bin
<output-directory-path>\network_img_rel.c
<output-directory-path>\network_img_rel.h
<output-directory-path>\network_config.h
<output-directory-path>\network.h
<output-directory-path>\network.c
<output-directory-path>\network_data_params.h
<output-directory-path>\network_data_params.c
<output-directory-path>\network_data.h
<output-directory-path>\network_data.c
<output-directory-path>\ai_reloc_network.h
<output-directory-path>\ai_reloc_network.c
Creating report file <output-directory-path>\network_generate_report.txt
Supported STM32 series
supported series | description |
---|---|
stm32f4 / stm32f3 / stm32g4 / stm32wb | default series. All STM32F4xx/STM32F3xx/STM32G4xx/STM32WBxx devices with an ARM Cortex-M4 core and FPU support enabled (single precision). |
stm32l4 / stm32l4r | all STM32L4xx/STM32L4Rxx devices with an ARM Cortex-M4 core and FPU support enabled (single precision). |
stm32l5 / stm32u5 | all STM32L5xx/STM32U5xx devices with an ARM Cortex-M33 core and FPU support enabled (single precision). |
stm32f7 | all STM32F7xx devices with an ARM Cortex-M7 core and FPU support enabled (single precision). |
stm32h7 | all STM32H7xx devices with an ARM Cortex-M7 core and FPU support enabled (double precision). |
To indicate the root location of the STM32 libraries:
$ stedgeai generate -m <model_file_path> <gen_options> --relocatable --series stm32h7 \
     --lib $X_CUBE_AI_DIR/Middlewares/ST/AI
Generated files
file | description |
---|---|
<network>_rel.bin | main binary file (i.e. the “relocatable binary model”). It contains the compiled version of the model, including the used forward kernel functions and, by default, the weights. It also embeds the additional sections (.header/.got/.rel) needed to install the model. Note that if the --binary option flag is used, the weights are generated in a separate file: <network>_data.bin. |
ai_reloc_network.c/.h | AI relocatable runtime API files. They are requested to use the relocatable binary model. |
<network>.c/.h | DEBUG/TEST purpose - generated network C-files which are used to generate the relocatable binary model. |
<network>_data.c/.h | DEBUG/TEST purpose - generated network data C-files which are used to generate the relocatable binary model. <network>_data.c is an empty file. |
<network>_img_rel.c/.h | DEBUG/TEST purpose - they facilitate the deployment of the relocatable binary model in a test framework like the aiSystemPerformance/aiValidation applications. They contain additional macros and a C byte array (image of the binary file) which can be used by the AI relocatable runtime API to install and use the model. The --no-c-files option flag can be used to avoid generating these additional files. |
All files are generated in the <output-directory-path> directory.
Memory layout information
The reported memory layout information completes the provided ROM and RAM memory size metrics (refer to the “Memory-related metrics” section) with the AI memory resources, including the specific AI code and data sections which are requested to run the AI stack. Apart from the additional sections to manage the relocatable binary model, the sizes of the other sections are similar to the “static” code generation approach where the NN C-files are compiled and linked with the code of the application. Only the requested size for the IO buffers is not reported here.
The following table summarizes the differences in terms of memory layout (in bytes) between the static and relocatable approaches. The sizes of the network/kernel sections depend on the topology complexity (number of nodes) and on the different forward kernel functions which are used. Activations and weights are always the same.
AI object | static | reloc | typically placed in |
---|---|---|---|
activations | 192 | 192 | RAM type (rw), .bss section |
weights | 15,560 | 15,560 | FLASH type (ro), .rodata section |
network/kernels (FLASH) | 25,308 | 26,024 | FLASH type (rx), .txt\.rodata\(.data) section |
network/kernels (RAM) | 1,888 | 1,996 | RAM type (rw), .data\.bss section |
Upgrading the firmware image
This step is out of the scope of this article; the underlying process is fully application dependent. For the next steps, the code snippets expect that the image (<network>_rel.bin) has been flashed in a memory-mapped region.

To use a relocatable binary model, a specific AI relocatable runtime API (a simple C-file) is requested to install and use it. It is available in the X-CUBE-AI pack and should be integrated during the generation of the firmware. Note that only the AI runtime header files are requested; the network_runtime.a library is not necessary.
CFLAGS += -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard
C_SOURCES += $X_CUBE_AI_DIR/Middlewares/ST/AI/Reloc/Src/ai_reloc_network.c
CFLAGS += -I$X_CUBE_AI_DIR/Middlewares/ST/AI/Inc
CFLAGS += -I$X_CUBE_AI_DIR/Middlewares/ST/AI/Reloc/Inc
Creating an instance
After the update of the model (binary file) inside the STM32 device at the address BIN_ADDRESS, the following code sequence can be used to create and install an instance of the generated model.
#include <ai_reloc_network.h>
ai_error err;
ai_rel_network_info rt_info;

err = ai_rel_network_rt_get_info(BIN_ADDRESS, &rt_info);
This retrieves part of the meta-information embedded in the header of the binary.
...
("Load a relocatable binary model, located at the address 0x%08x\r\n",
printf(int)BIN_ADDRESS);
(" model name : %s\r\n", rt_info.c_name);
printf(" weights size : %d bytes\r\n", (int)rt_info.weights_sz);
printf(" activations size : %d bytes (minimum)\r\n", (int)rt_info.acts_sz);
printf(" compiled for a Cortex-Mx : 0x%03X\r\n",
printf(int)AI_RELOC_RT_GET_CPUID(rt_info.variant));
(" FPU should be enabled : %s\r\n",
printf(rt_info.variant)?"yes":"no");
AI_RELOC_RT_FPU_USED(" RT RAM minimum size : %d bytes (%d bytes in COPY mode)\r\n",
printf(int)rt_info.rt_ram_xip,
(int)rt_info.rt_ram_copy);
...
To create an executable instance of the C-model, a dedicated memory buffer (also called AI RT RAM) should be provided. The minimum requested size is model and execution mode dependent. For the XIP execution mode (AI_RELOC_RT_LOAD_MODE_XIP), only a buffer (rw memory-mapped region) for the data sections is requested (minimum size = rt_info.rt_ram_xip). Note that the allocated buffer should be 4-byte aligned. For the COPY execution mode (AI_RELOC_RT_LOAD_MODE_COPY), the rt_info.rt_ram_copy minimum size is requested to be able to copy also the code sections. In this last case, the provided memory region should be executable.
ai_error err;
ai_handle net = AI_HANDLE_NULL;

uint8_t *rt_ai_ram = malloc(rt_info.rt_ram_xip);

err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ai_ram, rt_info.rt_ram_xip,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net);
Before installing and setting up the instance, the compatibility between the STM32 platform and the provided binary is verified, checking the Cortex-Mx ID and whether the FPU is enabled (if requested by the binary). If all is OK, an instance of the model is ready to be initialized and a handle is returned (net parameter).
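For example, a minimal error-handling sketch (assuming the ai_error type/code fields defined in ai_platform.h; the handling itself is application specific):

if (err.type != AI_ERROR_NONE) {
  /* the binary is not compatible with the platform (Cortex-Mx ID, FPU state)
     or the referenced object is invalid: the returned handle must not be used */
  printf("E: unable to install the model (type=0x%x, code=0x%x)\r\n",
         (int)err.type, (int)err.code);
  return -1;
}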
As for the “static” approach, the next step is to complete the internal data structure with the activations buffer and the weights buffer. Only the addresses of the associated buffers should be provided. If the weights are loaded as a separate file (--binary option flag), WEIGHTS_ADDRESS indicates the location where the weights have been placed.
ai_bool res;
ai_handle weights_addr;

uint8_t *act_addr = malloc(rt_info.acts_sz);

if (rt_info.weights)
  weights_addr = rt_info.weights;
else
  weights_addr = WEIGHTS_ADDRESS;

res = ai_rel_network_init(net, &weights_addr, &act_addr);
At this stage, the instance is fully ready to be used. To retrieve the whole set of attributes of the instantiated model, the ai_rel_network_get_report() function can be used.
ai_bool res;
ai_network_report net_info;

res = ai_rel_network_get_report(net, &net_info);
To avoid allocating the model-dependent memory regions through a system heap, a pre-allocated memory region can be used (AI_RT_ADDR address).
rt_ai_ram = (uint8_t *)AI_RT_ADDR;
act_addr  = rt_ai_ram + AI_RELOC_ROUND_UP(rt_info.rt_ram_xip);
Note
While the “static” approach allows only one instance at a time, there is no limitation here on the number of instances created for the same generated model. Each instance can be created with its own AI RT RAM area and initialized with its own activations buffer, so concurrent use cases can be implemented without specific synchronization.
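As an illustration, a minimal sketch creating two independent instances of the same binary (heap allocation is used here for brevity):

ai_handle net1 = AI_HANDLE_NULL;
ai_handle net2 = AI_HANDLE_NULL;

/* each instance requires its own AI RT RAM area */
uint8_t *rt_ram_1 = malloc(rt_info.rt_ram_xip);
uint8_t *rt_ram_2 = malloc(rt_info.rt_ram_xip);

ai_rel_network_load_and_create(BIN_ADDRESS, rt_ram_1, rt_info.rt_ram_xip,
                               AI_RELOC_RT_LOAD_MODE_XIP, &net1);
ai_rel_network_load_and_create(BIN_ADDRESS, rt_ram_2, rt_info.rt_ram_xip,
                               AI_RELOC_RT_LOAD_MODE_XIP, &net2);

/* net1/net2 are then each initialized with their own activations buffer */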
Running an inference
The function to run an inference is fully similar to the “static” case. The following code snippet illustrates the case where the generated model is defined with a single input and a single output tensor.
static int ai_run(void *data_in, void *data_out)
{
  ai_i32 batch;

  ai_buffer *ai_input = net_info.inputs;
  ai_buffer *ai_output = net_info.outputs;

  ai_input[0].data = AI_HANDLE_PTR(data_in);
  ai_output[0].data = AI_HANDLE_PTR(data_out);

  batch = ai_rel_network_run(net, ai_input, ai_output);
  if (batch != 1) {
    ai_log_err(ai_rel_network_get_error(net),
               "ai_rel_network_run");
    return -1;
  }
  return 0;
}
Tip
Properties of the input or output tensors are fully accessible through the ai_network_report struct, as for the “static” approach (refer to the “IO tensor description” [API] section). The payload can be allocated in the activations buffer without restrictions.
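For instance, a sketch under the assumption that the model was generated with the IO buffers allocated in the activations buffer; in that case, by analogy with the “static” approach, the data pointers of the report are expected to be already set after initialization:

/* assumption: IO buffers allocated in the activations buffer, so the
   data pointers of the report can be used directly as payload */
void *data_in  = net_info.inputs[0].data;
void *data_out = net_info.outputs[0].data;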
Generation flow
The following figure illustrates the flow to generate a relocatable binary model. The first step, importing the model and generating the NN C-files, is the same as for the “static” approach; only the <network>_data.c/.h files are not fully generated. The second step compiles and links the generated NN C-files against a specific AI runtime library; it is simply compiled with the relocatable options and embeds the requested mathematical and memcopy/memset functions. The last post-processing step generates the binary file by appending a specific section (.rel section) and various pieces of information which will be used by the AI relocatable run-time API. The weights are appended as a .weights binary section at the end of the file.
Note
The code can only be compiled with a GCC ARM Embedded tool-chain. It is compiled with the -fpic and -msingle-pic-base options. The ARM Cortex-M r9 register is used as the platform register for the Global Offset Table (GOT). The AI relocatable run-time is in charge of updating the r9 register before calling the code.
The generated relocatable binary object is independent of the end-user ARM embedded tool-chain used to build the end-user application. Consequently, for the same memory placement and HW settings, the inference time is the same.
AI run-time execution modes
XIP execution mode
This execution mode is the default use case where the code and the weights sections are stored in the STM32 embedded/internal flash. In terms of memory placement, this is similar to the “static” approach.
COPY execution mode
This alternative execution mode should be considered when the weights must be placed in an external memory device because they do not fit in the internal/embedded STM32 flash. Copying the code from a non-efficient executable memory region to a low-latency executable region allows to significantly improve the inference time. Note that the requested AI RT RAM size is larger and that the associated memory region should be executable. Another drawback, for Cortex-M4 based architectures (no core I/D cache available), is the contention due to the code and data memory accesses, which can degrade the performances. To avoid this drawback, the next use case should be considered.
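A possible load sequence for this mode (a sketch; RAM_EXEC_ADDR is a hypothetical address of an executable low-latency region):

/* AI RT RAM buffer large enough for the code and data sections,
   located in an executable memory region (hypothetical RAM_EXEC_ADDR) */
uint8_t *rt_ai_ram = (uint8_t *)RAM_EXEC_ADDR;

err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ai_ram, rt_info.rt_ram_copy,
                                     AI_RELOC_RT_LOAD_MODE_COPY, &net);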
XIP execution mode and separate weight binary file
This mode is an optimal case where the weights should be in an external memory device (<network>_data.bin file). It implies having a second internal/embedded flash region to store the code (<network>_rel.bin file). In this case, the critical code is executed in place. The drawback is having to manage two binary files for the upgrade.
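In this configuration, the initialization sketch from the “Creating an instance” section reduces to providing the external location directly (EXT_WEIGHTS_ADDR is a hypothetical placement of the <network>_data.bin file):

/* weights flashed separately (<network>_data.bin) in an external device
   at the hypothetical address EXT_WEIGHTS_ADDR */
ai_handle weights_addr = (ai_handle)EXT_WEIGHTS_ADDR;

res = ai_rel_network_init(net, &weights_addr, &act_addr);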
AI relocatable run-time API
The proposed API (called the AI relocatable run-time API) to manage the relocatable binary model is comparable to the embedded inference client API (refer to [[API]][X_CUBE_AI_API]) for the “standard” approach. Only the create and initialize functions have been enhanced to take the specificities into account. All functions are prefixed by ai_rel_network_ and they do not depend on the C-name of the model. They are defined and implemented in the ai_reloc_network.c/.h files ($X_CUBE_AI_DIR/Middlewares/ST/AI/Reloc/ folder).
ai_rel_network_rt_get_info()
ai_error ai_rel_network_rt_get_info(const void* obj, ai_rel_network_info* rt);
Allow to retrieve the dimensioning information to instantiate a relocatable binary model.
- a {AI_ERROR_INVALID_HANDLE, AI_ERROR_CODE_INVALID_PTR} error is returned if the referenced object is not valid (i.e. invalid signature or the address is not aligned on 4 bytes).
The following table describes the different fields available in the returned ai_rel_network_info C-struct.
field | description |
---|---|
c_name | pointer to the user C-name of the generated model (debug purpose) |
variant | 32-bit word. Holds the AI RT version, requested Cortex-M ID, … |
code_sz | code size in bytes (w/o the weight section) |
weights\weights_sz | address/size (in bytes) of the weight section if available |
acts_sz | requested activations size (RAM metric) to run the model |
rt_ram_xip | requested RAM size (in bytes) to install the model in XIP mode |
rt_ram_copy | requested RAM size (in bytes) to install the model in COPY mode |
ai_rel_network_load_and_create()
ai_error ai_rel_network_load_and_create(const void* obj, ai_handle ram_addr,
                                        ai_size ram_size, uint32_t mode, ai_handle* hdl);
ai_handle ai_rel_network_destroy(ai_handle hdl);
Create and install an instance of the relocatable binary model (referenced by obj). A RW memory buffer (ram_addr/ram_size) should be provided to create the data sections (.data/.bss/.got) and to fix/resolve the internal references during the relocation process. The mode parameter indicates the expected execution mode. The expected size for the AI RT RAM buffer can be retrieved with the ai_rel_network_rt_get_info() function.
mode | description |
---|---|
AI_RELOC_RT_LOAD_MODE_XIP | XIP execution mode is requested |
AI_RELOC_RT_LOAD_MODE_COPY | COPY execution mode is requested |
- ai_handle references a run-time context (opaque object) which must be used with the other functions.
- before creating the instance, the Cortex-M ID is verified. If requested, the function also checks that the FPU is enabled.
Note
If ram_addr and/or ram_size are NULL, a default allocation is done through the system heap. This behavior can be overwritten in the ai_reloc_network.c file, see the AI_RELOC_MALLOC macro definition.
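For example (a minimal sketch), the allocation of the AI RT RAM buffer can be left to the run-time:

/* ram_addr = NULL / ram_size = 0: the AI RT RAM buffer is allocated
   internally (through the AI_RELOC_MALLOC macro, system heap by default) */
err = ai_rel_network_load_and_create(BIN_ADDRESS, NULL, 0,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net);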
ai_rel_network_init()
ai_bool ai_rel_network_init(ai_handle hdl, const ai_handle *weights,
                            const ai_handle *act);
Finalize the initialization of the instance with the addresses of the weights and the activations buffer.
- if the weights are stored in the relocatable binary object, ai_rel_network_rt_get_info() should be used to retrieve their address.
- as for the “static” approach, an activations buffer (or multiple buffers) should also be provided.
ai_handle weights_addr;
ai_handle activations;
...
const ai_handle acts[] = { activations };
res = ai_rel_network_init(net, &weights_addr, acts);
...
ai_rel_network_get_info()
ai_bool ai_rel_network_get_info(ai_handle hdl, ai_network_report* report);
Allow to retrieve the run-time data attributes of an instantiated model. Refer to the ai_platform.h file for the details of the returned ai_network_report C-struct. It should be called after ai_rel_network_init().
ai_rel_network_get_error()
ai_error ai_rel_network_get_error(ai_handle hdl);
Return the first error reported during the execution of an ai_rel_network_xxx() function.

- see the ai_platform.h file for the list of the returned error types (ai_error_type) and associated codes (ai_error_code).
ai_rel_network_run()
ai_i32 ai_rel_network_run(ai_handle hdl, const ai_buffer* input, ai_buffer* output);
Perform one or more inferences. The input and output buffer parameters (ai_buffer type) allow to provide the input tensors and to store the predicted output tensors respectively (refer to the [“IO tensor description” [API]][API_io_tensor] section).
ai_rel_platform_observer_register()
ai_bool ai_rel_platform_observer_register(ai_handle hdl,
                ai_observer_node_cb cb, ai_handle cookie, ai_u32 flags);
ai_bool ai_rel_platform_observer_unregister(ai_handle hdl,
                ai_observer_node_cb cb, ai_handle cookie);
ai_bool ai_rel_platform_observer_node_info(ai_handle hdl,
                ai_observer_node *node_info);
As for the “static” approach, these functions allow to register a user callback to be notified before and/or after the execution of a c-node. There is no restriction on the usage of the Platform Observer API with a relocatable binary model.
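As an illustration only, a possible registration sketch; the exact ai_observer_node_cb callback signature and the AI_OBSERVER_PRE_EVT/AI_OBSERVER_POST_EVT flags are assumed from the ai_platform.h of the “static” Platform Observer API:

/* assumed callback prototype (check ai_platform.h of the installed pack) */
static ai_u32 node_cb(const ai_handle cookie, const ai_u32 flags,
                      const ai_observer_node *node)
{
  if (flags & AI_OBSERVER_POST_EVT)  /* assumed flag: after c-node execution */
    printf("c-node executed\r\n");
  return 0;
}

...
ai_rel_platform_observer_register(net, node_cb, NULL,
                                  AI_OBSERVER_PRE_EVT | AI_OBSERVER_POST_EVT);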