TensorFlow Lite for Microcontrollers support
for the STM32 target, based on ST Edge AI Core Technology 2.2.0
r3.1
Overview
X-CUBE-AI solution
The X-CUBE-AI Expansion Package integrates a specific path that generates a ready-to-use STM32 IDE project embedding a [TensorFlow Lite for Microcontrollers][REF_TFLM] runtime (also called TFLm in this article) and its associated TFLite model. This can be considered as an alternative to the default X-CUBE-AI solution to deploy an AI application based on a TFLite model.
Thanks to STM32CubeMX and its ecosystem, the user can create and configure, in one click, a complete STM32 firmware including the different middleware/drivers and the sources of the TFLm interpreter and associated kernels. These sources are directly added to the exported source tree, and the build system is configured with speed-optimized options.
- The aiSystemPerformance and aiValidation test applications can be used to evaluate the performance of the deployed TFLite model. The aiTemplate application is also available as a starting point to develop an application.
- All STM32 IDE toolchains are supported:
  - STMicroelectronics - STM32CubeIDE version 1.0.1 or later
  - Keil® - MDK-ARM Professional Version - µVision® with Arm Compiler v6
  - IAR Systems - IAR Embedded Workbench® IDE - ARM v8.x
  - GNU Arm Embedded toolchain
- Only the required TFLm source files (.h and .cc files) from the official https://github.com/tensorflow/tflite-micro GitHub repository (sha-1 = 6a1803, main branch) are included in the X-CUBE-AI pack. The pack also contains the CMSIS files (in particular the CMSIS-NN files) required to implement the optimized version of some operators. The version of the source files is aligned with the version of the TensorFlow Python module used by the X-CUBE-AI pack.
- The TFLm importer module processes the TFLite file to retrieve the list of used TFLite operators and to estimate the required 'arena' size. This information is used to generate the simple C-wrapper files, which can be used directly (or adapted) by the client application.
- no command-line option is available to generate the STM32 IDE project.
TFLite file?
The TensorFlow Lite framework is used to deploy a deep learning model on mobile and embedded devices. The generated TFLite file is a self-contained file containing a frozen description of the graph (or inference model), the settings of the operators, and the tensors (including the data). 32-bit floating-point and quantized models are supported.
This file is used directly by a runtime interpreter, see [TensorFlow Lite for Microcontrollers][REF_TFLM], or as the entry point for a compiler such as the Coral Edge TPU compiler, or a code generator such as X-CUBE-AI, to create an adapted and optimized version targeting particular hardware: MPU, MCU, or hardware-assist IP.
From a functional point of view, the content of the TFLite file is similar to the generated <network>.c and <network>_data.c files. The implementation of the kernels is provided separately: as a static library for the default X-CUBE-AI solution, or by an embedded runtime (interpreter approach) for the TFLm solution.
TensorFlow Lite Micro stack
The following figure illustrates a typical TFLm stack. The 'tflite::MicroInterpreter' class is used by the client application to create and use an instance of the model. The 'arena' buffer is required to allocate the internal, input, and output tensors and the associated data structures used to manage the instance (see the “Memory Management in TensorFlow Lite Micro” article for more details). No system heap is required. The model itself should be passed as a simple memory-mapped c-array. The “Get started with microcontrollers” article explains step by step how to train a model and run inference with the C++ API (hello-world example).
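As an illustration, a minimal sketch of this flow based on the public TFLm C++ API is shown below. The model and buffer names are illustrative, and the exact 'tflite::MicroInterpreter' constructor signature varies slightly between TFLm versions:

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

/* Memory-mapped c-array holding the TFLite file
   (illustrative name, generated by X-CUBE-AI in network.c) */
extern const unsigned char g_model_data[];

/* 'arena' sized for the model; X-CUBE-AI pre-calculates this value */
constexpr size_t kArenaSize = 20 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

int run_inference(const float *in, size_t n_in, float *out, size_t n_out)
{
  const tflite::Model *model = tflite::GetModel(g_model_data);

  /* Register only the operators used by the model (see "Resolver" below) */
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  /* All tensors and management structures are allocated inside the
     'arena': no system heap is used. */
  static tflite::MicroInterpreter interpreter(model, resolver,
                                              tensor_arena, kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk)
    return -1;

  for (size_t i = 0; i < n_in; ++i)
    interpreter.input(0)->data.f[i] = in[i];

  if (interpreter.Invoke() != kTfLiteOk)
    return -1;

  for (size_t i = 0; i < n_out; ++i)
    out[i] = interpreter.output(0)->data.f[i];
  return 0;
}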
Resolver
The resolver module can be considered as a registry used by the interpreter to access the operators required by the model. By default, a generic resolver can be declared to support all available operators. In this case, the code of all the operators is embedded in the firmware.
static tflite::AllOpsResolver resolver;
To avoid this situation, it is possible to declare a specific resolver and to register only the required operators. As for the size of the 'arena' buffer, the X-CUBE-AI TFLm importer generates this list automatically, to optimize the required flash size (see the 'tflm_c.cc' file).
static tflite::MicroMutableOpResolver<5> resolver;
resolver.AddConv2D();
resolver.AddDepthwiseConv2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddMaxPool2D();
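Note that the template parameter of 'MicroMutableOpResolver' (here 5) defines the maximum number of operators that can be registered; it must be greater than or equal to the number of Add...() calls.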
Supported operators
As for the TFLite operators [supported][X_CUBE_AI_TFLITE_TOOLBOX] by X-CUBE-AI, not all TFLite operators are supported by the TFLm runtime. The list of supported operators can be found in the following file: tensorflow\lite\micro\kernels\micro_ops.h
Operator (tflite namespace) | [X-CUBE-AI][X_CUBE_AI_TFLITE_TOOLBOX] support |
---|---|
ABS | yes |
ADD | yes |
ADD_N | - |
ARG_MAX | yes |
ARG_MIN | yes |
ASSIGN_VARIABLE | - |
AVERAGE_POOL_2D | yes |
BATCH_TO_SPACE_ND | yes |
BROADCAST_ARGS | - |
BROADCAST_TO | - |
CALL_ONCE | - |
CAST | yes |
CEIL | yes |
CIRCULAR_BUFFER | - |
CUMSUM | - |
CONCATENATION | yes |
CONV_2D | yes |
COS | yes |
DEPTH_TO_SPACE | - |
DEPTHWISE_CONV_2D | yes |
DEQUANTIZE | yes (including uint8 type) |
DIV | yes |
ELU | yes |
EQUAL | yes |
ETHOSU | yes |
EXP | yes |
EXPAND_DIMS | yes |
FILL | yes |
FLOOR | yes |
FLOOR_DIV | yes |
FLOOR_MOD | yes |
FULLY_CONNECTED | yes |
GATHER | yes |
GATHER_ND | - |
GREATER | yes |
GREATER_EQUAL | yes |
HARD_SWISH | yes |
IF | - |
L2_NORMALIZATION | yes |
L2_POOL_2D | - |
LEAKY_RELU | yes |
LESS | yes |
LESS_EQUAL | yes |
LOG | yes |
LOGICAL_AND | yes |
LOGICAL_NOT | yes |
LOGICAL_OR | yes |
LOG_SOFTMAX | yes |
LOGISTIC | yes |
MAX_POOL_2D | yes |
MAXIMUM | yes |
MEAN | yes |
MINIMUM | yes |
MIRROR_PAD | yes |
MUL | yes |
NEG | yes |
NOT_EQUAL | yes |
PACK | yes |
PAD | yes |
PADV2 | yes |
PRELU | yes |
QUANTIZE | yes (including uint8 type) |
READ_VARIABLE | - |
REDUCE_MAX | yes |
RELU | yes |
RELU6 | yes |
RESIZE_BILINEAR | yes |
RESIZE_NEAREST_NEIGHBOR | yes |
RSQRT | yes |
SELECT_V2 | yes |
SHAPE | yes |
SIN | yes |
SLICE | yes |
SOFTMAX | yes |
SPACE_TO_BATCH_ND | yes |
SPACE_TO_DEPTH | yes |
SPLIT | yes |
SPLIT_V | yes |
SQRT | yes |
SQUARE | yes |
SQUARED_DIFFERENCE | yes |
SQUEEZE | yes |
STRIDED_SLICE | yes |
SUB | yes |
SUM | yes |
SVDF | - |
TANH | yes |
TRANSPOSE | yes |
TRANSPOSE_CONV | yes |
UNIDIRECTIONAL_SEQUENCE_LSTM | yes (float only) |
UNPACK | yes |
VAR_HANDLE | - |
WHILE | - |
ZEROS_LIKE | - |
Operator (tflite.ops.micro namespace) | [X-CUBE-AI][X_CUBE_AI_TFLITE_TOOLBOX] support |
---|---|
RESHAPE | yes |
ROUND | yes |
Optimized CMSIS-NN-based operators are exported by X-CUBE-AI (tensorflow\lite\micro\kernels\cmsis_nn\ directory). This is equivalent to using the 'TAGS=cmsis-nn' option to build the library with the original TFLm build system.
ADD, CONV_2D, DEPTHWISE_CONV_2D, FULLY_CONNECTED, MUL, AVERAGE_POOL_2D,
MAX_POOL_2D, SOFTMAX, SVDF
Note
The source files of the default version of the optimized operators are not available in the X-CUBE-AI pack.
Generated files
X-CUBE-AI exports a set of additional files, which can be considered as helper files to facilitate the usage of the TFLm stack. Note that the exported TFLm files respect the source tree of the original repository.
%root_project_directory%
|-- Middlewares
| \_ tensorflow /* TensorFlow lite for micro files */
| |_ tensorflow
| | \_ lite ..
| | |- c
| | |- core
| | |- kernels
| | \ ..
| \_ third_party ..
| |- cmsis
| \ ..
|
|-- X-CUBE-AI -- App /* Generated files (c-wrapper) */
| |- network.c
.. |- network_tflite_data.h
|- tflm_c.cc
|- tflm_c.h
|- debug_log_imp.cc
..
The tflm_c.h file provides an optional light TFLite inference C-API on top of the TFLite interpreter C++ API. It provides the required services to initialize a model and to use it. For debug/profiling purposes, a profiler API is also provided.
file | description |
---|---|
network.c | contains the c-array representation of the TFLite file |
network_tflite_data.h | contains the pre-calculated 'arena' size: TFLM_NETWORK_TENSOR_AREA_SIZE |
debug_log_imp.cc | implements the DebugLog() function which is required by the TFLm files |
tflm_c.cc | implements the c-wrapper. Most of it is generic; only the creation of the specialized resolver is model-specific. The global 'TFLM_RUNTIME_USE_ALL_OPERATORS' C-define can be set to '1' to use the 'AllOpsResolver' object |
tflm_c.h | c-wrapper definition |
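As an illustration, the sketch below shows how these generated files fit together from the client application. The wrapper function names ('tflm_c_create', 'tflm_c_invoke') are hypothetical placeholders; refer to the generated tflm_c.h for the actual prototypes.

#include "network_tflite_data.h"  /* TFLM_NETWORK_TENSOR_AREA_SIZE */
#include "tflm_c.h"               /* light C-API on top of the TFLm C++ API */

/* 'arena' buffer sized with the pre-calculated value */
static uint8_t tensor_arena[TFLM_NETWORK_TENSOR_AREA_SIZE];

/* c-array representation of the TFLite file
   (illustrative name, see network.c) */
extern const unsigned char g_network_model_data[];

void example(void)
{
  uint32_t hdl;

  /* hypothetical wrapper calls -- actual prototypes are in tflm_c.h */
  tflm_c_create(g_network_model_data, tensor_arena,
                TFLM_NETWORK_TENSOR_AREA_SIZE, &hdl);

  /* ... fill the input tensor(s), then run the inference ... */
  tflm_c_invoke(hdl);
}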
AI System Performance application
The AI System Performance test application allows evaluating the inference time. Random data are used to feed the model; the outputs are skipped. The reported inference time ('duration') is an average over 16 inferences. The effectively used 'arena' size ('Allocated size') is also reported.
Note
With a GCC-based project, the following linker options should be used to also monitor the used stack and the used heap during the execution of an inference:
-u _printf_float -Wl,--wrap=malloc -Wl,--wrap=free
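For reference, the '--wrap' options redirect each malloc()/free() call to user-provided __wrap_malloc()/__wrap_free() functions, which can record a heap high-water mark. Below is a minimal sketch of such instrumentation, assuming a newlib-based toolchain providing malloc_usable_size(); the instrumentation embedded in the test application may differ.

#include <malloc.h>   /* newlib: malloc_usable_size() */
#include <stddef.h>

extern void *__real_malloc(size_t size);
extern void  __real_free(void *ptr);

static size_t heap_in_use;   /* currently allocated bytes */
static size_t heap_max;      /* high-water mark */

void *__wrap_malloc(size_t size)
{
  void *ptr = __real_malloc(size);
  if (ptr != NULL) {
    heap_in_use += malloc_usable_size(ptr);
    if (heap_in_use > heap_max)
      heap_max = heap_in_use;
  }
  return ptr;
}

void __wrap_free(void *ptr)
{
  if (ptr != NULL)
    heap_in_use -= malloc_usable_size(ptr);
  __real_free(ptr);
}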
#
# AI system performance measurement TFLM 2.1
#
Compiled with GCC 9.3.1
STM32 Runtime configuration...
Device : DevID:0x0431 (STM32F411xC/E) RevID:0x1000
Core Arch. : M4 - FPU used
HAL version : 0x01070600
system clock : 100 MHz
FLASH conf. : ACR=0x00000703 - Prefetch=True $I/$D=(True,True) latency=3
Timestamp : SysTick + DWT (HAL_Delay(1)=1.004 ms)
Instancing the network.. (cWrapper: v2.0)
TFLM version : 2.10.0
TFLite file : 0x0801bae8 (16956 bytes)
Arena location : 0x200000d0
Opcode size : 2
Operator size : 4
Tensor size : 11
Allocated size : 1232 / 20480
Inputs size : 1
- 0:FLOAT32:396:(1, 1, 1, 99)
Outputs size : 1
- 0:FLOAT32:20:(1, 1, 1, 5)
Used heap : 224 bytes (max=224 bytes) (for c-wrapper 2.0)
Running PerfTest with random inputs (16 iterations)...
................
Results TFLM 2.10.0, 16 inferences @100MHz/100MHz
duration : 0.404 ms (average)
CPU cycles : 40404 (average)
CPU Workload : 0% (duty cycle = 1s)
used stack : 380 bytes
used heap : 0:0 0:0 (req:allocated,req:released) max=0 cur=0 (cfg=3)
Inference time by c-node
kernel : 0.393ms (time passed in the c-kernel fcts)
user : 0.022ms (time passed in the user cb)
idx name time (ms)
---------------------------------------------------
0 FULLY_CONNECTED 0.305 77.75 %
1 FULLY_CONNECTED 0.057 14.63 %
2 FULLY_CONNECTED 0.017 4.45 %
3 SOFTMAX 0.012 3.17 %
---------------------------------------------------
0.393 ms
AI Validation application
Warning
Validation on desktop is not supported.
Note
The ai_runner module for a Python-based environment can also be used with an STM32 aiValidation firmware based on the TFLm runtime.
If the STM32 is flashed with the TFLm aiValidation test application, the following validation flow can be used. It allows using the validate command with random data or user data to evaluate different metrics.
$ stm32ai validate <tflite_file_model> --mode stm32
Neural Network Tools for STM32AI v1.6.0 (STM.ai v7.3.0)
Setting validation data...
generating random data, size=10, seed=42, range=default
I[1]: (10, 1, 1, 99)/float32, min/max=[0.005, 1.000], mean/std=[0.490, 0.292], dense_1_input
No output/reference samples are provided
Running the STM AI c-model (AI RUNNER)...(name=network, mode=stm32)
STM Proto-buffer protocol 2.2 (SERIAL:COM5:115200:connected) ['network']
Summary "network" - ['network']
--------------------------------------------------------------------------------
inputs/outputs : 1/1
input_1 : (1,1,1,99), float32, 396 bytes, user, in activations buffer
ouputs_1 : (1,1,1,5), float32, 20 bytes, user, in activations buffer
n_nodes : 4
compile_datetime : Sep 8 2022 12:17:09 (NULL)
activations : 1232
weights : 16956
macc : n.a.
--------------------------------------------------------------------------------
runtime : Protocol 2.3 - TFLM (/gcc) 2.10.0 (Tools 2.10.0)
capabilities : ['IO_ONLY']
device : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
art_prefetch,art_icache,art_dcache
--------------------------------------------------------------------------------
Warning: C-network signature checking has been skipped
Results for 10 inference(s) - average per inference
device : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
art_prefetch,art_icache,art_dcache
duration : 0.395ms
CPU cycles : 39538
cycles/MACC : n.a.
Running the TFlite model...
Saving validation data...
output directory: C:\local\ai_tests\non_reg\stm32ai_output
creating C:\local\ai_tests\non_reg\stm32ai_output\network_val_io.npz
m_outputs_1: (10, 1, 1, 5)/float32, min/max=[0.000, 0.998], mean/std=[0.200, 0.330], nl_3
c_outputs_1: (10, 1, 1, 5)/float32, min/max=[0.000, 0.998], mean/std=[0.200, 0.330], nl_3
Computing the metrics...
Cross accuracy report #1 (reference vs C-model)
--------------------------------------------------------------------------------
notes: - the output of the reference model is used as ground truth/reference value
- 10 samples (5 items per sample)
acc=100.00%, rmse=0.000000245, mae=0.000000093, l2r=0.000000634
5 classes (10 samples)
---------------------------------
C0 0 . . . .
C1 . 0 . . .
C2 . . 0 . .
C3 . . . 8 .
C4 . . . . 2
Evaluation report (summary)
------------------------------ ... ------------------------------------------------
Output acc rmse ... std tensor
------------------------------ ... ------------------------------------------------
X-cross #1 100.00% 0.000000245 ... 0.000000247 nl_3, ai_float, (1,1,1,5), m_id=[3]
------------------------------ ... ------------------------------------------------
Creating txt report file <output-directory-path>\network_validate_report.txt
elapsed time (validate): 1.786s