ST Edge AI Core

TensorFlow Lite for Microcontrollers support


for STM32 target, based on ST Edge AI Core Technology 2.2.0



r3.1

Overview

X-CUBE-AI solution

The X-CUBE-AI Expansion Package integrates a specific path that generates a ready-to-use STM32 IDE project embedding a [TensorFlow Lite for Microcontrollers][REF_TFLM] runtime (also called TFLm in this article) and its associated TFLite model. This can be considered an alternative to the default X-CUBE-AI solution for deploying an AI application based on a TFLite model.

TFLm in X-CUBE-AI

Thanks to STM32CubeMX and its ecosystem, the user can create and configure, in one click, a complete STM32 firmware including various middleware/drivers and the sources of the TFLm interpreter and its associated kernels. These are added directly to the exported source tree, and the build system is adapted to use the compiler options for speed.

  • aiSystemPerformance and aiValidation test applications can be used to evaluate the performance of the deployed TFLite model. The aiTemplate application is also available as a starting point to develop an application.
  • all STM32 IDE toolchains are supported:
    • STMicroelectronics - STM32CubeIDE version 1.0.1 or later
    • Keil® - MDK-ARM Professional Version - µVision® with Arm Compiler v6
    • IAR Systems - IAR Embedded Workbench® IDE - ARM v8.x
    • GNU Arm Embedded toolchain
  • only the TFLm source files (.h and .cc files) that are required are included in the X-CUBE-AI pack, taken from the official https://github.com/tensorflow/tflite-micro GitHub repository (sha-1 = 6a1803, main branch). The pack also contains the CMSIS files (in particular the CMSIS-NN files) that are required to implement the optimized versions of some operators. The version of the source files is aligned with the version of the TensorFlow Python module used by the X-CUBE-AI pack.
  • the TFLm importer module processes the TFLite file to retrieve the list of TFLite operators used and to estimate the required 'arena' size. This information is used to generate the simple C-wrapper files, which can be used directly (or not) by the client application.
  • no command-line option is available to generate the STM32 IDE project.

TFLite file?

The TensorFlow Lite framework is used to deploy a deep learning model on mobile and embedded devices. The generated TFLite file is a self-contained file containing a frozen description of the graph (or inference model), the settings of the operators, and the tensors (including the data). 32-bit floating-point and quantized models are supported.

This file is used directly by a runtime interpreter (see [TensorFlow Lite for Microcontrollers][REF_TFLM]), or as the entry point for a compiler such as the Coral Edge TPU compiler, or for a code generator such as X-CUBE-AI, to create an adapted and optimized version targeting particular hardware: MPU, MCU, or hardware-assist IP.

TFLite file deployment

From a functional point of view, the content of the TFLite file is similar to the generated <network>.c and <network>_data.c files. The implementation of the kernels is provided separately: as a static library for the default X-CUBE-AI solution, or by an embedded runtime (interpreter approach) for the TFLm solution.

TensorFlow Lite Micro stack

The following figure illustrates a typical TFLm stack. The 'tflite::MicroInterpreter' class is used by the client application to create and use an instance of the model. The 'arena' buffer is required to allocate the internal, input, and output tensors, and the associated data structures used to manage the instance (see the “Memory Management in TensorFlow Lite Micro” article for more details). No system heap is required. The model itself is passed as a simple memory-mapped c-array. The “Get started with microcontrollers” article explains step by step how to train a model and run inference with the C++ API (hello-world example). A minimal sketch of this life cycle is shown after the figure.

Default TFLite for Microcontrollers stack
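
The sketch below illustrates this life cycle under stated assumptions: the 'g_model_data' c-array name, the arena size, and the registered operators are placeholders (X-CUBE-AI generates the equivalent code in the 'tflm_c.cc' wrapper), and depending on the exact TFLm revision the 'tflite::MicroInterpreter' constructor may accept additional optional arguments.

#include <cstdint>
#include <cstring>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];   // placeholder: memory-mapped c-array (TFLite file)

constexpr size_t kArenaSize = 20 * 1024;     // placeholder: estimated 'arena' size
alignas(16) static uint8_t tensor_arena[kArenaSize];

bool run_inference(const float* in, size_t in_len, float* out, size_t out_len) {
  // Map the flatbuffer and check that the schema versions match.
  const tflite::Model* model = tflite::GetModel(g_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) return false;

  // Register only the operators used by the model (placeholder list).
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  // All tensors and management structures are carved out of the arena;
  // no system heap is used.
  static tflite::MicroInterpreter interpreter(model, resolver,
                                              tensor_arena, kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return false;

  std::memcpy(interpreter.input(0)->data.f, in, in_len * sizeof(float));
  if (interpreter.Invoke() != kTfLiteOk) return false;
  std::memcpy(out, interpreter.output(0)->data.f, out_len * sizeof(float));
  return true;
}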

Resolver

The resolver module can be considered a registry used by the interpreter to access the operators required by the model. By default, a generic resolver can be declared to support all available operators. In this case, the code of all operators is embedded in the firmware.

static tflite::AllOpsResolver resolver;

To avoid this, it is possible to declare a specific resolver and register only the required operators. As for the size of the arena buffer, the X-CUBE-AI TFLm importer automatically generates this list to optimize the required flash size (see the 'tflm_c.cc' file).

static tflite::MicroMutableOpResolver<5> resolver;
resolver.AddConv2D();
resolver.AddDepthwiseConv2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddMaxPool2D();
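
Note that the template parameter of 'tflite::MicroMutableOpResolver' (here 5) sets the maximum number of operators that can be registered; it must be at least equal to the number of 'Add…()' calls.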

Supported operators

As with the TFLite operators [supported][X_CUBE_AI_TFLITE_TOOLBOX] by X-CUBE-AI, not all TFLite operators are supported by the TFLm runtime. The list of supported operators can be found in the following file: tensorflow\lite\micro\kernels\micro_ops.h

Operator (tflite namespace)     [X-CUBE-AI][X_CUBE_AI_TFLITE_TOOLBOX] support
------------------------------------------------------------------------------
ABS                             yes
ADD                             yes
ADD_N                           -
ARG_MAX                         yes
ARG_MIN                         yes
ASSIGN_VARIABLE                 -
AVERAGE_POOL_2D                 yes
BATCH_TO_SPACE_ND               yes
BROADCAST_ARGS                  -
BROADCAST_TO                    -
CALL_ONCE                       -
CAST                            yes
CEIL                            yes
CIRCULAR_BUFFER                 -
CUMSUM                          -
CONCATENATION                   yes
CONV_2D                         yes
COS                             yes
DEPTH_TO_SPACE                  -
DEPTHWISE_CONV_2D               yes
DEQUANTIZE                      yes (including uint8 type)
DIV                             yes
ELU                             yes
EQUAL                           yes
ETHOSU                          yes
EXP                             yes
EXPAND_DIMS                     yes
FILL                            yes
FLOOR                           yes
FLOOR_DIV                       yes
FLOOR_MOD                       yes
FULLY_CONNECTED                 yes
GATHER                          yes
GATHER_ND                       -
GREATER                         yes
GREATER_EQUAL                   yes
HARD_SWISH                      yes
IF                              -
L2_NORMALIZATION                yes
L2_POOL_2D                      -
LEAKY_RELU                      yes
LESS                            yes
LESS_EQUAL                      yes
LOG                             yes
LOGICAL_AND                     yes
LOGICAL_NOT                     yes
LOGICAL_OR                      yes
LOG_SOFTMAX                     yes
LOGISTIC                        yes
MAX_POOL_2D                     yes
MAXIMUM                         yes
MEAN                            yes
MINIMUM                         yes
MIRROR_PAD                      yes
MUL                             yes
NEG                             yes
NOT_EQUAL                       yes
PACK                            yes
PAD                             yes
PADV2                           yes
PRELU                           yes
QUANTIZE                        yes (including uint8 type)
READ_VARIABLE                   -
REDUCE_MAX                      yes
RELU                            yes
RELU6                           yes
RESIZE_BILINEAR                 yes
RESIZE_NEAREST_NEIGHBOR         yes
RSQRT                           yes
SELECT_V2                       yes
SHAPE                           yes
SIN                             yes
SLICE                           yes
SOFTMAX                         yes
SPACE_TO_BATCH_ND               yes
SPACE_TO_DEPTH                  yes
SPLIT                           yes
SPLIT_V                         yes
SQRT                            yes
SQUARE                          yes
SQUARED_DIFFERENCE              yes
SQUEEZE                         yes
STRIDED_SLICE                   yes
SUB                             yes
SUM                             yes
SVDF                            -
TANH                            yes
TRANSPOSE                       yes
TRANSPOSE_CONV                  yes
UNIDIRECTIONAL_SEQUENCE_LSTM    yes (float only)
UNPACK                          yes
VAR_HANDLE                      -
WHILE                           -
ZEROS_LIKE                      -

Operator (tflite.ops.micro namespace)    [X-CUBE-AI][X_CUBE_AI_TFLITE_TOOLBOX] support
---------------------------------------------------------------------------------------
RESHAPE                                  yes
ROUND                                    yes

Optimized CMSIS-NN-based operators are exported by X-CUBE-AI (tensorflow\lite\micro\kernels\cmsis_nn\ directory). This is equivalent to using the 'TAGS=cmsis-nn' option to build the library with the original TFLm build system. The following operators have an optimized implementation (see also the note after the list).

ADD, CONV_2D, DEPTHWISE_CONV_2D, FULLY_CONNECTED, MUL, AVERAGE_POOL_2D,
MAX_POOL_2D, SOFTMAX, SVDF
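
When the CMSIS-NN kernels are used, the operator registration code (for example 'resolver.AddConv2D()') is unchanged: the optimized implementation is selected at build time, because the kernel source files from the cmsis_nn directory replace the reference implementations of the same name.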

Note

The source files of the default (reference) version of these optimized operators are not included in the X-CUBE-AI pack.

Generated files

X-CUBE-AI exports a set of additional files that can be considered helper files to facilitate the usage of the TFLm stack. Note that the exported TFLm files respect the source tree layout of the original repository.

%root_project_directory%
    |-- Middlewares
    |     \_ tensorflow                             /* TensorFlow lite for micro files */
    |          |_ tensorflow
    |          |    \_ lite  ..
    |          |       |- c
    |          |       |- core
    |          |       |- kernels
    |          |       \ ..
    |          \_ third_party ..
    |                |- cmsis
    |                \ ..
    |
    |-- X-CUBE-AI -- App                            /* Generated files (c-wrapper) */
    |                 |- network.c
    ..                |- network_tflite_data.h
                      |- tflm_c.cc
                      |- tflm_c.h
                      |- debug_log_imp.cc
                      ..

The tflm_c.h file provides an optional lightweight TFLite inference C-API on top of the TFLite interpreter C++ API. It provides the services required to initialize a model and to use it. For debug/profiling purposes, a profiler API is also provided.

TFLm C-wrapper

file                   description
------------------------------------------------------------------------------
network.c              contains the c-array representation of the TFLite file
network_tflite_data.h  contains the pre-calculated arena size
                       (TFLM_NETWORK_TENSOR_AREA_SIZE)
debug_log_imp.cc       implements the DebugLog() function which is required by
                       the TFLm files (see the sketch after this table)
tflm_c.cc              implements the c-wrapper. Most of it is generic; only
                       the creation of the specialized resolver is
                       model-specific. The global
                       'TFLM_RUNTIME_USE_ALL_OPERATORS' C-define can be set to
                       '1' to use the 'AllOpsResolver' object instead.
tflm_c.h               c-wrapper definitions
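
As an illustration, a possible retarget of DebugLog() is sketched below. It assumes the single-string DebugLog() prototype of the TFLm 2.10-era sources, an STM32F4 project, and a 'huart2' UART handle created by STM32CubeMX; all three are assumptions, and the generated debug_log_imp.cc may retarget the output differently.

#include <cstring>

#include "stm32f4xx_hal.h"                   // assumption: STM32F4 series project
#include "tensorflow/lite/micro/debug_log.h"

extern UART_HandleTypeDef huart2;            // assumption: UART handle created by STM32CubeMX

extern "C" void DebugLog(const char* s) {
  // Forward the TFLm trace/error messages to the serial console.
  HAL_UART_Transmit(&huart2, (uint8_t*)s, (uint16_t)std::strlen(s), HAL_MAX_DELAY);
}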

AI System Performance application

The AI System Performance test application allows evaluating the inference time. Random data are used to feed the model, and the outputs are skipped. The reported inference time ('duration') is an average over 16 inferences. The effectively used arena size ('Allocated size') is also reported.

Note

With a GCC-based project, the following linker options should be used to also monitor the stack and heap used during the execution of an inference (a sketch of the wrap mechanism follows): -u _printf_float -Wl,--wrap=malloc -Wl,--wrap=free
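
For reference, a minimal sketch of the GNU ld '--wrap' mechanism assumed by these options is shown below; the counters are illustrative, and a complete implementation would need to track the size associated with each pointer.

#include <cstddef>

extern "C" void* __real_malloc(size_t size);   // resolved by the linker to the real malloc
extern "C" void  __real_free(void* ptr);

static size_t heap_in_use;                     // illustrative counters
static size_t heap_max;

extern "C" void* __wrap_malloc(size_t size) {
  // Called in place of malloc() thanks to -Wl,--wrap=malloc.
  heap_in_use += size;
  if (heap_in_use > heap_max) heap_max = heap_in_use;
  return __real_malloc(size);
}

extern "C" void __wrap_free(void* ptr) {
  // Called in place of free(); per-pointer size tracking is omitted here.
  __real_free(ptr);
}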

#
# AI system performance measurement TFLM 2.1
#
Compiled with GCC 9.3.1
STM32 Runtime configuration...
 Device       : DevID:0x0431 (STM32F411xC/E) RevID:0x1000
 Core Arch.   : M4 - FPU  used
 HAL version  : 0x01070600
 system clock : 100 MHz
 FLASH conf.  : ACR=0x00000703 - Prefetch=True $I/$D=(True,True) latency=3
 Timestamp    : SysTick + DWT (HAL_Delay(1)=1.004 ms)

Instancing the network.. (cWrapper: v2.0)
 TFLM version       : 2.10.0
 TFLite file        : 0x0801bae8 (16956 bytes)
 Arena location     : 0x200000d0
 Opcode size        : 2
 Operator size      : 4
 Tensor size        : 11
 Allocated size     : 1232 / 20480
 Inputs size        : 1
 - 0:FLOAT32:396:(1, 1, 1, 99)
 Outputs size       : 1
 - 0:FLOAT32:20:(1, 1, 1, 5)
 Used heap          : 224 bytes (max=224 bytes) (for c-wrapper 2.0)

Running PerfTest with random inputs (16 iterations)...
................

Results TFLM 2.10.0, 16 inferences @100MHz/100MHz
 duration     : 0.404 ms (average)
 CPU cycles   : 40404 (average)
 CPU Workload : 0% (duty cycle = 1s)
 used stack   : 380 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) max=0 cur=0 (cfg=3)

 Inference time by c-node
  kernel  : 0.393ms (time passed in the c-kernel fcts)
  user    : 0.022ms (time passed in the user cb)

 idx   name                      time (ms)
 ---------------------------------------------------
 0     FULLY_CONNECTED                0.305  77.75 %
 1     FULLY_CONNECTED                0.057  14.63 %
 2     FULLY_CONNECTED                0.017   4.45 %
 3     SOFTMAX                        0.012   3.17 %
 ---------------------------------------------------
                                      0.393 ms

AI Validation application

Warning

Validation on desktop is not supported.

Note

The ai_runner module for a Python-based environment can also be used with an STM32 aiValidation firmware based on the TFLm runtime.

If the STM32 is flashed with the TFLm aiValidation test application, the following validation flow can be used. It allows using the validate command with random data or user data to evaluate different metrics.

TFLm STM32 validation flow
$ stm32ai validate <tflite_file_model> --mode stm32
Neural Network Tools for STM32AI v1.6.0 (STM.ai v7.3.0)

Setting validation data...
 generating random data, size=10, seed=42, range=default
 I[1]: (10, 1, 1, 99)/float32, min/max=[0.005, 1.000], mean/std=[0.490, 0.292], dense_1_input
 No output/reference samples are provided

Running the STM AI c-model (AI RUNNER)...(name=network, mode=stm32)

 STM Proto-buffer protocol 2.2 (SERIAL:COM5:115200:connected) ['network']

 Summary "network" - ['network']
 --------------------------------------------------------------------------------
 inputs/outputs       : 1/1
 input_1              : (1,1,1,99), float32, 396 bytes, user, in activations buffer
 ouputs_1             : (1,1,1,5), float32, 20 bytes, user, in activations buffer
 n_nodes              : 4
 compile_datetime     : Sep  8 2022 12:17:09 (NULL)
 activations          : 1232
 weights              : 16956
 macc                 : n.a.
 --------------------------------------------------------------------------------
 runtime              : Protocol 2.3 - TFLM (/gcc) 2.10.0 (Tools 2.10.0)
 capabilities         : ['IO_ONLY']
 device               : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
                        art_prefetch,art_icache,art_dcache
 --------------------------------------------------------------------------------

 Warning: C-network signature checking has been skipped

 Results for 10 inference(s) - average per inference
  device              : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
                        art_prefetch,art_icache,art_dcache
  duration            : 0.395ms
  CPU cycles          : 39538
  cycles/MACC         : n.a.

Running the TFlite model...

Saving validation data...
 output directory: C:\local\ai_tests\non_reg\stm32ai_output
 creating C:\local\ai_tests\non_reg\stm32ai_output\network_val_io.npz
 m_outputs_1: (10, 1, 1, 5)/float32, min/max=[0.000, 0.998], mean/std=[0.200, 0.330], nl_3
 c_outputs_1: (10, 1, 1, 5)/float32, min/max=[0.000, 0.998], mean/std=[0.200, 0.330], nl_3

Computing the metrics...

 Cross accuracy report #1 (reference vs C-model)
 --------------------------------------------------------------------------------
 notes: - the output of the reference model is used as ground truth/reference value
        - 10 samples (5 items per sample)

  acc=100.00%, rmse=0.000000245, mae=0.000000093, l2r=0.000000634

  5 classes (10 samples)
  ---------------------------------
  C0        0    .    .    .    .
  C1        .    0    .    .    .
  C2        .    .    0    .    .
  C3        .    .    .    8    .
  C4        .    .    .    .    2

Evaluation report (summary)
------------------------------ ... ------------------------------------------------
Output     acc     rmse        ...  std         tensor
------------------------------ ... ------------------------------------------------
X-cross #1 100.00% 0.000000245 ...  0.000000247 nl_3, ai_float, (1,1,1,5), m_id=[3]
------------------------------ ... ------------------------------------------------

Creating txt report file <output-directory-path>\network_validate_report.txt
elapsed time (validate): 1.786s