TensorFlow Lite for Microcontrollers support
for the STM32 target, based on ST Edge AI Core Technology 2.2.0
r3.1
Overview
X-CUBE-AI solution
The X-CUBE-AI Expansion Package integrates a specific path that generates a ready-to-use STM32 IDE project embedding a [TensorFlow Lite for Microcontrollers][REF_TFLM] runtime (also called TFLm in this article) and its associated TFLite model. This can be considered as an alternative to the default X-CUBE-AI solution to deploy an AI application based on a TFLite model.
Thanks to STM32CubeMX and its ecosystem, the user can create and configure, in one click, a complete STM32 firmware including the different middleware/drivers and the sources of the TFLm interpreter and associated kernels. These sources are directly added to the exported source tree, and the build system is configured with speed-optimized options.
- The aiSystemPerformance and aiValidation test applications can be used to evaluate the performance of the deployed TFLite model. The aiTemplate application is also available as a starting point to develop an application.
- All STM32 IDE toolchains are supported:
  - STMicroelectronics - STM32CubeIDE version 1.0.1 or later
  - Keil® - MDK-ARM Professional Version - µVision® with Arm Compiler v6
  - IAR Systems - IAR Embedded Workbench® IDE - ARM v8.x
  - GNU Arm Embedded toolchain
- Only the required TFLm source files (.h and .cc files) from the official https://github.com/tensorflow/tflite-micro GitHub repository (sha-1 = 6a1803, main branch) are included in the X-CUBE-AI pack. The pack also contains the CMSIS files (in particular the CMSIS-NN files) required to implement the optimized version of some operators. The version of the source files is aligned with the version of the TensorFlow Python module used by the X-CUBE-AI pack.
- The TFLm importer module processes the TFLite file to retrieve the list of used TFLite operators and to estimate the required 'arena' size. This information is used to generate the simple C-wrapper files, which can be used directly (or adapted) by the client application.
- no command-line option is available to generate the STM32 IDE project.
TFLite file?
The TensorFlow Lite framework is used to deploy a deep learning model on mobile and embedded devices. The generated TFLite file is a self-contained file containing a frozen description of the graph (or inference model), the settings of the operators, and the tensors (including the data). 32-bit floating-point and quantized models are supported.
This file is used directly by a runtime interpreter, see [TensorFlow Lite for Microcontrollers][REF_TFLM], or as the entry point for a compiler such as the Coral Edge TPU compiler, or a code generator such as X-CUBE-AI, to create an adapted and optimized version targeting particular hardware: MPU, MCU, or hardware-assist IP.
From a functional point of view, the content of the TFLite file is similar to the generated <network>.c and <network>_data.c files. The implementation of the kernels is provided separately: as a static library for the default X-CUBE-AI solution, or by an embedded runtime (interpreter approach) for the TFLm solution.
TensorFlow Lite Micro stack
The following figure illustrates a typical TFLm stack. The 'tflite::MicroInterpreter' class is used by the client application to create and use an instance of the model. The 'arena' buffer is required to allocate the internal, input, and output tensors and the associated data structures used to manage the instance (see the “Memory Management in TensorFlow Lite Micro” article for more details). No system heap is required. The model itself should be passed as a simple memory-mapped c-array. The “Get started with microcontrollers” article explains step by step how to train a model and run inference with the C++ API (hello-world example).
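As an illustration, a minimal sketch of this flow based on the public TFLm C++ API is shown below. The model and buffer names are illustrative, and the exact 'tflite::MicroInterpreter' constructor signature varies slightly between TFLm versions:

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

/* Memory-mapped c-array holding the TFLite file
   (illustrative name, generated by X-CUBE-AI in network.c) */
extern const unsigned char g_model_data[];

/* 'arena' sized for the model; X-CUBE-AI pre-calculates this value */
constexpr size_t kArenaSize = 20 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

int run_inference(const float *in, size_t n_in, float *out, size_t n_out)
{
  const tflite::Model *model = tflite::GetModel(g_model_data);

  /* Register only the operators used by the model (see "Resolver" below) */
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  /* All tensors and management structures are allocated inside the
     'arena': no system heap is used. */
  static tflite::MicroInterpreter interpreter(model, resolver,
                                              tensor_arena, kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk)
    return -1;

  for (size_t i = 0; i < n_in; ++i)
    interpreter.input(0)->data.f[i] = in[i];

  if (interpreter.Invoke() != kTfLiteOk)
    return -1;

  for (size_t i = 0; i < n_out; ++i)
    out[i] = interpreter.output(0)->data.f[i];
  return 0;
}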
Resolver
The resolver module can be considered as a registry used by the interpreter to access the operators required by the model. By default, a generic resolver can be declared to support all available operators. In this case, the code of all the operators is embedded in the firmware.
static tflite::AllOpsResolver resolver;
To avoid this situation, it is possible to declare a specific resolver and to register only the required operators. As for the size of the 'arena' buffer, the X-CUBE-AI TFLm importer generates this list automatically, to optimize the required flash size (see the 'tflm_c.cc' file).
static tflite::MicroMutableOpResolver<5> resolver;
resolver.AddConv2D();
resolver.AddDepthwiseConv2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddMaxPool2D();
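Note that the template parameter of 'MicroMutableOpResolver' (here 5) defines the maximum number of operators that can be registered; it must be greater than or equal to the number of Add...() calls.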
Supported operators
As for the TFLite operators [supported][X_CUBE_AI_TFLITE_TOOLBOX] by X-CUBE-AI, not all TFLite operators are supported by the TFLm runtime. The list of supported operators can be found in the following file: tensorflow\lite\micro\kernels\micro_ops.h
Operator (tflite namespace) | [X-CUBE-AI][X_CUBE_AI_TFLITE_TOOLBOX] support |
---|---|
ABS | yes |
ADD | yes |
ADD_N | - |
ARG_MAX | yes |
ARG_MIN | yes |
ASSIGN_VARIABLE | - |
AVERAGE_POOL_2D | yes |
BATCH_TO_SPACE_ND | yes |
BROADCAST_ARGS | - |
BROADCAST_TO | - |
CALL_ONCE | - |
CAST | yes |
CEIL | yes |
CIRCULAR_BUFFER | - |
CUMSUM | - |
CONCATENATION | yes |
CONV_2D | yes |
COS | yes |
DEPTH_TO_SPACE | - |
DEPTHWISE_CONV_2D | yes |
DEQUANTIZE | yes (including uint8 type) |
DIV | yes |
ELU | yes |
EQUAL | yes |
ETHOSU | yes |
EXP | yes |
EXPAND_DIMS | yes |
FILL | yes |
FLOOR | yes |
FLOOR_DIV | yes |
FLOOR_MOD | yes |
FULLY_CONNECTED | yes |
GATHER | yes |
GATHER_ND | - |
GREATER | yes |
GREATER_EQUAL | yes |
HARD_SWISH | yes |
IF | - |
L2_NORMALIZATION | yes |
L2_POOL_2D | - |
LEAKY_RELU | yes |
LESS | yes |
LESS_EQUAL | yes |
LOG | yes |
LOGICAL_AND | yes |
LOGICAL_NOT | yes |
LOGICAL_OR | yes |
LOG_SOFTMAX | yes |
LOGISTIC | yes |
MAX_POOL_2D | yes |
MAXIMUM | yes |
MEAN | yes |
MINIMUM | yes |
MIRROR_PAD | yes |
MUL | yes |
NEG | yes |
NOT_EQUAL | yes |
PACK | yes |
PAD | yes |
PADV2 | yes |
PRELU | yes |
QUANTIZE | yes (including uint8 type) |
READ_VARIABLE | - |
REDUCE_MAX | yes |
RELU | yes |
RELU6 | yes |
RESIZE_BILINEAR | yes |
RESIZE_NEAREST_NEIGHBOR | yes |
RSQRT | yes |
SELECT_V2 | yes |
SHAPE | yes |
SIN | yes |
SLICE | yes |
SOFTMAX | yes |
SPACE_TO_BATCH_ND | yes |
SPACE_TO_DEPTH | yes |
SPLIT | yes |
SPLIT_V | yes |
SQRT | yes |
SQUARE | yes |
SQUARED_DIFFERENCE | yes |
SQUEEZE | yes |
STRIDED_SLICE | yes |
SUB | yes |
SUM | yes |
SVDF | - |
TANH | yes |
TRANSPOSE | yes |
TRANSPOSE_CONV | yes |
UNIDIRECTIONAL_SEQUENCE_LSTM | yes (float only) |
UNPACK | yes |
VAR_HANDLE | - |
WHILE | - |
ZEROS_LIKE | - |
Operator (tflite.ops.micro namespace) | [X-CUBE-AI][X_CUBE_AI_TFLITE_TOOLBOX] support |
---|---|
RESHAPE | yes |
ROUND | yes |
Optimized CMSIS-NN-based operators are exported by X-CUBE-AI (tensorflow\lite\micro\kernels\cmsis_nn\ directory). This is equivalent to using the 'TAGS=cmsis-nn' option to build the library with the original TFLm build system.
ADD, CONV_2D, DEPTHWISE_CONV_2D, FULLY_CONNECTED, MUL, AVERAGE_POOL_2D,
MAX_POOL_2D, SOFTMAX, SVDF
Note
The source files of the default version of the optimized operators are not available in the X-CUBE-AI pack.
Generated files
X-CUBE-AI exports a set of additional files, which can be considered as helper files to facilitate the usage of the TFLm stack. Note that the exported TFLm files respect the source tree of the original repository.
%root_project_directory%
|-- Middlewares
| \_ tensorflow /* TensorFlow lite for micro files */
| |_ tensorflow
| | \_ lite ..
| | |- c
| | |- core
| | |- kernels
| | \ ..
| \_ third_party ..
| |- cmsis
| \ ..
|
|-- X-CUBE-AI -- App /* Generated files (c-wrapper) */
| |- network.c
.. |- network_tflite_data.h
|- tflm_c.cc
|- tflm_c.h
|- debug_log_imp.cc
..
The tflm_c.h file provides an optional light TFLite inference C-API on top of the TFLite interpreter C++ API. It provides the required services to initialize a model and to use it. For debug/profiling purposes, a profiler API is also provided.
file | description |
---|---|
network.c | contains the c-array representation of the TFLite file |
network_tflite_data.h | contains the pre-calculated 'arena' size: TFLM_NETWORK_TENSOR_AREA_SIZE |
debug_log_imp.cc | implements the DebugLog() function which is required by the TFLm files |
tflm_c.cc | implements the c-wrapper. Most of it is generic; only the creation of the specialized resolver is model-specific. The global 'TFLM_RUNTIME_USE_ALL_OPERATORS' C-define can be set to '1' to use the 'AllOpsResolver' object |
tflm_c.h | c-wrapper definition |
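As an illustration, the sketch below shows how these generated files fit together from the client application. The wrapper function names ('tflm_c_create', 'tflm_c_invoke') are hypothetical placeholders; refer to the generated tflm_c.h for the actual prototypes.

#include "network_tflite_data.h"  /* TFLM_NETWORK_TENSOR_AREA_SIZE */
#include "tflm_c.h"               /* light C-API on top of the TFLm C++ API */

/* 'arena' buffer sized with the pre-calculated value */
static uint8_t tensor_arena[TFLM_NETWORK_TENSOR_AREA_SIZE];

/* c-array representation of the TFLite file
   (illustrative name, see network.c) */
extern const unsigned char g_network_model_data[];

void example(void)
{
  uint32_t hdl;

  /* hypothetical wrapper calls -- actual prototypes are in tflm_c.h */
  tflm_c_create(g_network_model_data, tensor_arena,
                TFLM_NETWORK_TENSOR_AREA_SIZE, &hdl);

  /* ... fill the input tensor(s), then run the inference ... */
  tflm_c_invoke(hdl);
}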
AI System Performance application
The AI System Performance test application allows evaluating the inference time. Random data are used to feed the model; the outputs are skipped. The reported inference time ('duration') is an average over 16 inferences. The effectively used 'arena' size ('Allocated size') is also reported.
Note
With a GCC-based project, the following linker options should be used to also monitor the used stack and the used heap during the execution of an inference:
-u _printf_float -Wl,--wrap=malloc -Wl,--wrap=free
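For reference, the '--wrap' options redirect each malloc()/free() call to user-provided __wrap_malloc()/__wrap_free() functions, which can record a heap high-water mark. Below is a minimal sketch of such instrumentation, assuming a newlib-based toolchain providing malloc_usable_size(); the instrumentation embedded in the test application may differ.

#include <malloc.h>   /* newlib: malloc_usable_size() */
#include <stddef.h>

extern void *__real_malloc(size_t size);
extern void  __real_free(void *ptr);

static size_t heap_in_use;   /* currently allocated bytes */
static size_t heap_max;      /* high-water mark */

void *__wrap_malloc(size_t size)
{
  void *ptr = __real_malloc(size);
  if (ptr != NULL) {
    heap_in_use += malloc_usable_size(ptr);
    if (heap_in_use > heap_max)
      heap_max = heap_in_use;
  }
  return ptr;
}

void __wrap_free(void *ptr)
{
  if (ptr != NULL)
    heap_in_use -= malloc_usable_size(ptr);
  __real_free(ptr);
}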
#
# AI system performance measurement TFLM 2.1
#
Compiled with GCC 9.3.1
STM32 Runtime configuration...
Device : DevID:0x0431 (STM32F411xC/E) RevID:0x1000
Core Arch. : M4 - FPU used
HAL version : 0x01070600
system clock : 100 MHz
FLASH conf. : ACR=0x00000703 - Prefetch=True $I/$D=(True,True) latency=3
Timestamp : SysTick + DWT (HAL_Delay(1)=1.004 ms)
Instancing the network.. (cWrapper: v2.0)
TFLM version : 2.10.0
TFLite file : 0x0801bae8 (16956 bytes)
Arena location : 0x200000d0
Opcode size : 2
Operator size : 4
Tensor size : 11
Allocated size : 1232 / 20480
Inputs size : 1
- 0:FLOAT32:396:(1, 1, 1, 99)
Outputs size : 1
- 0:FLOAT32:20:(1, 1, 1, 5)
Used heap : 224 bytes (max=224 bytes) (for c-wrapper 2.0)
Running PerfTest with random inputs (16 iterations)...
................
Results TFLM 2.10.0, 16 inferences @100MHz/100MHz
duration : 0.404 ms (average)
CPU cycles : 40404 (average)
CPU Workload : 0% (duty cycle = 1s)
used stack : 380 bytes
used heap : 0:0 0:0 (req:allocated,req:released) max=0 cur=0 (cfg=3)
Inference time by c-node
kernel : 0.393ms (time passed in the c-kernel fcts)
user : 0.022ms (time passed in the user cb)
idx name time (ms)
---------------------------------------------------
0 FULLY_CONNECTED 0.305 77.75 %
1 FULLY_CONNECTED 0.057 14.63 %
2 FULLY_CONNECTED 0.017 4.45 %
3 SOFTMAX 0.012 3.17 %
---------------------------------------------------
0.393 ms
AI Validation application
Warning
Validation on desktop is not supported.
Note
The ai_runner module for a Python-based environment can also be used with an STM32 aiValidation firmware based on the TFLm runtime.
If the STM32 is flashed with the TFLm aiValidation test application, the following validation flow can be used. It allows using the validate command with random data or user data to evaluate different metrics.
$ stm32ai validate <tflite_file_model> --mode stm32
Neural Network Tools for STM32AI v1.6.0 (STM.ai v7.3.0)
Setting validation data...
generating random data, size=10, seed=42, range=default
I[1]: (10, 1, 1, 99)/float32, min/max=[0.005, 1.000], mean/std=[0.490, 0.292], dense_1_input
No output/reference samples are provided
Running the STM AI c-model (AI RUNNER)...(name=network, mode=stm32)
STM Proto-buffer protocol 2.2 (SERIAL:COM5:115200:connected) ['network']
Summary "network" - ['network']
--------------------------------------------------------------------------------
inputs/outputs : 1/1
input_1 : (1,1,1,99), float32, 396 bytes, user, in activations buffer
ouputs_1 : (1,1,1,5), float32, 20 bytes, user, in activations buffer
n_nodes : 4
compile_datetime : Sep 8 2022 12:17:09 (NULL)
activations : 1232
weights : 16956
macc : n.a.
--------------------------------------------------------------------------------
runtime : Protocol 2.3 - TFLM (/gcc) 2.10.0 (Tools 2.10.0)
capabilities : ['IO_ONLY']
device : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
art_prefetch,art_icache,art_dcache
--------------------------------------------------------------------------------
Warning: C-network signature checking has been skipped
Results for 10 inference(s) - average per inference
device : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
art_prefetch,art_icache,art_dcache
duration : 0.395ms
CPU cycles : 39538
cycles/MACC : n.a.
Running the TFlite model...
Saving validation data...
output directory: C:\local\ai_tests\non_reg\stm32ai_output
creating C:\local\ai_tests\non_reg\stm32ai_output\network_val_io.npz
m_outputs_1: (10, 1, 1, 5)/float32, min/max=[0.000, 0.998], mean/std=[0.200, 0.330], nl_3
c_outputs_1: (10, 1, 1, 5)/float32, min/max=[0.000, 0.998], mean/std=[0.200, 0.330], nl_3
Computing the metrics...
Cross accuracy report #1 (reference vs C-model)
--------------------------------------------------------------------------------
notes: - the output of the reference model is used as ground truth/reference value
- 10 samples (5 items per sample)
acc=100.00%, rmse=0.000000245, mae=0.000000093, l2r=0.000000634
5 classes (10 samples)
---------------------------------
C0 0 . . . .
C1 . 0 . . .
C2 . . 0 . .
C3 . . . 8 .
C4 . . . . 2
Evaluation report (summary)
------------------------------ ... ------------------------------------------------
Output acc rmse ... std tensor
------------------------------ ... ------------------------------------------------
X-cross #1 100.00% 0.000000245 ... 0.000000247 nl_3, ai_float, (1,1,1,5), m_id=[3]
------------------------------ ... ------------------------------------------------
Creating txt report file <output-directory-path>\network_validate_report.txt
elapsed time (validate): 1.786s