ST Edge AI Developer Cloud
Introduction
The ST Edge AI Developer Cloud is a set of services that lets you optimize, quantize, benchmark, and deploy your trained neural network on STM32, Stellar-E, and ISPU targets.
The entry point of the service is a neural network model that you can directly upload in the tool or choose from a subset of the STM32 Model Zoo.
Your uploaded models are stored in a workspace accessible only by you and will be automatically deleted after 6 months of inactivity on the model.
To start from a model in the STM32 Model Zoo, first press Import to import the model in your workspace, then press Start on your selected model.
The flow consists of six steps:
- Model Selection: select your uploaded model or import a model from the subset of STM32 Model Zoo
- Quantize:
- For float Keras models, this step allows you to trigger the TFLite post-training quantization and create a quantized integer model
- For float ONNX models, this step allows you to trigger the ONNX Runtime post-training quantization and create a quantized integer model
- Optimize:
- For MCU targets: Allows you to choose the code generation optimization options
- For MPU targets: Allows you to generate an optimized model, dedicated to STM32MP2x with hardware acceleration
- Benchmark: Run your network with the selected optimization options on an STM32 board farm hosted by ST and get the inference time
- Results: List and compare all the benchmarks done
- Generate: Generate a project or just the neural network code on a selected STM32 target
Select a platform
When entering this flow, you will be prompted to select a given platform. This platform can be:
- STM32 MCUs: Start your session for STM32 Discovery Kits and Nucleos
- STM32 MCU with Neural-ART™: Start with STM32 MCUs including Neural-ART™ to accelerate your AI applications
- STM32 MPU: Start with STM32 Microprocessors embedding Cortex-A loaded with X-LINUX-AI
- Stellar Platforms : Start with Stellar platforms to empower neural network architectures on automotive MCUs
- MEMS Sensors with ISPU: Start with MEMS Sensors embedding ISPU, an ultralow power, computationally efficient, high-performance programmable core that can execute signal processing and AI algorithms in the edge
This platform selection enables different runtimes and optimizations for better performance based on your use case (general purpose, accelerated AI…)
Quantize
This panel can be used to create a quantized model (8b integer format) from a Keras float model or from an ONNX model.
Quantization (also called calibration) is an optimization technique that compresses a 32-bit floating-point model: it reduces the model size (smaller storage and lower peak memory usage at runtime) and improves CPU/MCU usage and latency (including power consumption), with a small degradation of accuracy. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values.
The quantize service uses both TensorFlow post-training quantization, through the interface offered by TFLiteConverter with full 8-bit integer quantization of weights and activations, and ONNX Runtime, which provides Python APIs for converting a 32-bit floating-point model to an 8-bit integer model.
The following code snippet illustrates the TFLiteConverter options used to enforce full integer post-training quantization for all operators, including the input/output tensors.
def representative_dataset_gen():
    data = tload(...)
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array in a method of your choosing.
        input = get_sample(data)
        yield [input]

converter = tf.lite.TFLiteConverter.from_keras_model_file(<keras_model_path>)
converter.representative_dataset = representative_dataset_gen
# This enables quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# This ensures that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# For full integer quantization, though supported types defaults to int8 only
converter.target_spec.supported_types = [tf.int8]
# These set the input and output tensors to uint8 (added in r2.3)
converter.inference_input_type = tf.uint8   # or tf.int8/tf.float32
converter.inference_output_type = tf.uint8  # or tf.int8/tf.float32
quant_model = converter.convert()

# Save the quantized file
with open(<tflite_quant_model_path>, "wb") as f:
    f.write(quant_model)
...
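For ONNX models, the service relies on ONNX Runtime post-training quantization. The snippet below is a minimal sketch of the equivalent offline flow (not the exact code run by the service), assuming a float model file model_fp32.onnx and a calibration set stored in the NPZ format described later (keys x_test/y_test); the file names and the NpzDataReader helper class are illustrative only.

import numpy as np
import onnxruntime
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class NpzDataReader(CalibrationDataReader):
    """Feeds calibration samples to the quantizer, one input dict per step."""
    def __init__(self, npz_path, model_path):
        data = np.load(npz_path)
        session = onnxruntime.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        input_name = session.get_inputs()[0].name
        # One sample per calibration step, with an explicit batch dimension
        self.samples = iter([{input_name: x[np.newaxis, ...].astype(np.float32)}
                             for x in data["x_test"]])

    def get_next(self):
        return next(self.samples, None)

reader = NpzDataReader("mydata.npz", "model_fp32.onnx")   # illustrative file names
quantize_static("model_fp32.onnx",   # float32 input model
                "model_int8.onnx",   # quantized output model
                reader,
                activation_type=QuantType.QInt8,
                weight_type=QuantType.QInt8)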
Quantization flow
- In the “Apply post-training quantization” box select your input and output type and optionally upload an input/output quantization dataset as described below
- Press the “Launch Quantization” button; the quantized model will appear in the “Quantized models” box below
- Select a model in the list of quantized models and press “Optimize selected quantized model” to get the Flash and Ram sizes according to the parameters in the Current parameters box
- You can then select a result and go to Benchmark
- To change the optimization parameters
- click on the “Change parameters” button in the “Current parameters” box, this will open the Optimize panel
- Select the desired quantized model in the model selection drop down
- Select your new optimization options and press Optimize
- The optimization results will be displayed in the History table. If your model is quantized, it will be flagged as quantized
- You can then select one optimization run and press “Go to Benchmark” to benchmark your model with the selected options
Post Training Quantization options
Input and Output type
Even if the model is int8 quantized, you can specify the format of the input and output buffers.
It is recommended to use int8 for the input and output buffers, but you can use uint8 or float32 to ease the integration in the final application.
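To double-check which formats were actually applied, you can inspect the quantized model with the TensorFlow Lite interpreter; this is a small sketch assuming the model was downloaded locally as quant_model.tflite (the file name is illustrative).

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="quant_model.tflite")
interpreter.allocate_tensors()
# Prints e.g. <class 'numpy.int8'> or <class 'numpy.uint8'> depending on the chosen options
print("input type :", interpreter.get_input_details()[0]["dtype"])
print("output type:", interpreter.get_output_details()[0]["dtype"])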
Optional NPZ Input/Output quantization dataset
If no data is provided, the quantization is done with random data. The resulting quantized model can then only be used for benchmarks, to get the needed Flash and RAM size as well as the inference time, but no accuracy will be calculated.
If you care about the resulting network accuracy, the quantization must be done using input and output data provided in a Numpy npz file.
For correct post-quantization accuracy, it is necessary to provide a representative set of data. For example, for a classifier you must provide a set of data balanced over all the labels to be classified.
The following code snippet illustrates a typical usage of the Keras ImageDataGenerator class to build the input/output quantization file, with data augmentation, from a well-defined image dataset.
...
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np
test_data_gen = ImageDataGenerator(rescale=1. / 255)
test_generator = test_data_gen.flow_from_directory('./test',
                                                   target_size=(224, 224),
                                                   batch_size=nb_test_files,
                                                   class_mode='categorical')
test_data = next(test_generator)
x_test, y_test = test_data

# or to have a simple file
np.savez("mydata.npz", x_test=x_test, y_test=y_test)
Optimize (MCU, Stellar, ISPU)
When entering this flow, a default optimization is done on the network using the following parameters:
- Balanced optimization
- Input and Output buffers allocated in the activation buffer
You can modify the default choice and press Optimize to see the impact.
Depending on the platform selected (MCU, Stellar, ISPU), you might observe differences in RAM and Flash consumption. This allows filtering of the boards or smart sensors available in the “Benchmark” step.
Balanced, Ram or Time optimization
This optimization option is used to indicate the objective of the optimization passes which are applied to deploy the c-model. Note that the accuracy/precision of the generated model is not impacted.
objective | description
---|---
time | Applies the optimization passes to reduce the inference time (or latency). In this case, the size of the used RAM (activations buffer) can be impacted.
ram | Applies the optimization passes to reduce the RAM used for the activations. In this case, the inference time can be impacted.
balanced | Trade-off between the 'time' and 'ram' objectives, similar to the default behavior of previous releases.
The following figure illustrates the usage of the optimization option. It is based on the 'Nucleo-H743ZI@480MHz' board with the small MLPerf Tiny quantized models from https://github.com/mlcommons/tiny/tree/master/benchmark/training.
Input/Output buffer in activation buffer
If selected, this option indicates that the “activations” buffer will also be used to handle the input/output buffers.
For the input buffer, it also implies that the data in the input buffer is lost at the end of the inference. For the output buffer, it implies that the next inference will overwrite the output result.
If not selected, the input/output buffers are allocated separately in the user memory space and can be reused by the application.
Depending on the size of the input data, the “activations” buffer may be bigger, but overall smaller than the sum of a separate activations buffer plus a separate input buffer.
This option only impacts memory usage; it has no impact on inference time.
Optimization results
At this stage the result table only contains the impact on the Flash and Ram needed to run the network. Inference time will be measured in the Benchmark step.
Click on a line of the table to get detailed information on your run and select the desired optimization options to go to the next steps.
Show graph button
This button allows you to run the web Netron application on the generated network and see a detailed memory allocation graph.
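If you prefer to inspect the model outside the browser session, the same viewer is also available as the netron Python package (pip install netron); the sketch below assumes a locally saved model file named network.tflite (the name is illustrative).

import netron

# Starts a local Netron server and opens the graph in the default browser
netron.start("network.tflite")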
Compare with reference button
This button will run Netron using the default option run and compare it with the current run. You’ll see the impact of the optimization on the topology or the memory allocation.
Go to Benchmark button
Uses the selected optimization options in the Benchmark step.
Optimize (MCU with Neural-ART™)
When entering this flow, a default optimization is done on the network using the following parameters:
- Default optimization
- Internal and external memories used, priority for internal memory, and 1MB reserved for code
You can modify the default choices and press Optimize to see the impact.
A submenu is available to configure compiler options. Experienced users can append specific flags in the “Extra command line interface arguments” text input.
Example: --atonnOptions.Oauto adds the Oauto option (auto optimization) for the compiler.
Speed, default and automatic configurations
This optimization option is used to indicate the objective of the optimization passes which are applied to deploy the c-model. Note that the accuracy/precision of the generated model is not impacted.
optimization | description
---|---
Default Configuration | Trade-off in order to consume less memory for a given inference time. Selected by default.
Automatic Configuration | Tries multiple configurations to find the best inference time and memory consumption. This can take a long time.
Memory pool settings
Memory pool settings can force the compiler to use specific memory regions to improve performance, or to leave memory regions available to end users for application code.
setting | description
---|---
Internal and external memories (1MB reserved for code) | All memory ranges, with 1MB reserved for the user application
Internal memories | Allocate only in internal memory
Manual | Select manually which memories are allowed for allocation. “0” means the memory will not be used.
Optimize (Microprocessors)
The optimization step for STM32 MPUs (supporting H/W acceleration) consists of executing the STM32 MPU Optimizer, which generates a “Network Binary Graph” (.nb). This enables the use of hardware accelerators, such as the Neural Processing Unit (NPU) and the Graphics Processing Unit (GPU), embedded in some STM32 MPU series.
Benchmark
This service allows you to run the selected network on several STMicroelectronics boards remotely and get the required internal or external Flash and Ram as well as the inference time.
You can choose the float or quantized network, or different optimization options, using the top combo box.
STMicroelectronics boards and smart-sensors are hosted in ST premises and will be available via a waiting queue.
The MCU clock frequency is set at the maximum possible for that board using the default option of the board. For example on the STM32H747I-DISCO the maximum clock frequency is 400MHz because the power supply is configured to use the internal SMPS. The mounted STM32H747XIHx MCU can go up to 480MHz but with a different configuration of the power supply that will require a physical change on the board.
On the same board, STM32H747I-DISCO, the MCU has 1MB of RAM and 2MB of FLASH but the MCU is a double core MCU. The benchmark service will run the benchmark on the Arm® Cortex®-M7 and only 512KB of RAM and 1MB of FLASH is allocated to the Cortex-M7.
If the network is too big to fit in internal Flash or Ram, it will be automatically balanced between internal and external memories if available.
You can run a maximum of 10 benchmarks in parallel. If the board is already processing another benchmark request, you will be placed in a waiting queue and kept updated on your position in the queue.
Once you have performed the benchmark, you can directly select this board for the code generation using the board menu.
Results
The results page displays a table of all the benchmarks done with the ST Edge AI Developer Cloud.
The performance summary graph allows you to compare all the runs.
You can directly select a benchmark and select it for the code generation part.
Generate (MCU, Stellar, ISPU)
The Generate page can generate everything you need: just update an existing project, create a new STM32CubeMX project, create an STM32CubeIDE project with all the sources included, or simply download a compiled firmware to estimate the inference time on your board.
The first step is to select your board based on your criteria (if not already selected from the Benchmark or Results page).
Then you have multiple generation options:
- Download C Code
- Download Firmware
For MCUs:
- Download STM32CubeMX IOC file
- Download STM32CubeIDE Project
For Stellar and ISPUs:
- Download Makefile project
Download C Code
Generates a zip with all the network c and h files as well as the stm32ai command line reports and output.
You can then copy the files in your project to replace the existing ones.
Download STM32CubeMX IOC file
Generates a zip with an STM32CubeMX project already configured and with the code generated.
- The board is the selected board
- X-CUBE-AI is activated in the project and SystemPerformance application is selected
- The neural network is configured in the project and is available in the zip
- The USART needed for system performance is already configured
Just open the project with STM32CubeMX v6.6.1 or above and either generate the code directly or start adding the peripherals you need for your application.
Download STM32CubeIDE Project
Generates a zip with a STM32CubeIDE project already configured and generated:
- The board is the selected board
- X-CUBE-AI is activated in the project and SystemPerformance application is selected
- The neural network is configured in the project and is available in the zip
- The USART needed for system performance application is already configured
- the STM32CubeIDE project and all the code and libraries are generated
Just open the project with STM32CubeIDE 1.10.0 or above and you can compile and flash the program on your board.
Download Firmware
Generates an ELF file that can be directly flashed on your board, using STM32CubeProgrammer for STM32 MCUs and ISPU, or OpenOCD for Stellar. The System Performance application is enabled in this firmware.
The AI system performance application is a self-contained, bare-metal, on-device application that allows out-of-the-box measurement of the critical system integration aspects of the generated NN. The accuracy performance aspect is not and cannot be considered here. The reported measurements are:
- CPU cycles per inference (duration in ms, CPU cycles, CPU workload)
- Used stack and used heap (in bytes)
Download Makefile project
Generates a zip with a Makefile project already configured and generated. Given your user parameters, it generates a template with C-code associated to your model.
How to get output from the System Performance application running on the STM32 board
Execute the following series of steps in sequence to run the application:
- Open and configure a host serial terminal console connected via a COM port (usually supported by a Virtual COM port over a USB connection, such as an ST-LINK/V2 feature).
- Set the COM setting
- 115200 bauds
- 8 bits
- 1 stop bit
- No parity
- Reset the board to launch the application
The application embeds a minimal interactive console, which supports the following commands:
Possible key for the interactive console:
[q,Q] quit the application
[r,R] re-start (NN de-init and re-init)
[p,P] pause the main loop
[h,H,?] this information
xx continue immediately
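Instead of a terminal application, the serial output can also be captured from a short script; this is a minimal sketch using the pyserial package (pip install pyserial) with the COM settings listed above, where "COM3" is a placeholder for your ST-LINK Virtual COM port.

import serial

# 115200 bauds, 8 data bits, no parity, 1 stop bit
with serial.Serial("COM3", baudrate=115200, bytesize=8, parity="N", stopbits=1, timeout=1) as port:
    while True:
        line = port.readline().decode(errors="replace").rstrip()
        if line:
            print(line)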
Generate (Microprocessors)
Download Optimized Network Binary
Downloads the generated Network Binary Graph (.nb) previously optimized in the optimization step
Download original model
Downloads the model selected in “Model” dropdown. Can be the one originally uploaded by you or the output of a quantization step
Deployment architecture and data protection
The ST Edge AI Developer Cloud is deployed on Azure using the following high-level infrastructure.
Customer External access and data upload
External access to the service always goes through a firewall, a load balancer, and a route dispatcher. All accesses are performed over encrypted, secure HTTPS.
All users are authenticated using my.st.com authentication.
There is no direct access to the internal Azure services or to the uploaded resources.
Uploaded data is checked for malicious content.
Model and Data storage
Uploaded models are stored in an Azure storage service and are accessible only by the user who uploaded the model, the stm32ai microservices, and the benchmark farm, for the purpose of their service.
Models are automatically deleted after 6 months of inactivity.
Uploaded data is kept only for the time of the action.
Access to the storage is only allowed through private endpoints that are not visible outside the DMZ.
ST hosted benchmark farm
The benchmark farm servers are dedicated servers with a dedicated tunnel to access the Azure services through the ST outbound firewall and the Azure services inbound firewall.
Customer models are used on the benchmark farm only for the purpose of the benchmark and are deleted upon completion.