Quantized model support
ST Edge AI Core Technology 2.2.0
r1.3
References
[1] “A Survey of Quantization Methods for Efficient Neural Network Inference” - https://arxiv.org/abs/2103.13630v3
[2] “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” - https://arxiv.org/abs/1712.05877
[3] “Quantization and benchmarking of deep learning models using ONNX Runtime and STM32Cube.AI Developer Cloud” - https://github.com/STMicroelectronics/stm32ai-modelzoo-services/tree/main/tutorials/notebooks
Overview
ST Edge AI Core CLI can be used to deploy a quantized model. This article covers only 8-bit integer quantized models. Support for lower-than-8-bit quantization is detailed in the “Deep Quantized Neural Network (DQNN) support” article.
Quantization is an optimization technique [1] used to compress a 32-bit floating-point model. This compression reduces the model size, decreases memory usage at runtime, and improves processing efficiency, latency, and power consumption, with only a minor loss in accuracy. In a quantized model, some or all operations are performed on tensors using integers instead of floating-point values. Quantization is a key component of various optimization strategies, including topology-oriented optimization, feature-map reduction, pruning, and weight compression, which are crucial for deploying models in resource-constrained environments.
There are two primary methods of quantization:
- Post-Training Quantization (PTQ): This method is easier to use. It allows you to quantize a pretrained model using a limited, representative dataset (also known as a calibration dataset).
- Quantization-Aware Training (QAT): This method is integrated into the training process and generally results in better model accuracy.
Supported file format
ST Edge AI Core can import different formats of quantized models:
- The quantized TensorFlow Lite models generated by a post-training or quantization-aware training process. The calibration is performed by the TensorFlow Lite framework, principally through the “TFLite converter” utility, which exports a TensorFlow Lite file.
- The quantized ONNX models based on the operator-oriented (QOperator) or the tensor-oriented (QDQ; Quantize and DeQuantize) format. The first format depends on the supported QOperators (see the “QLinearXXX* operators”) while the second is more generic. The DeQuantizeLinear(QuantizeLinear(tensor)) operators are inserted between the original operators (in float) to simulate the quantization and dequantization process. Both formats can be generated with the ONNX runtime module.
For performance reasons (memory peak usage and processing capabilities), the deployed kernels support only quantized weights and quantized activations (“integer only quantization” mode). If this pattern is not detected by the optimizing and rendering passes of the code generator, a fallback to a floating-point version of the operator is used. This fallback involves inserting QUANTIZE/DEQUANTIZE operators, a process known as fake, or simulated, quantization. Consequently:
- The dynamic quantization approach, in which the quantization parameters are determined dynamically during inference, is NOT supported. If the user provides this type of model, a fully float32 implementation is generated.
- Mixed or hybrid models are supported.
The “analyze”, “validate” and “generate” commands can be used without limitations.
    $ stedgeai analyze -m <quantized_model_file>.tflite --target stm32
    $ stedgeai validate -m <quantized_model>.tflite --target stm32 -vi test_data.npz
    $ stedgeai analyze -m <quantized_model_file>.onnx --target stm32
Privileged quantization scheme
target | requested quantization scheme |
---|---|
stm32xx devices based on Arm Cortex®-M processor | ss/sa per channel, binary quantization (other 8-bit quantization schemes including per-tensor are supported but not necessarily optimized) |
stm32xx with ST Neural-ART IP | ss/sa per channel |
stellar-e devices based on Arm Cortex®-M7 processor | ss/sa per channel, binary quantization (other 8-bit quantization schemes including per-tensor are supported but not necessarily optimized) |
stellar-pg[xx] devices based on Arm Cortex®-R52+ and Arm Cortex®-M4 processors | ss/sa per channel, binary quantization (other 8-bit quantization schemes are supported but not necessarily optimized) |
ISPU target | ss/sa per channel, binary quantization |
MLC target | see “ST Edge AI Core for MLC” article |
ISPU target | see “ST Edge AI Core for ISPU” article |
stm32mp target | see “ST Edge AI Core for STM32MPU series” article |
Quantized tensors ([1], [2])
ST Edge AI Core supports 8-bit integer-based (int8 or uint8 data type) arithmetic for quantized tensors, which are based on the representative convention used by Google for quantized models ([2]). Each real number r is represented as a function of the quantized value q, a scale factor (an arbitrary positive real number), and a zero_point parameter. The quantization scheme is an affine mapping of the integers q to real numbers r. zero_point has the same integer C-type as the q data.
The precision depends on the scale factor, and the quantized values are linearly distributed around the zero_point value. In both cases (int8 or uint8), the resolution/precision is constant over the whole range, unlike the floating-point representation.
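For illustration, the following is a minimal NumPy sketch of this affine mapping and its inverse for an int8 tensor (an example only, not the generated kernel code; the scale and zero_point values are arbitrary):

    import numpy as np

    def quantize(r, scale, zero_point):
        # q = round(r / scale) + zero_point, clamped to the int8 range
        q = np.round(r / scale) + zero_point
        return np.clip(q, -128, 127).astype(np.int8)

    def dequantize(q, scale, zero_point):
        # r ~= scale * (q - zero_point)
        return scale * (q.astype(np.int32) - zero_point)

    r = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
    scale, zero_point = 0.008, 3  # example quantization parameters
    print(dequantize(quantize(r, scale, zero_point), scale, zero_point))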
Per-axis vs per-tensor quantization
In per-tensor (or layer-wise) quantization, the same format (that is, scale/zero_point) is used for the entire tensor (weights or activations). In contrast, in per-axis (or per-channel, channel-wise) quantization, used in convolution-based operators, there is one scale and/or zero_point per filter, independent of the other channels, ensuring better quantization (from an accuracy point of view) with negligible computation overhead.
The per-axis approach is currently the standard method for quantized convolutional kernels (weight tensors), while activation tensors are always quantized per-tensor. This approach mainly drives the design of the optimized C-kernels.
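As an illustration, the following NumPy sketch (an assumption: the output channels are on the first axis of the weight tensor) contrasts the single per-tensor scale with the per-channel scales of a convolution kernel:

    import numpy as np

    weights = np.random.randn(16, 3, 3, 3).astype(np.float32)  # (out_channels, in_channels, kh, kw)

    # Per-tensor: a single symmetric scale derived from the whole tensor
    per_tensor_scale = np.abs(weights).max() / 127.0

    # Per-axis (per-channel): one symmetric scale per filter (output channel)
    per_channel_scales = np.abs(weights).reshape(weights.shape[0], -1).max(axis=1) / 127.0

    print(per_tensor_scale, per_channel_scales.shape)  # one scalar vs. 16 scales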
Symmetric vs Asymmetric
In asymmetric quantization, the tensor can have zero_point anywhere within the signed 8-bit range [-128, 127] or unsigned 8-bit range [0, 255]. In contrast, in symmetric quantization, the tensor is forced to have zero_point equal to zero. By enforcing zero_point to zero, some kernel implementation optimizations are possible to limit the cost of operations (offline precalculation).
In TFLite and ONNX quantized models, the weights/bias are always symmetrically quantized; the asymmetric format for the weights/bias is therefore not supported. For activations, both asymmetric and symmetric formats are supported.
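As a simple illustration (a sketch only, not the exact calibration algorithm used by the TFLite or ONNX quantizers), int8 parameters can be derived from the observed [min, max] range of a tensor as follows:

    import numpy as np

    def asymmetric_params(t_min, t_max):
        # Asymmetric int8: the zero_point can be anywhere in [-128, 127]
        scale = (t_max - t_min) / 255.0
        zero_point = int(np.round(-128 - t_min / scale))
        return scale, zero_point

    def symmetric_params(t_min, t_max):
        # Symmetric int8: the zero_point is forced to 0
        scale = max(abs(t_min), abs(t_max)) / 127.0
        return scale, 0

    print(asymmetric_params(-0.2, 6.0))  # for example, a ReLU6-like activation range
    print(symmetric_params(-0.2, 6.0))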
Signed integer vs Unsigned integer - supported schemes
Implementations do not support all possible combinations of type and symmetric/asymmetric format for the weights and/or activations, even though you can define signed or unsigned integer types for them. Only the following integer schemes or combinations are supported:
scheme | weights | activations |
---|---|---|
ua/ua | unsigned and asymmetric | unsigned and asymmetric |
ss/sa | signed and symmetric | signed and asymmetric |
ss/ua | signed and symmetric | unsigned and asymmetric |
Quantized TensorFlow models
ST Edge AI Core is able to import quantization-aware training and post-training quantized TensorFlow Lite models. Earlier quantization-aware trained models were based on the “ua/ua” scheme. Now, TensorFlow v1.15 or v2.x quantized models are based on the “ss/sa” and per-channel scheme: activations are asymmetric and signed (int8), and weights/bias are symmetric and signed (int8). This scheme is also the privileged scheme to efficiently address the Coral Edge TPUs or the TensorFlow Lite for Microcontrollers runtime.
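To verify which scheme a given quantized .tflite file actually uses, the per-tensor quantization parameters can be inspected with the TensorFlow Lite interpreter (a small sketch; the model file name is an assumption):

    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")  # hypothetical file
    interpreter.allocate_tensors()

    for detail in interpreter.get_tensor_details():
        q_params = detail["quantization_parameters"]
        if q_params["scales"].size:  # skip the non-quantized tensors
            mode = "per-channel" if q_params["scales"].size > 1 else "per-tensor"
            print(detail["name"], detail["dtype"], mode, q_params["zero_points"][:1])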
Supported/recommended methods
- Post-training quantization: https://www.tensorflow.org/lite/performance/post_training_quantization
- Quantization-aware training: https://www.tensorflow.org/model_optimization/guide/quantization/training
method/option | supported/recommended |
---|---|
Dynamic range quantization | not supported, only static approach is considered |
Full integer quantization | supported, representative dataset should be used for the calibration |
Integer with float fallback (using default float input/output) | supported, mixed model, representative dataset should be used for the calibration |
Integer only | recommended, representative dataset should be used for the calibration |
Weight Only Quantization | not supported |
Input/output data type | 'uint8', 'int8', and 'float32' can be used |
Float16 quantization | not supported |
Per Channel | default behavior can not be modified |
Activation type | 'int8', “ss/sa” scheme |
Weight type | 'int8', “ss/sa” scheme |
“Integer only” method
The following code snippet illustrates the recommended TFLiteConverter options to enforce a full integer scheme (post-training quantization for all operators, including the input/output tensors).
    def representative_dataset_gen():
        data = load(...)  # load the representative (calibration) data

        for _ in range(num_calibration_steps):
            # Get sample input data as a numpy array in a method of your choosing.
            input = get_sample(data)
            yield [input]

    converter = tf.lite.TFLiteConverter.from_saved_model(<saved_model_dir>)
    # converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.representative_dataset = representative_dataset_gen
    # This enables quantization
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # This ensures that if any ops can't be quantized, the converter throws an error
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # These set the input and output tensors to int8
    converter.inference_input_type = tf.int8   # or tf.uint8
    converter.inference_output_type = tf.int8  # or tf.uint8
    quant_model = converter.convert()

    # Save the quantized file
    with open(<tflite_quant_model_path>, "wb") as f:
        f.write(quant_model)
    ...
“Full integer quantization” method
As mixed models are supported by ST Edge AI Core, the full integer quantization method can be used. However, it is preferable to enable the TensorFlow Lite ops (TFLITE_BUILTINS), as only a limited number of TensorFlow operators are supported (see the supported [TFLITE] operators).
    ...
    converter = tf.lite.TFLiteConverter.from_saved_model(<saved_model_dir>)
    # converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.representative_dataset = representative_dataset_gen
    # This enables quantization
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # This optional setting ensures that TF Lite operators are used.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
    # These set the input and output tensors to float32
    converter.inference_input_type = tf.float32   # or tf.uint8, tf.int8
    converter.inference_output_type = tf.float32  # or tf.uint8, tf.int8
    quant_model = converter.convert()
    ...
Tip
Quantization of the input and/or output tensors is optional. They can be kept in float for convenience and ease of deployment, for example to keep the pre- and/or postprocessing stages in float.
Warning
The 'tf.lite.Optimize.DEFAULT' option enables the quantization process. However, to be sure to obtain quantized weights and quantized activations, the 'representative_dataset' attribute should always be set. Otherwise, only the weights/params are quantized, which reduces the size of the generated file by a factor of ~4. In this case, as with the deprecated OPTIMIZE_FOR_SIZE option, the tflite file is deployed as a fully floating-point C-model with the weights in float (dequantized values).
TensorFlow Lite, OPTIMIZE_FOR_SIZE option support
The post-quantization TensorFlow Lite script (TFLiteConverter, TF 1.15) allows generating a weight-only quantized file ('OPTIMIZE_FOR_SIZE' option). This simplest scheme (also called “hybrid” quantization) reduces the size of the generated file (by ~4). Only the weights are quantized from floating point to 8 bits of precision. At inference time, weights are converted from 8 bits of precision back to floating point and computed using floating-point kernels.
This quantization scheme is not supported by ST Edge AI Core, in particular by the C-inference engine and the operator implementations (network runtime library), mainly because of device resource constraints: additional RAM would be required to cache the uncompressed parameters in order to reduce latency. If such a model is imported, the parameters are converted back to floating point before code generation. Only the full 8-bit integer quantization of weights and activations is supported.
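For reference, the following minimal converter sketch (an illustration, not part of the original documentation) shows the weight-only configuration described above: only the optimization flag is set and no representative_dataset is provided, so the resulting file is deployed by ST Edge AI Core as a floating-point C-model.

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")  # hypothetical path
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # No representative_dataset: only the weights are quantized (weight-only / "hybrid"),
    # so ST Edge AI Core falls back to a float C-model with dequantized weights.
    weight_only_model = converter.convert()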
Quantized ONNX models
Quantize and DeQuantize (QDQ) format
To quantize an ONNX model, it is recommended to use the quantization services from the ONNX Runtime module. The tensor-oriented (QDQ; Quantize and DeQuantize) format is privileged. As illustrated by the following figure, the additional DeQuantizeLinear(QuantizeLinear(tensor)) operators between the original operators are automatically detected and removed to deploy the associated optimized integer kernels.
Notes
The merging of Batch Normalization operators is automatically done by the quantization process. The model can also be optimized for inference before quantizing it (ONNX Simplifier can also be used):

    $ python -m onnxruntime.quantization.preprocess --input model.onnx --output model-infer.onnx
Since the implementation of the deployed kernels is mainly channel-last (NHWC data format), and to respect the original input data representation, a ‘Transpose’ operator is added in the generated graph. Note that, depending on the target, this operation can be performed in software and its cost may not be negligible. To improve this behavior and to take application constraints into account, advanced options can be used to specify the expected input or output data layout (refer to the “How to change the I/O data type or layout (NHWC vs NCHW)” article for more details).
By default, the I/O data type (float32) of the original ONNX model is not preserved but optimized to align with the activation type used by the quantization scheme. Advanced options can be used to specify the expected data type. As illustrated in the following figure, the option “--input-data-type float32” is used to insert a quantizer operation and preserve the original data type (refer to the “How to change the I/O data type or layout (NHWC vs NCHW)” article for more details).
Supported/recommended methods/options
method/option | supported/recommended |
---|---|
Dynamic quantization | not supported, only static approach is considered |
Static Quantization | recommended, representative dataset should be used for the calibration |
Quant format | 'QuantFormat.QDQ' is recommended, 'QuantFormat.QOperator' is not recommended (not tested, and limited supported “QLinearXXX* operators”) |
Activation type | 'QuantType.QInt8' is recommended (default), 'QuantType.QUInt8' is not recommended and not tested |
Weight type | 'QuantType.QInt8' is recommended (default), 'QuantType.QUInt8' is not recommended and not tested |
Calibration Method | 'CalibrationMethod.MinMax', 'CalibrationMethod.Entropy' and 'CalibrationMethod.Percentile' can be used |
Per Channel | recommended: True, per tensor (= False) is not recommended and not tested |
nodes_to_exclude/nodes_to_quantize | supported, the mixed models can be deployed by ST Edge AI Core |
Warning
As mentioned in the ONNX Runtime documentation, “Data type selection” section, on x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a significant issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try reduce_range or the U8U8 format, which does not have saturation issues. There is no such issue on other CPU architectures (x64 with VNNI and Arm).
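If such a saturation-related accuracy drop is observed, the reduce_range option of the ONNX Runtime quantizer can be tried. The following is a hedged sketch using the quantize_static API; 'dr' stands for a CalibrationDataReader such as the XXXDataReader shown in the next section, and the file names are placeholders:

    from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

    quantize_static(
        "my_model.onnx",
        "my_model_quantized.onnx",
        calibration_data_reader=dr,       # a CalibrationDataReader instance
        quant_format=QuantFormat.QDQ,
        per_channel=True,
        reduce_range=True,                # quantize weights on 7 bits to avoid VPMADDUBSW saturation
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
    )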
“Static Quantization” method
Note
To illustrate the usage of ONNX quantizer, a complete notebook is available - “Quantization and benchmarking of deep learning models using ONNX runtime and STM32Cube.AI Developer Cloud” [3].
The following code snippet illustrates a typical Python script (post-training quantization) to quantize an NN model processing images (classifier or object-detector applications). It is based on the end-to-end example from the “Quantize ONNX Models” article. Note that only the inputs (images used for the calibration) are requested, and the associated data can be transposed (hwc to chw) to conform to the expected input data representation.
    import os

    import numpy
    import onnxruntime
    from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, CalibrationMethod
    from onnxruntime.quantization import CalibrationDataReader
    from PIL import Image

    input_model_path = 'my_model.onnx'
    output_model_path = 'my_model_quantized.onnx'

    calibration_dataset_path = '/path/to/data/for/calibration'


    def _preprocess_images(images_folder: str, height: int, width: int):
        """
        Load a batch of images and preprocess them.
        """
        image_names = os.listdir(images_folder)
        batch_filenames = image_names
        unconcatenated_batch_data = []

        for image_name in batch_filenames:
            image_filepath = images_folder + "/" + image_name
            pillow_img = Image.new("RGB", (width, height))
            pillow_img.paste(Image.open(image_filepath).resize((width, height)))
            input_data = numpy.float32(pillow_img) - numpy.array(
                [123.68, 116.78, 103.94], dtype=numpy.float32
            )
            nhwc_data = numpy.expand_dims(input_data, axis=0)
            nchw_data = nhwc_data.transpose(0, 3, 1, 2)  # ONNX Runtime standard
            unconcatenated_batch_data.append(nchw_data)
        batch_data = numpy.concatenate(
            numpy.expand_dims(unconcatenated_batch_data, axis=0), axis=0
        )
        return batch_data


    class XXXDataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder: str, model_path: str):
            self.enum_data = None

            # Use inference session to get input shape.
            session = onnxruntime.InferenceSession(model_path, None)
            (_, _, height, width) = session.get_inputs()[0].shape

            # Convert image to input data
            self.nhwc_data_list = _preprocess_images(
                calibration_image_folder, height, width)
            self.input_name = session.get_inputs()[0].name
            self.datasize = len(self.nhwc_data_list)

        def get_next(self):
            if self.enum_data is None:
                self.enum_data = iter(
                    [{self.input_name: nhwc_data} for nhwc_data in self.nhwc_data_list]
                )
            return next(self.enum_data, None)

        def rewind(self):
            self.enum_data = None


    dr = XXXDataReader(
        calibration_dataset_path, input_model_path
    )

    conf = StaticQuantConfig(
        calibration_data_reader=dr,
        quant_format=QuantFormat.QDQ,
        calibrate_method=CalibrationMethod.MinMax,
        optimize_model=True,
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
        # nodes_to_quantize=['resnetv17_dense0_fwd', ..],
        per_channel=True)

    # The input model can also be the output of onnxruntime.quantization.preprocess
    quantize(input_model_path, output_model_path, conf)
Note that for a quick evaluation in terms of inference time and memory footprint, the XXXDataReader object can be updated to generate fake images with random data.
    import numpy as np

    class XXXDataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder: str, model_path: str):
            self.enum_data = None

            # Use inference session to get input shape.
            session = onnxruntime.InferenceSession(model_path, None)
            (_, channel, height, width) = session.get_inputs()[0].shape

            # Generate the random data in the half-open interval [0.0, 1.0).
            self.nhwc_data_list = [np.random.random_sample((1, channel, height, width)).astype(np.float32)
                                   for i in range(20)]
            self.input_name = session.get_inputs()[0].name
            self.datasize = len(self.nhwc_data_list)

        def get_next(self):
            if self.enum_data is None:
                self.enum_data = iter(
                    [{self.input_name: nhwc_data} for nhwc_data in self.nhwc_data_list]
                )
            return next(self.enum_data, None)

        def rewind(self):
            self.enum_data = None
Requested ONNX Opset
Models must be opset 7 or higher to be quantized. Models with opset < 10 must be reconverted to ONNX from their original framework using a later opset. However, to perform some advanced optimizations (BN folding…), it is recommended to use opset 13.
    import onnx

    original_model_path = 'original_model.onnx'
    new_model_path = 'new_model.onnx'

    new_opset = 13

    onnx_model = onnx.load(original_model_path)
    converted_model = onnx.version_converter.convert_version(onnx_model, new_opset)
    onnx.save(converted_model, new_model_path)
ONNX Simplifier
Before quantizing the ONNX model, it is possible to simplify it in order to obtain an efficient ONNX inference model for quantization or deployment.
$ onnxsim my_model.onnx my_simplified_model.onnx --overwrite-input-shape 1,3,224,224
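The same simplification can also be performed from Python (a sketch, assuming the onnxsim package is installed; fixed input shapes can be set beforehand or via the CLI option shown above):

    import onnx
    from onnxsim import simplify

    model = onnx.load("my_model.onnx")
    model_simp, check = simplify(model)
    assert check, "the simplified ONNX model could not be validated"
    onnx.save(model_simp, "my_simplified_model.onnx")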
Known limitations
- ONNX QuantizeLinear and ONNX DequantizeLinear are supported only when the axis attribute corresponds to the channel dimension. In other cases, the generated code can be incorrect and this is not explicitly reported by ST Edge AI Core. If the following assertion is raised during the validation of the model, a possible workaround is to quantize the ONNX model per tensor (per_channel=False).

    Assertion failed: n_channel_in == ( (ai_shape_dimension*)((((&((p_tensor_weights)->shape))))->data) )[(0x0)],
    file <root_file_location>\layers_conv2d_stm32_integer.c, line 3628
Quantize PyTorch models
PyTorch models (Python and .pth files) are not natively supported by ST Edge AI Core. To deploy them, the model should first be converted to the ONNX format (with fixed dimensions, batch size = 1, see the following code snippet). To quantize a PyTorch model, it is currently recommended to use the ONNX Runtime services. PyTorch provides an initial beta version of a quantization API, but the generated quantized model currently cannot be exported to a “standard” ONNX format importable by ST Edge AI Core.
    import torch.onnx
    from torch import nn

    torch_model = MyPytorchModel(...)

    dummy_input = torch.randn(1, 3, 224, 224)  # fixed dimension
    input_names = ["actual_input"]
    output_names = ["output"]

    torch.onnx.export(
        torch_model,                # pytorch model (with the weights)
        dummy_input,                # model input (or a tuple for multiple inputs)
        "my_model.onnx",            # where to save the model
        do_constant_folding=True,   # whether to execute constant folding for optimization
        input_names=input_names,    # the model's input names
        output_names=output_names,  # the model's output names
        opset_version=13,           # the ONNX version to export the model to
        export_params=True,         # store the trained parameter weights
        verbose=False
    )