Quantized model support
ST Edge AI Core Technology 2.2.0
r1.3
References
[1] “A Survey of Quantization Methods for Efficient Neural Network Inference” - https://arxiv.org/abs/2103.13630v3
[2] “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” - https://arxiv.org/abs/1712.05877
[3] “Quantization and benchmarking of deep learning models using ONNX Runtime and STM32Cube.AI Developer Cloud” - https://github.com/STMicroelectronics/stm32ai-modelzoo-services/tree/main/tutorials/notebooks
Overview
ST Edge AI Core CLI can be used to deploy a quantized model. This article covers only 8-bit integer quantized models. Support for lower-than-8-bit quantization is detailed in the “Deep Quantized Neural Network (DQNN) support” article.
Quantization is an optimization technique [1] used to compress a 32-bit floating-point model. This compression reduces the model size, decreases memory usage at runtime, and improves processing efficiency, latency, and power consumption, with only a minor loss in accuracy. In a quantized model, some or all operations are performed on tensors using integers instead of floating-point values. Quantization is a key component of various optimization strategies, including topology-oriented optimization, feature-map reduction, pruning, and weight compression, which are crucial for deploying models in resource-constrained environments.
There are two primary methods of quantization:
- Post-Training Quantization (PTQ): This method is easier to use. It allows you to quantize a pretrained model using a limited, representative dataset (also known as a calibration dataset).
- Quantization-Aware Training (QAT): This method is integrated into the training process and generally results in better model accuracy.
Supported file format
ST Edge AI Core can import different formats of quantized models:
- The quantized TensorFlow Lite models generated by a post-training or quantization-aware training process. The calibration is performed by the TensorFlow Lite framework, principally through the “TFLite converter” utility, which exports a TensorFlow Lite file.
- The quantized ONNX models based on the operator-oriented (QOperator) or the tensor-oriented (QDQ; Quantize and DeQuantize) format. The first format depends on the supported QOperators (see the “QLinearXXX* operators”) while the second is more generic. The DeQuantizeLinear(QuantizeLinear(tensor)) operators are inserted between the original operators (in float) to simulate the quantization and dequantization process. Both formats can be generated with the ONNX runtime module.
For performance reasons (memory peak usage and processing capabilities), the deployed kernels support only quantized weights and quantized activations (“integer only quantization” mode). If this pattern is not detected by the optimizing and rendering passes of the code generator, a fallback to a floating-point version of the operator is used. This fallback involves inserting QUANTIZE/DEQUANTIZE operators, a process known as fake, or simulated, quantization. Consequently:
- The dynamic quantization approach, in which the quantization parameters are determined dynamically during inference, is NOT supported. If the user provides this type of model, a fully float32 implementation is generated.
- Mixed or hybrid models are supported.
The “analyze”, “validate” and “generate” commands can be used without limitations.
    $ stedgeai analyze -m <quantized_model_file>.tflite --target stm32
    $ stedgeai validate -m <quantized_model>.tflite --target stm32 -vi test_data.npz
    $ stedgeai analyze -m <quantized_model_file>.onnx --target stm32
Privileged quantization scheme
target | requested quantization scheme |
---|---|
stm32xx devices based on Arm Cortex®-M processor | ss/sa per channel, binary quantization (other 8-bit quantization schemes including per-tensor are supported but not necessarily optimized) |
stm32xx with ST Neural-ART IP | ss/sa per channel |
stellar-e devices based on Arm Cortex®-M7 processor | ss/sa per channel, binary quantization (other 8-bit quantization schemes including per-tensor are supported but not necessarily optimized) |
stellar-pg[xx] devices based on Arm Cortex®-R52+ and Arm Cortex®-M4 processors | ss/sa per channel, binary quantization (other 8-bit quantization schemes are supported but not necessarily optimized) |
ISPU target | ss/sa per channel, binary quantization |
MLC target | see “ST Edge AI Core for MLC” article |
ISPU target | see “ST Edge AI Core for ISPU” article |
stm32mp target | see “ST Edge AI Core for STM32MPU series” article |
Quantized tensors ([1], [2])
ST Edge AI Core supports 8-bit integer-based (int8 or uint8 data type) arithmetic for quantized tensors, which are based on the representative convention used by Google for quantized models ([2]). Each real number r is represented as a function of the quantized value q, a scale factor (an arbitrary positive real number), and a zero_point parameter. The quantization scheme is an affine mapping of the integers q to real numbers r. zero_point has the same integer C-type as the q data.
The precision depends on the scale factor, and the quantized values are linearly distributed around the zero_point value. In both cases (int8 or uint8), the resolution/precision is constant over the whole range, unlike the floating-point representation.
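For illustration, the following is a minimal NumPy sketch of this affine mapping and its inverse for an int8 tensor (an example only, not the generated kernel code; the scale and zero_point values are arbitrary):

    import numpy as np

    def quantize(r, scale, zero_point):
        # q = round(r / scale) + zero_point, clamped to the int8 range
        q = np.round(r / scale) + zero_point
        return np.clip(q, -128, 127).astype(np.int8)

    def dequantize(q, scale, zero_point):
        # r ~= scale * (q - zero_point)
        return scale * (q.astype(np.int32) - zero_point)

    r = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
    scale, zero_point = 0.008, 3  # example quantization parameters
    print(dequantize(quantize(r, scale, zero_point), scale, zero_point))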
Per-axis vs per-tensor quantization
In per-tensor (or layer-wise) quantization, the same format (that is, scale/zero_point) is used for the entire tensor (weights or activations). In contrast, in per-axis (or per-channel, channel-wise) quantization, used in convolution-based operators, there is one scale and/or zero_point per filter, independent of the other channels, ensuring better quantization (from an accuracy point of view) with negligible computation overhead.
The per-axis approach is currently the standard method for quantized convolutional kernels (weight tensors), while activation tensors are always quantized per-tensor. This approach mainly drives the design of the optimized C-kernels.
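As an illustration, the following NumPy sketch (an assumption: the output channels are on the first axis of the weight tensor) contrasts the single per-tensor scale with the per-channel scales of a convolution kernel:

    import numpy as np

    weights = np.random.randn(16, 3, 3, 3).astype(np.float32)  # (out_channels, in_channels, kh, kw)

    # Per-tensor: a single symmetric scale derived from the whole tensor
    per_tensor_scale = np.abs(weights).max() / 127.0

    # Per-axis (per-channel): one symmetric scale per filter (output channel)
    per_channel_scales = np.abs(weights).reshape(weights.shape[0], -1).max(axis=1) / 127.0

    print(per_tensor_scale, per_channel_scales.shape)  # one scalar vs. 16 scales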
Symmetric vs Asymmetric
In asymmetric quantization, the tensor can have zero_point anywhere within the signed 8-bit range [-128, 127] or unsigned 8-bit range [0, 255]. In contrast, in symmetric quantization, the tensor is forced to have zero_point equal to zero. By enforcing zero_point to zero, some kernel implementation optimizations are possible to limit the cost of operations (offline precalculation).
In TFLite and ONNX quantized models, the weights/bias are always symmetrically quantized; the asymmetric format for the weights/bias is therefore not supported. For activations, both asymmetric and symmetric formats are supported.
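As a simple illustration (a sketch only, not the exact calibration algorithm used by the TFLite or ONNX quantizers), int8 parameters can be derived from the observed [min, max] range of a tensor as follows:

    import numpy as np

    def asymmetric_params(t_min, t_max):
        # Asymmetric int8: the zero_point can be anywhere in [-128, 127]
        scale = (t_max - t_min) / 255.0
        zero_point = int(np.round(-128 - t_min / scale))
        return scale, zero_point

    def symmetric_params(t_min, t_max):
        # Symmetric int8: the zero_point is forced to 0
        scale = max(abs(t_min), abs(t_max)) / 127.0
        return scale, 0

    print(asymmetric_params(-0.2, 6.0))  # for example, a ReLU6-like activation range
    print(symmetric_params(-0.2, 6.0))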
Signed integer vs Unsigned integer - supported schemes
Implementations do not support all possible combinations of type and symmetric/asymmetric format for the weights and/or activations, even though you can define signed or unsigned integer types for them. Only the following integer schemes or combinations are supported:
scheme | weights | activations |
---|---|---|
ua/ua | unsigned and asymmetric | unsigned and asymmetric |
ss/sa | signed and symmetric | signed and asymmetric |
ss/ua | signed and symmetric | unsigned and asymmetric |
Quantized TensorFlow models
ST Edge AI Core is able to import quantization-aware training and post-training quantized TensorFlow Lite models. Earlier quantization-aware trained models were based on the “ua/ua” scheme. Now, TensorFlow v1.15 or v2.x quantized models are based on the “ss/sa” and per-channel scheme: activations are asymmetric and signed (int8), and weights/bias are symmetric and signed (int8). This scheme is also the privileged scheme to efficiently address the Coral Edge TPUs or the TensorFlow Lite for Microcontrollers runtime.
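To verify which scheme a given quantized .tflite file actually uses, the per-tensor quantization parameters can be inspected with the TensorFlow Lite interpreter (a small sketch; the model file name is an assumption):

    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")  # hypothetical file
    interpreter.allocate_tensors()

    for detail in interpreter.get_tensor_details():
        q_params = detail["quantization_parameters"]
        if q_params["scales"].size:  # skip the non-quantized tensors
            mode = "per-channel" if q_params["scales"].size > 1 else "per-tensor"
            print(detail["name"], detail["dtype"], mode, q_params["zero_points"][:1])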
Supported/recommended methods
- Post-training quantization: https://www.tensorflow.org/lite/performance/post_training_quantization
- Quantization-aware training: https://www.tensorflow.org/model_optimization/guide/quantization/training
method/option | supported/recommended |
---|---|
Dynamic range quantization | not supported, only static approach is considered |
Full integer quantization | supported, representative dataset should be used for the calibration |
Integer with float fallback (using default float input/output) | supported, mixed model, representative dataset should be used for the calibration |
Integer only | recommended, representative dataset should be used for the calibration |
Weight Only Quantization | not supported |
Input/output data type | 'uint8', 'int8', and 'float32' can be used |
Float16 quantization | not supported |
Per Channel | default behavior can not be modified |
Activation type | 'int8', “ss/sa” scheme |
Weight type | 'int8', “ss/sa” scheme |
“Integer only” method
The following code snippet illustrates the recommended TFLiteConverter options to enforce a full integer scheme (post-training quantization for all operators, including the input/output tensors).
    def representative_dataset_gen():
        data = load(...)  # load the representative (calibration) data

        for _ in range(num_calibration_steps):
            # Get sample input data as a numpy array in a method of your choosing.
            input = get_sample(data)
            yield [input]

    converter = tf.lite.TFLiteConverter.from_saved_model(<saved_model_dir>)
    # converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.representative_dataset = representative_dataset_gen
    # This enables quantization
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # This ensures that if any ops can't be quantized, the converter throws an error
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # These set the input and output tensors to int8
    converter.inference_input_type = tf.int8   # or tf.uint8
    converter.inference_output_type = tf.int8  # or tf.uint8
    quant_model = converter.convert()

    # Save the quantized file
    with open(<tflite_quant_model_path>, "wb") as f:
        f.write(quant_model)
    ...
“Full integer quantization” method
As mixed models are supported by ST Edge AI Core, the full integer quantization method can be used. However, it is preferable to enable the TensorFlow Lite ops (TFLITE_BUILTINS), as only a limited number of TensorFlow operators are supported (see the supported [TFLITE] operators).
    ...
    converter = tf.lite.TFLiteConverter.from_saved_model(<saved_model_dir>)
    # converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.representative_dataset = representative_dataset_gen
    # This enables quantization
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # This optional setting ensures that TF Lite operators are used.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
    # These set the input and output tensors to float32
    converter.inference_input_type = tf.float32   # or tf.uint8, tf.int8
    converter.inference_output_type = tf.float32  # or tf.uint8, tf.int8
    quant_model = converter.convert()
    ...
Tip
Quantization of the input and/or output tensors is optional. They can be kept in float for convenience and ease of deployment, for example to keep the pre- and/or postprocessing stages in float.
Warning
The 'tf.lite.Optimize.DEFAULT' option enables the quantization process. However, to be sure to obtain quantized weights and quantized activations, the 'representative_dataset' attribute should always be set. Otherwise, only the weights/params are quantized, which reduces the size of the generated file by a factor of ~4. In this case, as with the deprecated OPTIMIZE_FOR_SIZE option, the tflite file is deployed as a fully floating-point C-model with the weights in float (dequantized values).
TensorFlow Lite, OPTIMIZE_FOR_SIZE option support
The post-quantization TensorFlow Lite script (TFLiteConverter, TF 1.15) allows generating a weight-only quantized file ('OPTIMIZE_FOR_SIZE' option). This simplest scheme (also called “hybrid” quantization) reduces the size of the generated file (by ~4). Only the weights are quantized from floating point to 8 bits of precision. At inference time, weights are converted from 8 bits of precision back to floating point and computed using floating-point kernels.
This quantization scheme is not supported by ST Edge AI Core, in particular by the C-inference engine and the operator implementations (network runtime library), mainly because of device resource constraints: additional RAM would be required to cache the uncompressed parameters in order to reduce latency. If such a model is imported, the parameters are converted back to floating point before code generation. Only the full 8-bit integer quantization of weights and activations is supported.
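For reference, the following minimal converter sketch (an illustration, not part of the original documentation) shows the weight-only configuration described above: only the optimization flag is set and no representative_dataset is provided, so the resulting file is deployed by ST Edge AI Core as a floating-point C-model.

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")  # hypothetical path
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # No representative_dataset: only the weights are quantized (weight-only / "hybrid"),
    # so ST Edge AI Core falls back to a float C-model with dequantized weights.
    weight_only_model = converter.convert()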
Quantized ONNX models
Quantize and DeQuantize (QDQ) format
To quantize an ONNX model, it is recommended to use the quantization services from the ONNX Runtime module. The tensor-oriented (QDQ; Quantize and DeQuantize) format is privileged. As illustrated by the following figure, the additional DeQuantizeLinear(QuantizeLinear(tensor)) operators between the original operators are automatically detected and removed to deploy the associated optimized integer kernels.
Notes
The merging of Batch Normalization operators is automatically done by the quantization process. The model can also be optimized for inference before quantizing it (ONNX Simplifier can also be used):

    $ python -m onnxruntime.quantization.preprocess --input model.onnx --output model-infer.onnx
Since the implementation of the deployed kernels is mainly channel-last (NHWC data format), and to respect the original input data representation, a ‘Transpose’ operator is added in the generated graph. Note that, depending on the target, this operation can be performed in software and its cost may not be negligible. To improve this behavior and to take application constraints into account, advanced options can be used to specify the expected input or output data layout (refer to the “How to change the I/O data type or layout (NHWC vs NCHW)” article for more details).
By default, the I/O data type (float32) of the original ONNX model is not preserved but optimized to align with the activation type used by the quantization scheme. Advanced options can be used to specify the expected data type. As illustrated in the following figure, the option “--input-data-type float32” is used to insert a quantizer operation and preserve the original data type (refer to the “How to change the I/O data type or layout (NHWC vs NCHW)” article for more details).
Supported/recommended methods/options
method/option | supported/recommended |
---|---|
Dynamic quantization | not supported, only static approach is considered |
Static Quantization | recommended, representative dataset should be used for the calibration |
Quant format | 'QuantFormat.QDQ' is recommended, 'QuantFormat.QOperator' is not recommended (not tested, and limited supported “QLinearXXX* operators”) |
Activation type | 'QuantType.QInt8' is recommended (default), 'QuantType.QUInt8' is not recommended and not tested |
Weight type | 'QuantType.QInt8' is recommended (default), 'QuantType.QUInt8' is not recommended and not tested |
Calibration Method | 'CalibrationMethod.MinMax', 'CalibrationMethod.Entropy' and 'CalibrationMethod.Percentile' can be used |
Per Channel | recommended: True, per tensor (= False) is not recommended and not tested |
nodes_to_exclude/nodes_to_quantize | supported, the mixed models can be deployed by ST Edge AI Core |
Warning
As mentioned in the ONNX Runtime documentation, “Data type selection” section, on x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a significant issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try reduce_range or the U8U8 format, which does not have saturation issues. There is no such issue on other CPU architectures (x64 with VNNI and Arm).
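If such a saturation-related accuracy drop is observed, the reduce_range option of the ONNX Runtime quantizer can be tried. The following is a hedged sketch using the quantize_static API; 'dr' stands for a CalibrationDataReader such as the XXXDataReader shown in the next section, and the file names are placeholders:

    from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

    quantize_static(
        "my_model.onnx",
        "my_model_quantized.onnx",
        calibration_data_reader=dr,       # a CalibrationDataReader instance
        quant_format=QuantFormat.QDQ,
        per_channel=True,
        reduce_range=True,                # quantize weights on 7 bits to avoid VPMADDUBSW saturation
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
    )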
“Static Quantization” method
Note
To illustrate the usage of ONNX quantizer, a complete notebook is available - “Quantization and benchmarking of deep learning models using ONNX runtime and STM32Cube.AI Developer Cloud” [3].
The following code snippet illustrates a typical Python script (post-training quantization) to quantize an NN model processing images (classifier or object-detector applications). It is based on the end-to-end example from the “Quantize ONNX Models” article. Note that only the inputs (images used for the calibration) are requested, and the associated data can be transposed (hwc to chw) to conform to the expected input data representation.
    import os

    import numpy
    import onnxruntime
    from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, CalibrationMethod
    from onnxruntime.quantization import CalibrationDataReader
    from PIL import Image

    input_model_path = 'my_model.onnx'
    output_model_path = 'my_model_quantized.onnx'

    calibration_dataset_path = '/path/to/data/for/calibration'


    def _preprocess_images(images_folder: str, height: int, width: int):
        """
        Load a batch of images and preprocess them.
        """
        image_names = os.listdir(images_folder)
        batch_filenames = image_names
        unconcatenated_batch_data = []

        for image_name in batch_filenames:
            image_filepath = images_folder + "/" + image_name
            pillow_img = Image.new("RGB", (width, height))
            pillow_img.paste(Image.open(image_filepath).resize((width, height)))
            input_data = numpy.float32(pillow_img) - numpy.array(
                [123.68, 116.78, 103.94], dtype=numpy.float32
            )
            nhwc_data = numpy.expand_dims(input_data, axis=0)
            nchw_data = nhwc_data.transpose(0, 3, 1, 2)  # ONNX Runtime standard
            unconcatenated_batch_data.append(nchw_data)
        batch_data = numpy.concatenate(
            numpy.expand_dims(unconcatenated_batch_data, axis=0), axis=0
        )
        return batch_data


    class XXXDataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder: str, model_path: str):
            self.enum_data = None

            # Use inference session to get input shape.
            session = onnxruntime.InferenceSession(model_path, None)
            (_, _, height, width) = session.get_inputs()[0].shape

            # Convert image to input data
            self.nhwc_data_list = _preprocess_images(
                calibration_image_folder, height, width)
            self.input_name = session.get_inputs()[0].name
            self.datasize = len(self.nhwc_data_list)

        def get_next(self):
            if self.enum_data is None:
                self.enum_data = iter(
                    [{self.input_name: nhwc_data} for nhwc_data in self.nhwc_data_list]
                )
            return next(self.enum_data, None)

        def rewind(self):
            self.enum_data = None


    dr = XXXDataReader(
        calibration_dataset_path, input_model_path
    )

    conf = StaticQuantConfig(
        calibration_data_reader=dr,
        quant_format=QuantFormat.QDQ,
        calibrate_method=CalibrationMethod.MinMax,
        optimize_model=True,
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
        # nodes_to_quantize=['resnetv17_dense0_fwd', ..],
        per_channel=True)

    # The input model can also be the output of onnxruntime.quantization.preprocess
    quantize(input_model_path, output_model_path, conf)
Note that for a quick evaluation in terms of inference time and memory footprint, the XXXDataReader object can be updated to generate fake images with random data.
    import numpy as np

    class XXXDataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder: str, model_path: str):
            self.enum_data = None

            # Use inference session to get input shape.
            session = onnxruntime.InferenceSession(model_path, None)
            (_, channel, height, width) = session.get_inputs()[0].shape

            # Generate the random data in the half-open interval [0.0, 1.0).
            self.nhwc_data_list = [np.random.random_sample((1, channel, height, width)).astype(np.float32)
                                   for i in range(20)]
            self.input_name = session.get_inputs()[0].name
            self.datasize = len(self.nhwc_data_list)

        def get_next(self):
            if self.enum_data is None:
                self.enum_data = iter(
                    [{self.input_name: nhwc_data} for nhwc_data in self.nhwc_data_list]
                )
            return next(self.enum_data, None)

        def rewind(self):
            self.enum_data = None
Requested ONNX Opset
Models must be opset 7 or higher to be quantized. Models with opset < 10 must be reconverted to ONNX from their original framework using a later opset. However, to perform some advanced optimizations (BN folding…), it is recommended to use opset 13.
    import onnx

    original_model_path = 'original_model.onnx'
    new_model_path = 'new_model.onnx'

    new_opset = 13

    onnx_model = onnx.load(original_model_path)
    converted_model = onnx.version_converter.convert_version(onnx_model, new_opset)
    onnx.save(converted_model, new_model_path)
ONNX Simplifier
Before quantizing the ONNX model, it is possible to simplify it in order to obtain an efficient ONNX inference model for quantization or deployment.
$ onnxsim my_model.onnx my_simplified_model.onnx --overwrite-input-shape 1,3,224,224
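The same simplification can also be performed from Python (a sketch, assuming the onnxsim package is installed; fixed input shapes can be set beforehand or via the CLI option shown above):

    import onnx
    from onnxsim import simplify

    model = onnx.load("my_model.onnx")
    model_simp, check = simplify(model)
    assert check, "the simplified ONNX model could not be validated"
    onnx.save(model_simp, "my_simplified_model.onnx")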
Known limitations
- ONNX QuantizeLinear and ONNX DequantizeLinear are supported only when the axis attribute corresponds to the channel dimension. In other cases, the generated code can be incorrect and this is not explicitly reported by ST Edge AI Core. If the following assertion is raised during the validation of the model, a possible workaround is to quantize the ONNX model per tensor (per_channel=False).

    Assertion failed: n_channel_in == ( (ai_shape_dimension*)((((&((p_tensor_weights)->shape))))->data) )[(0x0)],
    file <root_file_location>\layers_conv2d_stm32_integer.c, line 3628
Quantize PyTorch models
PyTorch models (Python and .pth files) are not natively supported by ST Edge AI Core. To deploy them, the model should first be converted to the ONNX format (with fixed dimensions, batch size = 1, see the following code snippet). To quantize a PyTorch model, it is currently recommended to use the ONNX Runtime services. PyTorch provides an initial beta version of a quantization API, but the generated quantized model currently cannot be exported to a “standard” ONNX format importable by ST Edge AI Core.
    import torch.onnx
    from torch import nn

    torch_model = MyPytorchModel(...)

    dummy_input = torch.randn(1, 3, 224, 224)  # fixed dimension
    input_names = ["actual_input"]
    output_names = ["output"]

    torch.onnx.export(
        torch_model,                # pytorch model (with the weights)
        dummy_input,                # model input (or a tuple for multiple inputs)
        "my_model.onnx",            # where to save the model
        do_constant_folding=True,   # whether to execute constant folding for optimization
        input_names=input_names,    # the model's input names
        output_names=output_names,  # the model's output names
        opset_version=13,           # the ONNX version to export the model to
        export_params=True,         # store the trained parameter weights
        verbose=False
    )