Deep Quantized Neural Network (DQNN) support
ST Edge AI Core Technology 2.2.0
r1.1
Overview
The ST Edge AI Core can be used to deploy a pre-trained Deep Quantized Neural Network (DQNN) model designed and trained with the QKeras and Larq libraries when targeting STM32, STELLAR, and ISPU. The purpose of this article is to highlight the supported configurations and limitations so that an efficient and optimized c-inference model can be deployed for ST targets. For detailed explanations and recommendations on designing a DQNN model, check out the respective user guides or the provided notebook(s).
Deep Quantized Neural Network
Quantized models generally refer to models that use an 8-bit signed/unsigned integer data format to encode each weight and activation. After an optimization/quantization process (Post-Training Quantization, PTQ, or Quantization-Aware Training, QAT), a floating-point network can be deployed with smaller integer arithmetic, making it more efficient in terms of computational resources. DQNN denotes models where the bit width used for some weights and/or activations is smaller than 8 bits. Mixed data types (hybrid layers) can also be used for a given operator (for example, binary weights with 8-bit signed integer or 32-bit floating-point activations), allowing a trade-off between accuracy/precision and peak memory usage. To ensure performance, the QKeras and Larq libraries train only in a quantization-aware (QAT) fashion.
Each library is designed as an extension of the high-level Keras API (custom layers) that provides an easy way to quickly create a deep quantized version of an original Keras network. As shown in the left part of the following figure, based on the concept of quantized layers and quantizers, the user can transform a full-precision layer by describing how to quantize the incoming and outgoing activations and weights. Note that the quantized layers are fully compatible with the Keras API, so they can be used interchangeably with Keras layers. This property allows the user to design mixed models in which some layers are kept in float.
Note that, after using a classical quantized model (8-bit format only), the DQNN model can be considered an advanced optimization approach/alternative to deploy a model in resource-constrained environments such as an ST device without a significant loss of accuracy. “Advanced”, because by construction the design of this type of model is not straightforward.
1b and 8b signed format support
The ARM Cortex-M and ISPU instructions and the required data manipulations (pack/unpack operations during memory transfers, etc.) do not allow an efficient implementation for all combinations of data types. The ST Edge AI Core focuses primarily on implementations that improve peak memory usage (flash and/or RAM) and reduce latency (execution time), which means that optimization for size is not supported. Therefore, only the 32-bit float, 8-bit signed, and binary signed (1-bit) data types are considered by the code generator to deploy the optimized c-kernels (see the “Optimized C-kernel configurations” section below). Otherwise, when possible, a fallback to the 32-bit floating-point c-kernel is used, with pre/post quantize/dequantize operations.
- The data type of the input tensors is defined by the 'input_quantizer' argument for Larq, or inferred from the data type of the previous operator (Larq/QKeras).
- The data type of the output tensors is inferred from the outgoing operator chain.
- When the input/output and weight tensors are quantized with a classical 8-bit integer scheme (as for the TFLite quantized models), the respective optimized int8 c-kernel implementations are used.
QKeras library
QKeras is a quantization extension framework developed by Google on top of Keras. It provides drop-in replacements for some of the Keras layers, in particular computational layers such as convolution, dense, and non-linearities. For this reason, the developer can quickly create a deep quantized version of a Keras network. QKeras is designed to extend the functionality of Keras following the Keras design principles, i.e. being user friendly, modular, and extensible, while trying to be “minimally intrusive” with respect to Keras native functionality. It also provides the QTools and AutoQKeras tools to assist the user in deploying a quantized model on a specific hardware implementation or in treating quantization as a hyperparameter search in a Keras-tuner environment.
import tensorflow as tf
import qkeras
...
x = tf.keras.Input(shape=(28, 28, 1))
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, alpha=1))(x)
y = qkeras.QConv2D(16, (3, 3),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False,
                   name="conv2d_0")(y)
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...
model = tf.keras.Model(inputs=x, outputs=y)
Supported QKeras quantizers/layers
- QActivation
- QBatchNormalization
- QConv2D
- QConv2DTranspose
  - the 'padding' parameter should be 'valid'
  - 'stride' must be '(1, 1)'
- QDense
  - only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QuantDense/QDense operator
- QDepthwiseConv2D
The following quantizers and associated configurations are supported:
Quantizer | comments/limitations |
---|---|
quantized_bits() | only 8-bit size (bits=8) is supported |
quantized_relu() | only 8-bit size (bits=8) is supported |
quantized_tanh() | only 8-bit size (bits=8) is supported |
binary() | only supported in signed version (use_01=False), without scale (alpha=1) |
stochastic_binary() | only supported in signed version (use_01=False), without scale (alpha=1) |
Typically, the 'quantized_relu()' quantizer can be used to quantize inputs that are normalized between '0.0' and '1.0'. Note that 'quantized_relu(bits=8, integer=8, keep_negative=False)' can be considered if the input values range between '0.0' and '256.0'.

x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, integer=0))(x)
y = qkeras.QConv2D(..)(y)
...
The 'quantized_bits()' quantizer can also be used to quantize inputs that are normalized between '-1.0' and '1.0'. Note that 'quantized_bits(bits=8, integer=7)' can be considered if the input values range between '-128.0' and '127.0'.

x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=0, symmetric=0, alpha=1))(x)
y = qkeras.QConv2D(..)(y)
...
To have a fully binarized operation without bias, and a normalized and binarized output:

...
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
y = qkeras.QConv2D(..,
                   kernel_quantizer="binary(alpha=1)",
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False)(y)
y = tf.keras.layers.MaxPooling2D(...)(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...
Larq library
Larq is an open-source Python library for training neural networks with extremely low-precision weights and activations, such as Binarized Neural Networks (BNNs). The approach is similar to the QKeras library, with a primary focus on BNN models. To deploy the trained model, a specific highly optimized inference engine (Larq Compute Engine, LCE) is also provided for various mobile platforms.
import tensorflow as tf
import larq as lq
...
x = tf.keras.Input(shape=(28, 28, 1))
y = tf.keras.layers.Flatten()(x)
y = lq.layers.QuantDense(512,
                         kernel_quantizer="ste_sign",
                         kernel_constraint="weight_clip")(y)
y = lq.layers.QuantDense(10,
                         input_quantizer="ste_sign",
                         kernel_quantizer="ste_sign",
                         kernel_constraint="weight_clip")(y)
y = tf.keras.layers.Activation("softmax")(y)
...
model = tf.keras.Model(inputs=x, outputs=y)
Supported Larq layers
- QuantConv2D
  - for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'
  - with the 'DoReFa(..)' quantizer, 'use_bias=False' is expected
- QuantDense
  - the 'DoReFa(..)' quantizer is not supported
  - only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QuantDense/QDense operator
- QuantDepthwiseConv2D
  - for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'
  - the 'DoReFa(..)' quantizer is not supported
Only the following quantizers and associated configurations are supported. Larq quantizers are fully described in the Larq documentation section: “https://docs.larq.dev/larq/api/quantizers/”:
Quantizer | comments/limitations |
---|---|
'SteSign' | used for binary quantization |
'ApproxSign' | used for binary quantization |
'SwishSign' | used for binary quantization |
'DoReFa' | only 8-bit size (k_bit=8) is supported, for the QuantConv2D layer |
Typically, the 'DoReFa(k_bit=8, mode="activations")' quantizer can be used to quantize inputs that are normalized between '0.0' and '1.0'. Note that 'DoReFa(k_bit=8, mode="weights")' quantizes the weights between '-1.0' and '1.0'.

x = tf.keras.Input(shape=(..))
y = larq.layers.QuantConv2D(..,
                            input_quantizer=larq.quantizers.DoReFaQuantizer(k_bit=8, mode="activations"),
                            kernel_quantizer=larq.quantizers.DoReFaQuantizer(k_bit=8, mode="weights"),
                            use_bias=False)(x)
...
Optimized C-kernel configurations
Implementation conventions
This section shows the optimized data-type combinations, using the following naming convention:
- f32 identifies the absence of quantization (i.e., 32-bit floating point)
- s8 refers to the 8-bit signed quantizers
- s1 refers to the binary signed (1-bit) quantizers
c-layout of the s1 type
Elements of a binary activation tensor are packed into 32-bit words along the last dimension ('axis=-1') with the following rules:

- bit order: little or MSB first
- pad value: '0b'
- a positive value is coded with '0b', while a negative value is coded with '1b'
Note
It is recommended to have the number of channels as a multiple of 32 to optimize flash/RAM size and MAC/cycle, although it is not mandatory.
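As an illustration of these layout rules only (a NumPy sketch, not the code generated by ST Edge AI Core; the MSB-first order inside each 32-bit word is an assumption made for this example), a binary tensor can be packed along its last axis as follows:

import numpy as np

def pack_s1_last_axis(t):
    # +1 -> 0b, -1 -> 1b; the last axis is padded with 0b up to a multiple of 32 bits
    bits = (np.asarray(t) < 0).astype(np.uint8)
    pad = (-bits.shape[-1]) % 32
    bits = np.concatenate([bits, np.zeros(bits.shape[:-1] + (pad,), np.uint8)], axis=-1)
    packed = np.packbits(bits, axis=-1)               # MSB-first bytes (assumed order)
    return packed.reshape(bits.shape[:-1] + (-1, 4))  # grouped as 4 bytes per 32-bit word

# 40 channels are stored on 2 x 32-bit words (24 padding bits): this is why channel counts
# that are multiples of 32 make the best use of the packed storage.
acts = np.random.choice([-1, 1], size=(2, 40))
print(pack_s1_last_axis(acts).shape)                  # (2, 2, 4)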
Quantized Dense layers
input format | output format | weight format | bias format (1) | notes |
---|---|---|---|---|
s1 | s1 | s1 | s1 | (2) |
s1 | s1 | s1 | f32 | (2) |
s1 | s1 | s8 | s1 | (2) |
s1 | s8 | s1 | s8 | (2) |
s1 | s8 | s8 | s8 | (2) |
s1 | f32 | s1 | s1 | (2) |
s1 | f32 | s1 | f32 | (2) |
s1 | f32 | s8 | s8 | (2) |
s1 | f32 | f32 | f32 | (2) |
s8 | s1 | s1 | s1 | (2) |
s8 | s8 | s1 | s1 | (2) |
s8 | f32 | s1 | s1 | (2) |
s8 | s8 | s8 | s8 | (2), bias stored in s32 format |
s8 | s8 | s8 | s32 | (2), int8-tflite kernels |
f32 | s1 | s1 | s1 | (2) |
f32 | s1 | s1 | f32 | (2) |
f32 | f32 | s1 | s1 | (2) |
f32 | f32 | s1 | f32 | (2) |
(1) usage of the bias is optional
(2) batch-normalization can be fused
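The fully binary rows above (s1 input and s1 weights) are the most efficient ones because, with the packing described in the “c-layout of the s1 type” section, a binary multiply-accumulate reduces to a bitwise XOR followed by a population count; this is what operation counters such as sxor_s1_s1 in the reports below refer to. A minimal NumPy illustration of the idea (a sketch only, not the generated c-kernel), using a 64-element vector so that no padding bits are involved:

import numpy as np

n = 64                                                # multiple of 32: no padding bits
a = np.random.choice([-1, 1], size=n)                 # binary activations
w = np.random.choice([-1, 1], size=n)                 # binary weights

# s1 encoding: positive -> 0b, negative -> 1b
pa = np.packbits((a < 0).astype(np.uint8))
pw = np.packbits((w < 0).astype(np.uint8))

# XOR marks the positions where a and w differ; each mismatch contributes -2 to the dot product
mismatches = sum(bin(int(b)).count("1") for b in np.bitwise_xor(pa, pw))
dot = n - 2 * mismatches

assert dot == int(np.dot(a, w))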
Optimized Convolution layers
input format | output format | weight format | bias format (1) | notes |
---|---|---|---|---|
s1 | s1 | s1 | s1 | (2), (3), including pointwise and depthwise version |
s1 | s8 | s1 | s8 | (2), including pointwise version |
s1 | s8 | s1 | f32 | (2), including pointwise version |
s1 | f32 | s1 | f32 | (2), including pointwise version |
s8 | s1 | s8 | | |
s8 (DoReFa) | s1 | s8 (DoReFa) | | (2), use_bias=False |
s8 | s8 | s8 | s8 | (2), bias stored in s32 format |
s8 | s8 | s8 | s32 | (2), int8-tflite kernels |
(1) usage of the bias is optional
(2) batch-normalization can be fused
(3) maxpool can be inserted between the convolution and the
batch-normalization operators
Misc layers
The following layers are also available to support more complex topologies, for example with residual connections.
layer | input format | output format | notes |
---|---|---|---|
maxpool | s1 | s1 | s8/f32 data type is also supported by the “standard” C-kernels |
concat | s1 | s1 | s8/f32 data type is also supported by the “standard” C-kernels |
Evidence of efficient code generation
Similar to the 'qkeras.print_qstats()' function or the extended 'summary()' function in Larq, the “analyze” command reports a summary of the number of operations used by each generated c-layer according to the data types. The number of operation types for the entire generated c-model is also reported. This last information makes it possible to know whether the deployed model is entirely or partially based on the optimized binarized/quantized c-kernels.
Note
For the size of the deployed weights, the 'ROM'/'weights (ro)' metric indicates the expected size to store the quantized weights on the target. Note that the reported value is compared with the size needed to store the weights in the original format (32-bit floating point). Detailed information by c-layer and by associated tensor is available in the generated reports.
The following example shows that 90% of the operations are binary operations, with the "quant_conv2d_1_conv2d" layer as the main contributor.
$ stedgeai analyze -m <model_file.h5> --target stm32
...
params # : 93,556 items (365.45 KiB)
macc : 2,865,718
weights (ro) : 14,496 B (14.16 KiB) (1 segment) / -359,728(-96.1%) vs float model
activations (rw) : 86,528 B (84.50 KiB) (1 segment)
ram (total) : 89,704 B (87.60 KiB) = 86,528 + 3,136 + 40
...
Number of operations and param per c-layer
-------------------------------------------------------------------------------------------
c_id m_id name (type) #op (type)
-------------------------------------------------------------------------------------------
0 2 quant_conv2d_conv2d (conv2d) 194,720 (smul_f32_f32)
1 3 quant_conv2d_1_conv (conv) 43,264 (conv_f32_s1)
2 1 max_pooling2d (pool) 21,632 (op_s1_s1)
3 3 quant_conv2d_1_conv2d (conv2d_dqnn) 2,230,272 (sxor_s1_s1)
4 5 max_pooling2d_1 (pool) 6,400 (op_s1_s1)
5 7 quant_conv2d_2_conv2d (conv2d_dqnn) 331,776 (sxor_s1_s1)
6 10 quant_dense_quantdense (dense_dqnn_dqnn) 36,864 (sxor_s1_s1)
7 13 quant_dense_1_quantdense (dense_dqnn_dqnn) 640 (sxor_s1_s1)
8 15 activation (nl) 150 (op_f32_f32)
-------------------------------------------------------------------------------------------
total 2,865,718
Number of operation types
---------------------------------------------
smul_f32_f32 194,720 6.8%
conv_f32_s1 43,264 1.5%
op_s1_s1 28,032 1.0%
sxor_s1_s1 2,599,552 90.7%
op_f32_f32 150 0.0%
Example of supported patterns
As already mentioned in the overview, the ST Edge AI Core infers the data types of the input and output tensors from the data types of the incoming and outgoing operator chains, respectively. This section illustrates the typical patterns that are considered to deploy an optimized c-kernel.
Activation layer
An activation layer (including the QActivation layer) is mainly supported by fusing it into the previous or following layer. The supported arguments are the supported quantizer configurations.
x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)
y = qkeras.QConv2D(..)(y)
...

is equivalent to (from the code-generation point of view):

x = tf.keras.Input(shape=(..))
y = larq.layers.QuantConv2D(..,
                            input_quantizer='ste_sign', ..)(x)
...
Quantized dense layer
Dense layer patterns are a sequence of layers composed of:

- an optional QActivation to specify the input format. If missing, the input format is f32
- the QDense layer; use of bias is optional
- an optional other layer, e.g. (Q)BatchNormalization, which can be merged into the previous dense layer
- an optional QActivation to specify the output format. If missing, the output format is f32
The first example illustrates the case where an 8-bit quantized input (s8, quantized with 'quantized_bits(bits=8, integer=7)') is used for the QDense layer, which exploits 1-bit quantized (binary) weights with bias. The output is normalized and in f32.
import qkeras
import tensorflow as tf
shape_in = (128, 128)
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(x)
y = tf.keras.layers.Flatten()(y)
y = qkeras.QDense(16,
kernel_quantizer=qkeras.binary(alpha=1),
bias_quantizer=qkeras.binary(alpha=1),
use_bias=True,
name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("softmax")(y)
model = tf.keras.Model(inputs=x, outputs=y)
Number of operations per c-layer
------- ------ --------------------------- --------- ------------
c_id m_id name (type) #op type
------- ------ --------------------------- --------- ------------
0 3 dense_0 (Dense) 262,144 smul_s8_s1
1 5 activation (Nonlinearity) 240 op_f32_f32
------- ------ --------------------------- --------- ------------
total 262,384
Number of operation types
---------------- --------- -----------
operation type # %
---------------- --------- -----------
smul_s8_s1 262,144 99.9%
op_f32_f32 240 0.1%
The second example shows the case where two dense layers are used. The first uses binary weights with an 8-bit quantized input to compute a binary output. The second layer uses binary weights and the previous binary output to compute the f32 output.
x = tf.keras.Input(shape=shape_in)
y = tf.keras.layers.Flatten()(x)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(y)
y = qkeras.QDense(64,
                  kernel_quantizer=qkeras.binary(alpha=1),
                  bias_quantizer=qkeras.binary(alpha=1),
                  use_bias=True,
                  name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
y = qkeras.QDense(10,
                  kernel_quantizer=qkeras.binary(alpha=1),
                  bias_quantizer=qkeras.binary(alpha=1),
                  use_bias=True,
                  name="dense_1")(y)
y = tf.keras.layers.Activation("softmax")(y)
Number of operations and param per c-layer
-------------------------------------------------------------------------------
c_id   m_id   name (type)                            #op (type)
-------------------------------------------------------------------------------
0      3      dense_0_qdense (dense_dqnn_dqnn)       50,176 (smul_s8_s1)
1      6      dense_1_qdense (dense_dqnn_dqnn)       650 (sxor_s1_s1)
2      7      activation (nl)                        150 (op_f32_f32)
-------------------------------------------------------------------------------
total                                                50,976

Number of operation types
---------------------------------------------
smul_s8_s1         50,176       98.4%
sxor_s1_s1            650        1.3%
op_f32_f32            150        0.3%
Quantized Convolution layer
For the quantized convolution layer, the pattern is similar to the quantized dense layer.
The following example shows the case where binary inputs/weights are used to compute a normalized binary output. A MaxPooling2D layer allows the activations to be compacted.
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False,
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

model = tf.keras.Model(inputs=x, outputs=y)
A variation with the 'strides' argument can be used to avoid the use of the MaxPooling2D layer.
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False,
                   strides=(2, 2),
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

model = tf.keras.Model(inputs=x, outputs=y)
Residual connections case
The following model shows how residual connections can be created to concatenate the activations. Particular attention must be paid to the shapes to be concatenated, since they must be the same except for the size along the concatenation axis.
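The exact model behind the report below is not reproduced here. As an illustration only, a residual/concatenation pattern of this kind could be written as in the following sketch (hypothetical layer sizes and names, not those of the reported model):

import tensorflow as tf
import qkeras

x = tf.keras.Input(shape=(32, 32, 64))
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)

# main branch: binary convolution, downsampled by the strides
b1 = qkeras.QConv2D(filters=64, kernel_size=(3, 3), strides=(2, 2), padding="same",
                    kernel_quantizer=qkeras.binary(alpha=1),
                    use_bias=False)(y)
b1 = tf.keras.layers.BatchNormalization()(b1)
b1 = qkeras.QActivation(qkeras.binary(alpha=1))(b1)

# shortcut branch: binary depthwise convolution with the same output spatial size
b2 = qkeras.QDepthwiseConv2D(kernel_size=(3, 3), strides=(2, 2), padding="same",
                             depthwise_quantizer=qkeras.binary(alpha=1),
                             use_bias=False)(y)
b2 = tf.keras.layers.BatchNormalization()(b2)
b2 = qkeras.QActivation(qkeras.binary(alpha=1))(b2)

# both branches produce (16, 16, 64) tensors: only the channel axis may differ for the concat
y = tf.keras.layers.Concatenate(axis=-1)([b1, b2])
model = tf.keras.Model(inputs=x, outputs=y)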
...
Number of operations and param per c-layer
--------------------------------------------------------------------------------------------
c_id   m_id   name (type)                                        #op (type)
--------------------------------------------------------------------------------------------
0      1      quant_conv2d_5_conv2d (conv2d_dqnn)                819,200 (sxor_s1_s1)
1      2      max_pooling2d (pool)                               12,800 (op_s1_s1)
2      4      quant_depthwise_conv2d_2_conv2d (conv2d_dqnn)      28,800 (sxor_s1_s1)
3      6      concatenate_2 (concat)                             0 (op_s1_s1)
--------------------------------------------------------------------------------------------
total                                                            860,800

Number of operation types
---------------------------------------------
sxor_s1_s1        848,000       98.5%
op_s1_s1           12,800        1.5%
...
Fallback to 32b floating point kernels
The following code shows a case where the requested configuration, 's8xs1->s8', is not supported and the fallback is applied. This is a typical case where the user has the opportunity to modify the model (after a pre-analysis step) to keep this layer in float, limiting the possible loss of precision (one possible alternative modification is sketched after the report below).
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(x)
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3), strides=(2, 2),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=True,
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(y)

model = tf.keras.Model(inputs=x, outputs=y)
Number of operations and param per c-layer
----------------------------------------------------------------------------
c_id   m_id   name (type)                          #op (type)
----------------------------------------------------------------------------
0      0      input_1_0_conversion (conv)          1,568 (conv_s8_f32)
1      3      dense_0_conv2d (conv2d)              28,240 (smul_f32_f32)
2      4      q_activation_1 (conv)                6,272 (conv_f32_s8)
----------------------------------------------------------------------------
total                                              36,080

Number of operation types
---------------------------------------------
conv_s8_f32          1,568        4.3%
smul_f32_f32        28,240       78.3%
conv_f32_s8          6,272       17.4%
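One possible modification (a sketch only, assuming the accuracy impact is acceptable for the use case) is to binarize the convolution input so that the layer falls into a supported configuration of the convolution table above (s1 input, s1 weights, s8 output, no bias). Another option, as mentioned above, is to deliberately keep the layer in float.

x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)                     # s1 input instead of s8
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3), strides=(2, 2),
                   kernel_quantizer=qkeras.binary(alpha=1),           # s1 weights
                   use_bias=False,
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(y)   # s8 output

model = tf.keras.Model(inputs=x, outputs=y)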
Building an efficient DQNN/BNN model for ST Edge AI Core
All highlighted recommendations from the Larq documentation (https://docs.larq.dev/larq/guides/bnn-architecture/) or from the QKeras notebooks (https://github.com/google/qkeras/blob/master/notebook/QKerasTutorial.ipynb or https://notebook.community/google/qkeras/notebook/QKerasTutorial) should be considered to design an efficient DQNN/BNN model for the ST targets. In particular (a sketch applying these recommendations follows the list):
- It is preferable to leave the first layer and the last layer in higher precision: 's8' or 'f32'
- Use the 'BatchNormalization' layer
- Place the 'MaxPool' layer before the 'BatchNormalization' layer
- Due to the way binary tensors are encoded (see the “c-layout of the s1 type” section), it is recommended to have the number of channels as a multiple of 32 to optimize flash/RAM size and MAC/cycle, although it is not mandatory.
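A minimal sketch applying these recommendations (illustrative only, with arbitrary layer sizes, assuming inputs normalized between 0.0 and 1.0): the first and last layers are kept in higher precision, the binary convolution uses a channel count that is a multiple of 32, and the MaxPool layers are placed before the BatchNormalization layers.

import tensorflow as tf
import qkeras

x = tf.keras.Input(shape=(32, 32, 3))

# first layer kept in higher precision (s8 activations and weights)
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, integer=0))(x)
y = qkeras.QConv2D(32, (3, 3), padding="same",
                   kernel_quantizer=qkeras.quantized_bits(bits=8, integer=0, alpha=1),
                   use_bias=False)(y)
y = tf.keras.layers.MaxPooling2D((2, 2))(y)          # MaxPool before BatchNormalization
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

# binary body, 64 channels (multiple of 32)
y = qkeras.QConv2D(64, (3, 3), padding="same",
                   kernel_quantizer=qkeras.binary(alpha=1),
                   use_bias=False)(y)
y = tf.keras.layers.MaxPooling2D((2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

# last layer kept in float (f32 output)
y = tf.keras.layers.Flatten()(y)
y = tf.keras.layers.Dense(10)(y)
y = tf.keras.layers.Activation("softmax")(y)

model = tf.keras.Model(inputs=x, outputs=y)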
Pre-analyze step
It is recommended to run the “analyze” command regularly during the design of the DQNN/BNN model, before performing a complete training, to know whether it will be deployed efficiently (no fallback used). This avoids using a quantized layer that is not supported, or that brings no gain in terms of memory usage, when it is deployed on an ST target.
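As a reminder, the command form is the same as the one used for the report above (no additional options are assumed here):

$ stedgeai analyze -m <model_file.h5> --target stm32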
FAQ
Is it possible to mix the quantized Larq and QKeras layers?
From the ST Edge AI Core point of view, yes: when the model is imported, it is translated into an independent internal representation before the different optimization passes and the rendering stage are applied. However, this is not recommended: even though each library (Larq or QKeras) is based on the Keras API, they are designed independently, and there is no guarantee of converging to a good level of precision during the training phase.