Deep Quantized Neural Network (DQNN) support
ST Edge AI Core Technology 2.2.0
r1.1
Overview
The ST Edge AI Core can be used to deploy a pre-trained Deep Quantized Neural Network (DQNN) model designed and trained with the QKeras and Larq libraries when targeting STM32, STELLAR, and ISPU. The purpose of this article is to highlight the supported configurations and limitations so that an efficient and optimized c-inference model can be deployed for ST targets. For detailed explanations and recommendations on designing a DQNN model, check out the respective user guides or the provided notebook(s).
Deep Quantized Neural Network
Quantized models generally refer to models that use an 8-bit signed/unsigned integer data format to encode each weight and activation. After an optimization/quantization process (Post-Training Quantization, PTQ, or Quantization-Aware Training, QAT), a floating-point network can be deployed with smaller integer arithmetic, making it more efficient in terms of computational resources. DQNN denotes models where the bit width used for some weights and/or activations is smaller than 8 bits. Mixed data types (hybrid layers) can also be used for a given operator (for example, binary weights with 8-bit signed integer or 32-bit floating-point activations), allowing a trade-off between accuracy/precision and peak memory usage. To ensure performance, the QKeras and Larq libraries train only in a quantization-aware (QAT) fashion.
Each library is designed as an extension of the high-level Keras API (custom layers) that provides an easy way to quickly create a deep quantized version of an original Keras network. As shown in the left part of the following figure, based on the concept of quantized layers and quantizers, the user can transform a full-precision layer by describing how to quantize the incoming and outgoing activations and weights. Note that the quantized layers are fully compatible with the Keras API, so they can be used interchangeably with Keras layers. This property allows the user to design mixed models in which some layers are kept in float.
Note that, after using a classical quantized model (8-bit format only), the DQNN model can be considered an advanced optimization approach/alternative to deploy a model in resource-constrained environments such as an ST device without a significant loss of accuracy. “Advanced”, because by construction the design of this type of model is not straightforward.
1b and 8b signed format support
The ARM Cortex-M and ISPU instructions and the required data manipulations (pack/unpack operations during memory transfers, etc.) do not allow an efficient implementation for all combinations of data types. The ST Edge AI Core focuses primarily on implementations that improve peak memory usage (flash and/or RAM) and reduce latency (execution time), which means that optimization for size is not supported. Therefore, only the 32-bit float, 8-bit signed, and binary signed (1-bit) data types are considered by the code generator to deploy the optimized c-kernels (see the “Optimized C-kernel configurations” section below). Otherwise, when possible, a fallback to the 32-bit floating-point c-kernel is used, with pre/post quantize/dequantize operations.
- The data type of the input tensors is defined by the 'input_quantizer' argument for Larq, or inferred from the data type of the previous operator (Larq/QKeras).
- The data type of the output tensors is inferred from the outgoing operator chain.
- When the input/output and weight tensors are quantized with a classical 8-bit integer scheme (as for the TFLite quantized models), the respective optimized int8 c-kernel implementations are used.
QKeras library
QKeras is a quantization extension framework developed by Google on top of Keras. It provides drop-in replacements for some of the Keras layers, in particular computational layers such as convolution, dense, and non-linearities. For this reason, the developer can quickly create a deep quantized version of a Keras network. QKeras is designed to extend the functionality of Keras following the Keras design principles, i.e. being user friendly, modular, and extensible, while trying to be “minimally intrusive” with respect to Keras native functionality. It also provides the QTools and AutoQKeras tools to assist the user in deploying a quantized model on a specific hardware implementation or in treating quantization as a hyperparameter search in a Keras-tuner environment.
import tensorflow as tf
import qkeras
...
x = tf.keras.Input(shape=(28, 28, 1))
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, alpha=1))(x)
y = qkeras.QConv2D(16, (3, 3),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False,
                   name="conv2d_0")(y)
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...
model = tf.keras.Model(inputs=x, outputs=y)
Supported QKeras quantizers/layers
- QActivation
- QBatchNormalization
- QConv2D
- QConv2DTranspose
  - the 'padding' parameter should be 'valid'
  - 'stride' must be '(1, 1)'
- QDense
  - only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QuantDense/QDense operator
- QDepthwiseConv2D
The following quantizers and associated configurations are supported:
Quantizer | comments/limitations |
---|---|
quantized_bits() | only 8-bit size (bits=8) is supported |
quantized_relu() | only 8-bit size (bits=8) is supported |
quantized_tanh() | only 8-bit size (bits=8) is supported |
binary() | only supported in signed version (use_01=False), without scale (alpha=1) |
stochastic_binary() | only supported in signed version (use_01=False), without scale (alpha=1) |
Typically, the 'quantized_relu()' quantizer can be used to quantize inputs that are normalized between '0.0' and '1.0'. Note that 'quantized_relu(bits=8, integer=8, keep_negative=False)' can be considered if the input values range between '0.0' and '256.0'.

x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, integer=0))(x)
y = qkeras.QConv2D(..)(y)
...
The 'quantized_bits()' quantizer can also be used to quantize inputs that are normalized between '-1.0' and '1.0'. Note that 'quantized_bits(bits=8, integer=7)' can be considered if the input values range between '-128.0' and '127.0'.

x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=0, symmetric=0, alpha=1))(x)
y = qkeras.QConv2D(..)(y)
...
To have a fully binarized operation without bias, and a normalized and binarized output:

...
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
y = qkeras.QConv2D(..,
                   kernel_quantizer="binary(alpha=1)",
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False)(y)
y = tf.keras.layers.MaxPooling2D(...)(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...
Larq library
Larq is an open-source Python library for training neural networks with extremely low-precision weights and activations, such as Binarized Neural Networks (BNNs). The approach is similar to the QKeras library, with a primary focus on BNN models. To deploy the trained model, a specific highly optimized inference engine (Larq Compute Engine, LCE) is also provided for various mobile platforms.
import tensorflow as tf
import larq as lq
...
x = tf.keras.Input(shape=(28, 28, 1))
y = tf.keras.layers.Flatten()(x)
y = lq.layers.QuantDense(512,
                         kernel_quantizer="ste_sign",
                         kernel_constraint="weight_clip")(y)
y = lq.layers.QuantDense(10,
                         input_quantizer="ste_sign",
                         kernel_quantizer="ste_sign",
                         kernel_constraint="weight_clip")(y)
y = tf.keras.layers.Activation("softmax")(y)
...
model = tf.keras.Model(inputs=x, outputs=y)
Supported Larq layers
- QuantConv2D
  - for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'
  - with the 'DoReFa(..)' quantizer, 'use_bias=False' is expected
- QuantDense
  - the 'DoReFa(..)' quantizer is not supported
  - only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QuantDense/QDense operator
- QuantDepthwiseConv2D
  - for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'
  - the 'DoReFa(..)' quantizer is not supported
Only the following quantizers and associated configurations are supported. Larq quantizers are fully described in the Larq documentation section: “https://docs.larq.dev/larq/api/quantizers/”:
Quantizer | comments/limitations |
---|---|
'SteSign' | used for binary quantization |
'ApproxSign' | used for binary quantization |
'SwishSign' | used for binary quantization |
'DoReFa' | only 8-bit size (k_bit=8) is supported, for the QuantConv2D layer |
Typically, the 'DoReFa(k_bit=8, mode="activations")' quantizer can be used to quantize inputs that are normalized between '0.0' and '1.0'. Note that 'DoReFa(k_bit=8, mode="weights")' quantizes the weights between '-1.0' and '1.0'.

x = tf.keras.Input(shape=(..))
y = larq.layers.QuantConv2D(..,
                            input_quantizer=larq.quantizers.DoReFaQuantizer(k_bit=8, mode="activations"),
                            kernel_quantizer=larq.quantizers.DoReFaQuantizer(k_bit=8, mode="weights"),
                            use_bias=False)(x)
...
Optimized C-kernel configurations
Implementation conventions
This section shows the optimized data-type combinations, using the following naming convention:
- f32 identifies the absence of quantization (i.e., 32-bit floating point)
- s8 refers to the 8-bit signed quantizers
- s1 refers to the binary signed (1-bit) quantizers
c-layout of the s1 type
Elements of a binary activation tensor are packed into 32-bit words along the last dimension ('axis=-1') with the following rules:

- bit order: little or MSB first
- pad value: '0b'
- a positive value is coded with '0b', while a negative value is coded with '1b'
Note
It is recommended to have the number of channels as a multiple of 32 to optimize flash/RAM size and MAC/cycle, although it is not mandatory.
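As an illustration of these layout rules only (a NumPy sketch, not the code generated by ST Edge AI Core; the MSB-first order inside each 32-bit word is an assumption made for this example), a binary tensor can be packed along its last axis as follows:

import numpy as np

def pack_s1_last_axis(t):
    # +1 -> 0b, -1 -> 1b; the last axis is padded with 0b up to a multiple of 32 bits
    bits = (np.asarray(t) < 0).astype(np.uint8)
    pad = (-bits.shape[-1]) % 32
    bits = np.concatenate([bits, np.zeros(bits.shape[:-1] + (pad,), np.uint8)], axis=-1)
    packed = np.packbits(bits, axis=-1)               # MSB-first bytes (assumed order)
    return packed.reshape(bits.shape[:-1] + (-1, 4))  # grouped as 4 bytes per 32-bit word

# 40 channels are stored on 2 x 32-bit words (24 padding bits): this is why channel counts
# that are multiples of 32 make the best use of the packed storage.
acts = np.random.choice([-1, 1], size=(2, 40))
print(pack_s1_last_axis(acts).shape)                  # (2, 2, 4)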
Quantized Dense layers
input format | output format | weight format | bias format (1) | notes |
---|---|---|---|---|
s1 | s1 | s1 | s1 | (2) |
s1 | s1 | s1 | f32 | (2) |
s1 | s1 | s8 | s1 | (2) |
s1 | s8 | s1 | s8 | (2) |
s1 | s8 | s8 | s8 | (2) |
s1 | f32 | s1 | s1 | (2) |
s1 | f32 | s1 | f32 | (2) |
s1 | f32 | s8 | s8 | (2) |
s1 | f32 | f32 | f32 | (2) |
s8 | s1 | s1 | s1 | (2) |
s8 | s8 | s1 | s1 | (2) |
s8 | f32 | s1 | s1 | (2) |
s8 | s8 | s8 | s8 | (2), bias stored in s32 format |
s8 | s8 | s8 | s32 | (2), int8-tflite kernels |
f32 | s1 | s1 | s1 | (2) |
f32 | s1 | s1 | f32 | (2) |
f32 | f32 | s1 | s1 | (2) |
f32 | f32 | s1 | f32 | (2) |
(1) usage of the bias is optional
(2) batch-normalization can be fused
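The fully binary rows above (s1 input and s1 weights) are the most efficient ones because, with the packing described in the “c-layout of the s1 type” section, a binary multiply-accumulate reduces to a bitwise XOR followed by a population count; this is what operation counters such as sxor_s1_s1 in the reports below refer to. A minimal NumPy illustration of the idea (a sketch only, not the generated c-kernel), using a 64-element vector so that no padding bits are involved:

import numpy as np

n = 64                                                # multiple of 32: no padding bits
a = np.random.choice([-1, 1], size=n)                 # binary activations
w = np.random.choice([-1, 1], size=n)                 # binary weights

# s1 encoding: positive -> 0b, negative -> 1b
pa = np.packbits((a < 0).astype(np.uint8))
pw = np.packbits((w < 0).astype(np.uint8))

# XOR marks the positions where a and w differ; each mismatch contributes -2 to the dot product
mismatches = sum(bin(int(b)).count("1") for b in np.bitwise_xor(pa, pw))
dot = n - 2 * mismatches

assert dot == int(np.dot(a, w))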
Optimized Convolution layers
input format | output format | weight format | bias format (1) | notes |
---|---|---|---|---|
s1 | s1 | s1 | s1 | (2), (3), including pointwise and depthwise version |
s1 | s8 | s1 | s8 | (2), including pointwise version |
s1 | s8 | s1 | f32 | (2), including pointwise version |
s1 | f32 | s1 | f32 | (2), including pointwise version |
s8 | s1 | s8 | | |
s8 (DoReFa) | s1 | s8 (DoReFa) | | (2), use_bias=False |
s8 | s8 | s8 | s8 | (2), bias stored in s32 format |
s8 | s8 | s8 | s32 | (2), int8-tflite kernels |
(1) usage of the bias is optional
(2) batch-normalization can be fused
(3) maxpool can be inserted between the convolution and the
batch-normalization operators
Misc layers
The following layers are also available to support more complex topologies, for example with residual connections.
layer | input format | output format | notes |
---|---|---|---|
maxpool | s1 | s1 | s8/f32 data type is also supported by the “standard” C-kernels |
concat | s1 | s1 | s8/f32 data type is also supported by the “standard” C-kernels |
Evidence of efficient code generation
Similar to the 'qkeras.print_qstats()' function or the extended 'summary()' function in Larq, the “analyze” command reports a summary of the number of operations used by each generated c-layer according to the data types. The number of operation types for the entire generated c-model is also reported. This last information makes it possible to know whether the deployed model is entirely or partially based on the optimized binarized/quantized c-kernels.
Note
For the size of the deployed weights, the 'ROM'/'weights (ro)' metric indicates the expected size to store the quantized weights on the target. Note that the reported value is compared with the size needed to store the weights in the original format (32-bit floating point). Detailed information by c-layer and by associated tensor is available in the generated reports.
The following example shows that 90% of the operations are binary operations, with the "quant_conv2d_1_conv2d" layer as the main contributor.
$ stedgeai analyze -m <model_file.h5> --target stm32
...
params # : 93,556 items (365.45 KiB)
macc : 2,865,718
weights (ro) : 14,496 B (14.16 KiB) (1 segment) / -359,728(-96.1%) vs float model
activations (rw) : 86,528 B (84.50 KiB) (1 segment)
ram (total) : 89,704 B (87.60 KiB) = 86,528 + 3,136 + 40
...
Number of operations and param per c-layer
-------------------------------------------------------------------------------------------
c_id m_id name (type) #op (type)
-------------------------------------------------------------------------------------------
0 2 quant_conv2d_conv2d (conv2d) 194,720 (smul_f32_f32)
1 3 quant_conv2d_1_conv (conv) 43,264 (conv_f32_s1)
2 1 max_pooling2d (pool) 21,632 (op_s1_s1)
3 3 quant_conv2d_1_conv2d (conv2d_dqnn) 2,230,272 (sxor_s1_s1)
4 5 max_pooling2d_1 (pool) 6,400 (op_s1_s1)
5 7 quant_conv2d_2_conv2d (conv2d_dqnn) 331,776 (sxor_s1_s1)
6 10 quant_dense_quantdense (dense_dqnn_dqnn) 36,864 (sxor_s1_s1)
7 13 quant_dense_1_quantdense (dense_dqnn_dqnn) 640 (sxor_s1_s1)
8 15 activation (nl) 150 (op_f32_f32)
-------------------------------------------------------------------------------------------
total 2,865,718
Number of operation types
---------------------------------------------
smul_f32_f32 194,720 6.8%
conv_f32_s1 43,264 1.5%
op_s1_s1 28,032 1.0%
sxor_s1_s1 2,599,552 90.7%
op_f32_f32 150 0.0%
Example of supported patterns
As already mentioned in the overview, the ST Edge AI Core infers the data types of the input and output tensors from the data types of the incoming and outgoing operator chains, respectively. This section illustrates the typical patterns that are considered to deploy an optimized c-kernel.
Activation layer
An activation layer (including the QActivation layer) is mainly supported by fusing it into the previous or following layer. The supported arguments are the supported quantizer configurations.
x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)
y = qkeras.QConv2D(..)(y)
...

is equivalent to (from the code-generation point of view):

x = tf.keras.Input(shape=(..))
y = larq.layers.QuantConv2D(..,
                            input_quantizer='ste_sign', ..)(x)
...
Quantized dense layer
Dense layer patterns are a sequence of layers composed of:

- an optional QActivation to specify the input format. If missing, the input format is f32
- the QDense layer; use of bias is optional
- an optional other layer, e.g. (Q)BatchNormalization, which can be merged into the previous dense layer
- an optional QActivation to specify the output format. If missing, the output format is f32
The first example illustrates the case where an 8-bit quantized input (s8, quantized with 'quantized_bits(bits=8, integer=7)') is used for the QDense layer, which exploits 1-bit quantized (binary) weights with bias. The output is normalized and in f32.
import qkeras
import tensorflow as tf
shape_in = (128, 128)
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(x)
y = tf.keras.layers.Flatten()(y)
y = qkeras.QDense(16,
kernel_quantizer=qkeras.binary(alpha=1),
bias_quantizer=qkeras.binary(alpha=1),
use_bias=True,
name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("softmax")(y)
model = tf.keras.Model(inputs=x, outputs=y)
Number of operations per c-layer
------- ------ --------------------------- --------- ------------
c_id m_id name (type) #op type
------- ------ --------------------------- --------- ------------
0 3 dense_0 (Dense) 262,144 smul_s8_s1
1 5 activation (Nonlinearity) 240 op_f32_f32
------- ------ --------------------------- --------- ------------
total 262,384
Number of operation types
---------------- --------- -----------
operation type # %
---------------- --------- -----------
smul_s8_s1 262,144 99.9%
op_f32_f32 240 0.1%
The second example shows the case where two dense layers are used. The first uses binary weights with an 8-bit quantized input to compute a binary output. The second layer uses binary weights and the previous binary output to compute the f32 output.
x = tf.keras.Input(shape=shape_in)
y = tf.keras.layers.Flatten()(x)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(y)
y = qkeras.QDense(64,
                  kernel_quantizer=qkeras.binary(alpha=1),
                  bias_quantizer=qkeras.binary(alpha=1),
                  use_bias=True,
                  name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
y = qkeras.QDense(10,
                  kernel_quantizer=qkeras.binary(alpha=1),
                  bias_quantizer=qkeras.binary(alpha=1),
                  use_bias=True,
                  name="dense_1")(y)
y = tf.keras.layers.Activation("softmax")(y)
Number of operations and param per c-layer
-------------------------------------------------------------------------------
c_id   m_id   name (type)                            #op (type)
-------------------------------------------------------------------------------
0      3      dense_0_qdense (dense_dqnn_dqnn)       50,176 (smul_s8_s1)
1      6      dense_1_qdense (dense_dqnn_dqnn)       650 (sxor_s1_s1)
2      7      activation (nl)                        150 (op_f32_f32)
-------------------------------------------------------------------------------
total                                                50,976

Number of operation types
---------------------------------------------
smul_s8_s1         50,176       98.4%
sxor_s1_s1            650        1.3%
op_f32_f32            150        0.3%
Quantized Convolution layer
For the quantized convolution layer, the pattern is similar to the quantized dense layer.
The following example shows the case where binary inputs/weights are used to compute a normalized binary output. A MaxPooling2D layer allows the activations to be compacted.
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False,
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

model = tf.keras.Model(inputs=x, outputs=y)
A variation with the 'strides' argument can be used to avoid the use of the MaxPooling2D layer.
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=False,
                   strides=(2, 2),
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

model = tf.keras.Model(inputs=x, outputs=y)
Residual connections case
The following model shows how residual connections can be created to concatenate the activations. Particular attention must be paid to the shapes to be concatenated, since they must be the same except for the size along the concatenation axis.
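The exact model behind the report below is not reproduced here. As an illustration only, a residual/concatenation pattern of this kind could be written as in the following sketch (hypothetical layer sizes and names, not those of the reported model):

import tensorflow as tf
import qkeras

x = tf.keras.Input(shape=(32, 32, 64))
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)

# main branch: binary convolution, downsampled by the strides
b1 = qkeras.QConv2D(filters=64, kernel_size=(3, 3), strides=(2, 2), padding="same",
                    kernel_quantizer=qkeras.binary(alpha=1),
                    use_bias=False)(y)
b1 = tf.keras.layers.BatchNormalization()(b1)
b1 = qkeras.QActivation(qkeras.binary(alpha=1))(b1)

# shortcut branch: binary depthwise convolution with the same output spatial size
b2 = qkeras.QDepthwiseConv2D(kernel_size=(3, 3), strides=(2, 2), padding="same",
                             depthwise_quantizer=qkeras.binary(alpha=1),
                             use_bias=False)(y)
b2 = tf.keras.layers.BatchNormalization()(b2)
b2 = qkeras.QActivation(qkeras.binary(alpha=1))(b2)

# both branches produce (16, 16, 64) tensors: only the channel axis may differ for the concat
y = tf.keras.layers.Concatenate(axis=-1)([b1, b2])
model = tf.keras.Model(inputs=x, outputs=y)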
...
Number of operations and param per c-layer
--------------------------------------------------------------------------------------------
c_id   m_id   name (type)                                        #op (type)
--------------------------------------------------------------------------------------------
0      1      quant_conv2d_5_conv2d (conv2d_dqnn)                819,200 (sxor_s1_s1)
1      2      max_pooling2d (pool)                               12,800 (op_s1_s1)
2      4      quant_depthwise_conv2d_2_conv2d (conv2d_dqnn)      28,800 (sxor_s1_s1)
3      6      concatenate_2 (concat)                             0 (op_s1_s1)
--------------------------------------------------------------------------------------------
total                                                            860,800

Number of operation types
---------------------------------------------
sxor_s1_s1        848,000       98.5%
op_s1_s1           12,800        1.5%
...
Fallback to 32b floating point kernels
The following code shows a case where the requested configuration, 's8xs1->s8', is not supported and the fallback is applied. This is a typical case where the user has the opportunity to modify the model (after a pre-analysis step) to keep this layer in float, limiting the possible loss of precision (one possible alternative modification is sketched after the report below).
x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(x)
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3), strides=(2, 2),
                   kernel_quantizer=qkeras.binary(alpha=1),
                   bias_quantizer=qkeras.binary(alpha=1),
                   use_bias=True,
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(y)

model = tf.keras.Model(inputs=x, outputs=y)
Number of operations and param per c-layer
----------------------------------------------------------------------------
c_id   m_id   name (type)                          #op (type)
----------------------------------------------------------------------------
0      0      input_1_0_conversion (conv)          1,568 (conv_s8_f32)
1      3      dense_0_conv2d (conv2d)              28,240 (smul_f32_f32)
2      4      q_activation_1 (conv)                6,272 (conv_f32_s8)
----------------------------------------------------------------------------
total                                              36,080

Number of operation types
---------------------------------------------
conv_s8_f32          1,568        4.3%
smul_f32_f32        28,240       78.3%
conv_f32_s8          6,272       17.4%
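One possible modification (a sketch only, assuming the accuracy impact is acceptable for the use case) is to binarize the convolution input so that the layer falls into a supported configuration of the convolution table above (s1 input, s1 weights, s8 output, no bias). Another option, as mentioned above, is to deliberately keep the layer in float.

x = tf.keras.Input(shape=shape_in)
y = qkeras.QActivation(qkeras.binary(alpha=1))(x)                     # s1 input instead of s8
y = qkeras.QConv2D(filters=16, kernel_size=(3, 3), strides=(2, 2),
                   kernel_quantizer=qkeras.binary(alpha=1),           # s1 weights
                   use_bias=False,
                   padding="same",
                   name="dense_0")(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=7))(y)   # s8 output

model = tf.keras.Model(inputs=x, outputs=y)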
Building an efficient DQNN/BNN model for ST Edge AI Core
All highlighted recommendations from the Larq documentation (https://docs.larq.dev/larq/guides/bnn-architecture/) or from the QKeras notebooks (https://github.com/google/qkeras/blob/master/notebook/QKerasTutorial.ipynb or https://notebook.community/google/qkeras/notebook/QKerasTutorial) should be considered to design an efficient DQNN/BNN model for the ST targets. In particular (a sketch applying these recommendations follows the list):
- It is preferable to leave the first layer and the last layer in higher precision: 's8' or 'f32'
- Use the 'BatchNormalization' layer
- Place the 'MaxPool' layer before the 'BatchNormalization' layer
- Due to the way binary tensors are encoded (see the “c-layout of the s1 type” section), it is recommended to have the number of channels as a multiple of 32 to optimize flash/RAM size and MAC/cycle, although it is not mandatory.
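A minimal sketch applying these recommendations (illustrative only, with arbitrary layer sizes, assuming inputs normalized between 0.0 and 1.0): the first and last layers are kept in higher precision, the binary convolution uses a channel count that is a multiple of 32, and the MaxPool layers are placed before the BatchNormalization layers.

import tensorflow as tf
import qkeras

x = tf.keras.Input(shape=(32, 32, 3))

# first layer kept in higher precision (s8 activations and weights)
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, integer=0))(x)
y = qkeras.QConv2D(32, (3, 3), padding="same",
                   kernel_quantizer=qkeras.quantized_bits(bits=8, integer=0, alpha=1),
                   use_bias=False)(y)
y = tf.keras.layers.MaxPooling2D((2, 2))(y)          # MaxPool before BatchNormalization
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

# binary body, 64 channels (multiple of 32)
y = qkeras.QConv2D(64, (3, 3), padding="same",
                   kernel_quantizer=qkeras.binary(alpha=1),
                   use_bias=False)(y)
y = tf.keras.layers.MaxPooling2D((2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)

# last layer kept in float (f32 output)
y = tf.keras.layers.Flatten()(y)
y = tf.keras.layers.Dense(10)(y)
y = tf.keras.layers.Activation("softmax")(y)

model = tf.keras.Model(inputs=x, outputs=y)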
Pre-analyze step
It is recommended to run the “analyze” command regularly during the design of the DQNN/BNN model, before performing a complete training, to know whether it will be deployed efficiently (no fallback used). This avoids using a quantized layer that is not supported, or that brings no gain in terms of memory usage, when it is deployed on an ST target.
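As a reminder, the command form is the same as the one used for the report above (no additional options are assumed here):

$ stedgeai analyze -m <model_file.h5> --target stm32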
FAQ
Is it possible to mix the quantized Larq and QKeras layers?
From the ST Edge AI Core point of view, yes: when the model is imported, it is translated into an independent internal representation before the different optimization passes and the rendering stage are applied. However, this is not recommended: even though each library (Larq or QKeras) is based on the Keras API, they are designed independently, and there is no guarantee of converging to a good level of precision during the training phase.