ST Neural-ART NPU - Supported operators and limitations
for STM32 target, based on ST Edge AI Core Technology 2.2.0
r1.1
Overview
This document describes the model format and the associated quantization scheme required to deploy a neural network model efficiently on the Neural-ART™ accelerator. It also lists the mapping of the different operators.
Quantized Model Format
The ST Neural-ART compiler supports two quantized model formats, both based on the same quantization scheme (8b/8b, ss/sa, per-channel):
- A quantized TensorFlow Lite model generated by a post-training or quantization-aware training process. The calibration is performed by the TensorFlow Lite framework, principally through the "TF Lite converter" utility that exports the TensorFlow model.
- A quantized ONNX model based on the tensor-oriented QDQ (Quantize/DeQuantize) format. The DequantizeLinear and QuantizeLinear operators are inserted between the original operators (in float) to simulate the quantization and dequantization process. Such a model can be generated with the ONNX Runtime quantization services.
Please refer to the Quantized models chapter for a detailed description of the quantization procedure using the TensorFlow Lite converter or ONNX Runtime.
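For reference, the sketch below shows a minimal version of both flows; the model paths, input shapes, and calibration data are placeholders, and the exact options should be taken from the Quantized models chapter.

```python
import numpy as np
import tensorflow as tf
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

# --- TensorFlow Lite post-training quantization (int8 inputs/outputs) ---
def representative_dataset():
    # Placeholder calibration samples; use real preprocessed inputs instead.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("model_int8.tflite", "wb").write(converter.convert())

# --- ONNX Runtime static quantization, QDQ format, int8 per-channel ---
class RandomCalibrationReader(CalibrationDataReader):
    """Placeholder calibration reader; feed real samples in practice."""
    def __init__(self, n=100):
        self._it = iter({"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
                        for _ in range(n))

    def get_next(self):
        return next(self._it, None)

quantize_static("model_fp32.onnx", "model_int8_qdq.onnx",
                calibration_data_reader=RandomCalibrationReader(),
                quant_format=QuantFormat.QDQ,
                activation_type=QuantType.QInt8,
                weight_type=QuantType.QInt8,
                per_channel=True)
```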
If necessary, some layers of the model can stay in float, but they will be executed on the CPU and not on the Neural-ART NPU.
Inference Model Constraints
The ST Neural-ART processor unit is a reconfigurable, inference-only engine capable of accelerating quantized inference models in hardware; no training mode is supported.
- Input and output tensors must be static
- A variable-length batch dimension (i.e. (None,)) is treated as equal to 1
- Operators with unconnected outputs are not supported
- Mixed data-type operations (i.e. hybrid operators) are not supported; activations and weights must both be quantized
- The data type of the weight/activation tensors must be:
- int8 (scale/offset format), ss/sa scheme (see Quantized models – per-channel)
- If a float32 operation is requested, it is mapped to a SW operation
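A quick way to check that a converted model satisfies these constraints is to inspect its input/output tensors, for example with the TFLite interpreter (a minimal sketch; the model path is a placeholder):

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_input_details() + interpreter.get_output_details():
    # The dtype should be int8 and the shape should be fully static
    # (any variable batch dimension is treated as 1 by the compiler).
    print(detail["name"], detail["dtype"], detail["shape"])
```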
Custom Layer Support - Post-Processing
Currently, there is no support for custom layers such as the "TFLite_Detection_PostProcess" operator used in object detection models. If possible, it is preferable to remove this operator before quantizing the model, keeping only the backbone and head parts. This includes the NMS (Non-Maximum Suppression) algorithm (generally based on floating-point operations), which should instead be handled by an "external" library running on the host MCU processor.
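As an illustration only (the model file and layer names below are hypothetical), one way to do this with a Keras object detection model is to rebuild a sub-model that stops at the raw head outputs before quantization:

```python
import tensorflow as tf

# Hypothetical detector whose last stage is the detection post-processing / NMS.
full_model = tf.keras.models.load_model("detector_float.h5")

# Keep only the backbone and head: stop at the raw box/score tensors
# ("raw_boxes" and "raw_scores" are placeholder layer names).
trimmed = tf.keras.Model(
    inputs=full_model.input,
    outputs=[full_model.get_layer("raw_boxes").output,
             full_model.get_layer("raw_scores").output],
)

# Quantize and convert `trimmed`; NMS is then run by a library on the host MCU.
```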
TFLite to Neural-ART™ Operation Mapping
The following table lists the operation mapping between TensorFlow Lite (TFLite) and the Neural-ART™ stack. Note that only quantized operations (int8 scale/offset format) can be directly mapped onto the Neural-ART™ processing units. If an operator is not mapped on HW, a fallback implementation (int8 or float32 version) is emitted and the host (Cortex-M55) subsystem is used to execute it.
TFLite Operation | Mapped On | Comment |
---|---|---|
ABS | HW | |
ADD | HW | |
ARG_MAX | SW_FLOAT | |
ARG_MIN | SW_FLOAT | |
AVERAGE_POOL_2D | HW | Limited window. Software fallback available |
BATCH_MATMUL | HW | |
CAST | HW | |
CEIL | HW | |
CONCATENATION | HW | |
CONV_2D | HW | |
COS | SW_FLOAT | |
COSH | SW_FLOAT | |
DEPTHWISE_CONV_2D | HW | |
DEQUANTIZE | SW_FLOAT | uint8/int8 -> float |
DIV | SW_INT | |
ELU | SW_FLOAT | |
EQUAL | HW | |
EXP | SW_FLOAT | |
EXPAND_DIMS | HW | |
FLOOR | SW_FLOAT | |
FULLY_CONNECTED | HW | |
GATHER | SW_FLOAT | |
GREATER | SW_FLOAT | |
GREATER_EQUAL | SW_FLOAT | |
HARD_SWISH | HW | |
L2_NORMALIZATION | SW_FLOAT | |
LEAKY_RELU | HW | |
LESS | SW_FLOAT | |
LESS_EQUAL | SW_FLOAT | |
LOCAL_RESPONSE_NORMALIZATION | SW_FLOAT | |
LOG | SW_FLOAT | |
LOGICAL_AND | HW | |
LOGICAL_NOT | HW | |
LOGICAL_OR | HW | |
LOGISTIC | HW | |
MAX_POOL_2D | HW | Limited window. Software fallback available |
MAXIMUM | HW | |
MEAN | SW_FLOAT | |
MINIMUM | HW | |
MUL | HW | SW fallback with broadcast or more than 4 dimensions |
NEG | HW | |
PACK | HW | |
PAD | HW | Partial – according to the parameters, it can be HW-assisted with the extra epochs. |
POW | SW_FLOAT | |
PRELU | HW | |
QUANTIZE | SW_FLOAT | float -> uint8/int8 |
(RE) QUANTIZE | HW | uint8 <-> int8 |
REDUCE_MAX | SW_FLOAT | |
REDUCE_MIN | SW_FLOAT | |
REDUCE_PROD | SW_FLOAT | |
RELU | HW | |
RELU6 | HW | |
RESHAPE | HW | |
RESIZE_BILINEAR | SW_FLOAT | |
RESIZE_NEAREST_NEIGHBOR | HW | HW if (coordinate_transformation_mode=‘asymmetric’ AND nearest_mode=‘floor’). SW_INT otherwise |
ROUND | SW_FLOAT | |
SHAPE | HW | |
SIN | SW_FLOAT | |
SLICE | HW | |
SOFTMAX | SW_INT | |
SPACE_TO_DEPTH | HW | with same input/output quantization |
SPLIT | HW | HW or SW_INT depends on axis |
SPLIT_V | HW | HW or SW_INT depends on axis |
STRIDED_SLICE | HW | |
SQUEEZE | HW | |
SQRT | SW_FLOAT | |
SUB | HW | |
SUM | SW_FLOAT | |
TANH | HW | |
TRANSPOSE | HW | Partial – according to the parameters, it can be HW-assisted with the extra epochs. |
TRANSPOSE_CONV | HW | Partial – according to the parameters, it can be HW-assisted with the extra epochs. |
UNPACK | HW | |
ONNX to Neural-ART™ Operation Mapping
The following table lists the operation mapping between ONNX operators and the Neural-ART™ stack. Note that only quantized operations (int8 scale/offset format) can be directly mapped onto the Neural-ART™ processing units. If an operator is not mapped on HW, a fallback implementation (int8 or float32 version) is emitted and the host (Cortex-M55) subsystem is used to execute it.
ONNX Operator | PyTorch Name | Mapped On | Comment |
---|---|---|---|
Abs | abs | HW | HW if Offset is 0, SW_FLOAT otherwise |
Acos | acos | HW | |
Acosh | | HW | |
Add/Sum | add | HW | SW mapping if the broadcast pattern is outside HW-supported broadcasting |
And | | HW | |
ArgMax | argmax | SW_FLOAT | |
ArgMin | argmin | SW_FLOAT | |
Asin | asin | HW | |
Asinh | | HW | |
Atan | atan | HW | |
Atanh | atanh | HW | |
AveragePool | avg_pool1d/avg_pool2d | HW | See AveragePool Limitations |
BatchNormalization | BatchNorm1D/BatchNorm2D | SW_FLOAT | Hardware support if placed after a CONV |
Cast | | HW | Software library mapped on DMAs or pure software |
Ceil | ceil | SW_FLOAT | |
Clip | clip/ReLU6 | HW | |
Concat | concat/concatenate | HW | Mapped on HW or SW depending on the batch and concatenation axis |
Conv | Conv1D/Conv2D | HW | see Optimal Kernel Size / Stride Size and Tips |
ConvTranspose | | HW | |
Cos | cos | HW | |
Cosh | cosh | HW | |
DepthToSpace | | SW | DMA is used with mode "DCR" |
DequantizeLinear | dequantize | SW_FLOAT | int8/uint8 -> float |
Div | div | SW_INT | HW if second operand is a constant |
Elu | elu | SW_FLOAT | |
Equal | eq | HW | |
Erf | erf | HW | |
Exp | exp | HW | |
Flatten | flatten | HW | |
Floor | floor | SW_FLOAT | |
Gather | gather | SW_FLOAT | |
Gemm | | HW | |
GlobalAveragePool | | HW | Limited window. Software fallback available |
GlobalMaxPool | | HW | Limited window. Software fallback available |
Greater | gt | SW_FLOAT | |
GreaterOrEqual | ge | SW_FLOAT | |
Hardmax | | SW_FLOAT | |
HardSigmoid | | HW | |
HardSwish | | HW | |
Identity | identity | HW | |
InstanceNormalization | InstanceNorm1d/InstanceNorm2d | SW_FLOAT | |
LeakyRelu | LeakyReLU | HW | |
Less | | SW_FLOAT | |
LessOrEqual | le | SW_FLOAT | |
LRN | LocalResponseNormalization | SW_FLOAT | |
Log | log | SW_FLOAT | |
LpNormalization | | SW_FLOAT | |
MatMul | | HW | |
MaxPool | | HW | Decomposed into multiple operations if height or width is greater than 3. Horizontal and vertical strides ranging from 1 to 15 |
Max | | HW | |
Min | | HW | |
Mod | | SW_FLOAT | |
Mul | mul | HW | see broadcasting limitations |
Neg | neg | HW | |
Not | | HW | |
Or | | HW | |
Pad | ConstantPad1D/ConstantPad2D/ZeroPad2D | HW | Mapped on ConvAcc, DMA or SW |
Pow | | SW_FLOAT | |
PRelu | | HW | |
QLinearAdd | | HW | Same as Add |
QLinearAveragePool | | HW | Same as AveragePool |
QLinearConcat | | HW | Same as Concat |
QLinearConv | | HW | Same as Conv |
QLinearGlobalAveragePool | | HW | Same as GlobalAveragePool |
QLinearMatMul | | HW | Same as MatMul |
QLinearMul | | HW | Same as Mul |
QuantizeLinear | quantize_per_tensor | SW_FLOAT | float -> int8/uint8; int8 requantization supported in HW |
Reciprocal | reciprocal | SW_FLOAT | |
ReduceLogSumExp | | SW_FLOAT | |
ReduceMax | | HW | HW if it can be converted to GlobalMaxPool. No overhead if the reduced axes are the right-most ones and input shape.size() = number of reduced axes + 2; additional Transpose and Reshape operations might be required otherwise. SW fallback available |
ReduceMean | | HW | HW if it can be converted to GlobalAveragePool. No overhead if the reduced axes are the right-most ones and input shape.size() = number of reduced axes + 2; additional Transpose and Reshape operations might be required otherwise. SW fallback available |
ReduceMin | | SW_FLOAT | |
ReduceProd | | SW_FLOAT | |
ReduceSum | | SW_FLOAT | |
Relu | relu | HW | |
Reshape | reshape | HW | |
Resize | upsample | SW_INT | Resize Nearest Neighbor on HW if coordinate_transformation_mode=‘asymmetric’ AND nearest_mode=‘floor’ |
Round | round | SW_FLOAT | |
Selu | selu | SW_FLOAT | |
Shape | | HW | |
Sin | sin | HW | |
Sinh | sinh | HW | |
Sigmoid | sigmoid | HW | HW Acceleration available also for X*Sigmoid(X) |
Slice | | HW | DMA acceleration |
Softmax | softmax | SW_INT | |
Softplus | softplus | SW_FLOAT | |
SoftSign | softsign | SW_FLOAT | |
Split | split | HW | |
Squeeze | squeeze | HW | |
Sqrt | sqrt | HW | |
Sub | sub | HW | |
Tan | tan | HW | |
Tanh | tanh | HW | |
ThresholdedRelu | | HW | |
Tile | | SW_FLOAT | Optimized by a constant-folding pass in the front-end, SW_FLOAT fallback otherwise |
Transpose | transpose | HW | |
Unsqueeze | unsqueeze | HW | |
Upsample | upsample | SW_FLOAT |
STM32N6 Neural-ART™ Processor Unit Tips/Limitations
Optimal Kernel Size / Stride Size
The maximum kernel width supported in hardware is 6 for a stride of 1, and 12 for strides of 2 or more. The vertical kernel size is limited to a maximum of 3. Larger kernel dimensions lead to a decomposition done by the compiler, which has to be processed iteratively.
- Best use of a kernel height is 3 (using just 2 will not help)
- Best use of a kernel width is 3, 6, or 12 (intermediate widths larger than 1 will not perform better)
- Avoid horizontal strides that are not 1, 2, or 4 if possible (2 and 4 are valid but lead to less data reuse)
- Avoid vertical strides that are not 1, 2, or 4 if possible (vertical strides are in general inefficient)
- Kernel dimensions of 1 (horizontal or vertical) are special cases (e.g., 1x1, 2x1, 1x2, 1x3, 1x6, 1x12, etc.) and can be handled efficiently. They will run a lot faster than, for example, 3x2 or 2x3
- Kernels with height 1 have no restrictions on the feature width (theoretical max is 2^16-1).
Do not use strides larger than the kernel dimension (input data is skipped, which is wasteful); the compiler has to play tricks, as this is not natively supported in hardware. Note that layers with more input channels (ICH) than output channels (OCH) can often be mapped with better efficiency.
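As a quick illustration of the recommendations above, here is a sketch using Keras layer definitions (the filter counts and shapes are arbitrary examples, not requirements):

```python
import tensorflow as tf

# Hardware-friendly choices: 3x3 kernels, stride 1 or 2, and 1x1 pointwise convolutions.
conv_3x3 = tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), strides=(1, 1), padding="same")
conv_1x1 = tf.keras.layers.Conv2D(filters=96, kernel_size=(1, 1), strides=(1, 1))
conv_1x6 = tf.keras.layers.Conv2D(filters=64, kernel_size=(1, 6), strides=(1, 1))

# Less efficient choices: a 2x3 kernel, or a stride larger than the kernel,
# which force the compiler to decompose the operation or work around the hardware.
conv_2x3   = tf.keras.layers.Conv2D(filters=64, kernel_size=(2, 3), strides=(1, 1))
conv_big_s = tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), strides=(5, 5))
```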
Special Case: 1x1 Kernel
A 1x1 kernel can be handled very efficiently if the number of input channels (ICH) is equal to N x (72…128) and the number of output channels (OCH) is equal to M x (16…24) (best: M x 24), with N and M integers. If these numbers are larger than the number of available CONV processing units (4 CAs for the STM32N6 Neural-ART™), the compiler translates this into multiple iterations.
To illustrate, three typical examples:
- ICH=128 and OCH=96 runs at the same speed as ICH=32 and OCH=96, even though the latter has 4 times fewer MACs than the former (4 CAs in parallel)
- ICH=512 and OCH=24 runs at the same speed as ICH=512 and OCH=4, even though the latter has 6 times fewer MACs than the former (4 CAs in series)
- ICH=256 and OCH=48 runs at the same speed as ICH=129 and OCH=25, even though the latter has almost 4 times fewer MACs than the former (2 CAs in series, 2 chains in parallel)
Zeros in the Kernel or Feature Data
They save power (the MAC units are gated) but do not affect performance.
Optimal Feature Width
For 8-bit feature data, the feature width multiplied by the batch depth used (input channels) must be less than or equal to 2048. Thus, for 512-wide features, the batch depth is limited to 4; wider features further reduce the number of input channels and may even cause the compiler to split the feature into multiple columns.
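A small helper makes the rule easy to check (a sketch; the 2048 limit is the 8-bit value stated above):

```python
def fits_feature_buffer(feature_width: int, input_channels: int, limit: int = 2048) -> bool:
    """Return True if feature_width x input_channels stays within the 8-bit limit."""
    return feature_width * input_channels <= limit

# 512-wide features: at most 4 input channels fit (512 * 4 = 2048).
assert fits_feature_buffer(512, 4)
assert not fits_feature_buffer(512, 8)   # the compiler would reduce channels or split into columns
```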
Optimal Pool Operations with up to 3x3 Pooling Windows
- Don’t use pooling windows larger than 3, else they will be decomposed by the compiler.
- The number of input channels is limited. Check that a previous operation is not generating more input channels, else chaining will not be possible.
- The line buffer of the pooling unit is limited. Avoid using data with [width x input_channels] larger than 2048 else the compiler has to split the operator into multiple columns.
Hardware supported broadcasting
- Only Unidirectional broadcasting is supported in hardware
- The best performance is achieved with a number of elements less or equal to 512.
- Broadcasting is mapped on the Arithmetic Unit with best performance in case of:
- Scalar broadcasting
- Channel broadcasting (C <= 512, other dimensions =1)
- Height broadcasting (H <= 512, other dimensions =1)
- Width broadcasting (W <= 512, other dimensions =1)
- Height-Width broadcasting ((H*W) <= 512, other dimensions =1)
- DMAs supported broadcasting:
- Channel broadcasting
- Height broadcasting
- Width broadcasting
- Height-Width broadcasting
- Channel-Width broadcasting
- Channel-Height broadcasting
- When the broadcasted input is the second input of the node
- Supported only for commutative operations
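To illustrate the patterns above, the sketch below writes them as NumPy shapes in (C, H, W) order; the actual tensor layout used by the compiler is not shown, only which dimensions differ from 1:

```python
import numpy as np

x = np.ones((64, 32, 32), dtype=np.int8)        # full activation tensor

scalar   = np.ones((1, 1, 1), dtype=np.int8)    # scalar broadcasting
channel  = np.ones((64, 1, 1), dtype=np.int8)   # channel broadcasting (C = 64 <= 512)
height   = np.ones((1, 32, 1), dtype=np.int8)   # height broadcasting
width    = np.ones((1, 1, 32), dtype=np.int8)   # width broadcasting
hw_plane = np.ones((1, 32, 32), dtype=np.int8)  # height-width broadcasting (H*W = 1024 > 512,
                                                # so this pattern falls back to the DMA path)

# Keep the broadcast tensor as the second operand of a commutative operation
# (e.g. Mul/Add), in line with the unidirectional-broadcast notes above.
y = x * channel
```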
Convolutions tips
- Avoid big dilation factors; they lead to non-optimal performance.
- Avoid using a prime number for the number of kernels/channels; it limits the heuristics for kernel/channel splitting.
- When using a very large kernel, use a multiple of 3 for width and height.
- Horizontal strides lead to less data reuse.
- Vertical strides are in general inefficient.
- Pad <= 2 is supported without impacting performance.
- Zeros in the kernel or feature data save power (gated MACs) but do not affect performance. Values that are almost zero cannot be used to gate the MAC units, so there is no power saving.
AveragePool limitations
- Only 1D and 2D supported
- Kernel shape: height and width between 1 and 3 inclusive
- Horizontal and vertical strides ranging from 1 to 15
- Pad
- Left, right and top padding ranging from 0 to 7
- Maximum bottom padding supported is the window height minus 1
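For example, a pooling layer that stays within these limits could look like the following Keras sketch (the parameters are illustrative only):

```python
import tensorflow as tf

# 3x3 window, stride 2 (within the 1..15 range), framework-managed padding.
pool = tf.keras.layers.AveragePooling2D(pool_size=(3, 3), strides=(2, 2), padding="valid")
```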