ST Neural-ART NPU - Supported operators and limitations

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r1.1

Overview

This document describes the model format and the associated quantization scheme required to efficiently deploy a neural network model on the Neural-ART™ accelerator. It also lists the mapping of the different operators.

Quantized Model Format

The ST Neural-ART compiler supports two quantized model formats. Both are based on the same quantization scheme: 8b/8b, ss/sa, per-channel.

  • A quantized TensorFlow Lite model generated by a post-training or quantization-aware training process. The calibration is performed by the TensorFlow Lite framework, principally through the “TFLite converter” utility exporting a TensorFlow file.
  • A quantized ONNX model based on the Tensor-oriented (QDQ; Quantize and DeQuantize) format. The DeQuantizeLinear and QuantizeLinear operators are inserted between the original (float) operators to simulate the quantization and dequantization process. It can be generated with the ONNX Runtime quantization services.

Please refer to the Quantized models chapter for a detailed description of the quantization procedure using the TensorFlow Lite converter or onnxruntime.
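As a quick illustration, a minimal sketch of both flows is shown below; the file paths, input name, and input shape are placeholders for your own model, and the calibration data is random dummy data that should be replaced by representative samples.

    import numpy as np
    import tensorflow as tf
    from onnxruntime.quantization import (
        CalibrationDataReader, QuantFormat, QuantType, quantize_static,
    )

    # Dummy calibration set (replace with real, representative samples).
    calibration_samples = [np.random.rand(1, 224, 224, 3).astype(np.float32)
                           for _ in range(16)]

    # --- TensorFlow Lite post-training quantization (8b/8b, per-channel) ---
    def representative_dataset():
        for sample in calibration_samples:
            yield [sample]

    converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    open("model_int8.tflite", "wb").write(converter.convert())

    # --- ONNX Runtime static quantization, QDQ format (8b/8b, ss/sa) ---
    class Reader(CalibrationDataReader):
        def __init__(self, samples, input_name):
            self._it = iter(samples)
            self._name = input_name
        def get_next(self):
            sample = next(self._it, None)
            return None if sample is None else {self._name: sample}

    quantize_static(
        "model_fp32.onnx", "model_int8_qdq.onnx",
        Reader(calibration_samples, "input"),   # "input" is a placeholder name
        quant_format=QuantFormat.QDQ,
        activation_type=QuantType.QInt8,        # sa: signed activations
        weight_type=QuantType.QInt8,            # ss: signed weights
        per_channel=True,
    )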

If necessary, some layers of the model can stay in float, but they will then be executed on the CPU and not on the Neural-ART NPU.

Inference Model Constraints

The ST Neural-ART processing unit is a reconfigurable, inference-only engine capable of accelerating quantized inference models in hardware; no training mode is supported.

  • Input and output tensors must be static
  • A variable-length batch dimension (i.e. (None,)) is considered equal to 1 (see the sketch after this list)
  • Operators with unconnected outputs are not supported
  • Mixed-data operations (i.e. hybrid operators) are not supported; activations and weights must be quantized
  • The data type of the weights/activations tensors must be:
    • int8 (scale/offset format), ss/sa scheme (see Quantized models – per-channel)
    • if a float32 operation is requested, it is mapped on a SW operation
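For instance, a dynamic batch dimension in an ONNX model can be pinned to 1 before compilation. A minimal sketch, assuming the onnxruntime utility make_dim_param_fixed and a model whose batch dim-param is named "batch" (the path and dim-param name are placeholders):

    import onnx
    from onnxruntime.tools.onnx_model_utils import make_dim_param_fixed

    model = onnx.load("model_int8_qdq.onnx")
    # Pin the symbolic batch dimension to 1 so all tensor shapes are static.
    make_dim_param_fixed(model.graph, "batch", 1)
    onnx.save(model, "model_int8_qdq_static.onnx")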

Custom Layer Support - Post-Processing

Currently, there is no support for custom layers such as the “TFLite_Detection_PostProcess” operator used in object detection models. If possible, it is preferable to remove it before quantizing the model, keeping only the backbone and head parts. This includes the “NMS” (Non-Maximum Suppression) algorithm (generally based on floating-point operations), which is instead handled by an “external” library running on the host MCU processor.
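For reference, the host-side NMS step is simple enough to sketch; the following is an illustrative float implementation of greedy NMS, not the actual library shipped with the toolchain:

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression; boxes are (N, 4) as x1, y1, x2, y2."""
        order = np.argsort(scores)[::-1]   # candidates sorted by decreasing score
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the best remaining box with all the others.
            xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                     * (boxes[order[1:], 3] - boxes[order[1:], 1]))
            iou = inter / (area_i + areas - inter)
            # Drop boxes overlapping the kept one beyond the threshold.
            order = order[1:][iou <= iou_threshold]
        return keep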

TFLite to Neural-ART™ Operation Mapping

The following table lists the operation mapping between TensorFlow Lite (TFLite) and the Neural-ART™ stack. Note that only quantized operations (int8 scale/offset format) can be directly mapped on the Neural-ART™ processing units. If an operator is not mapped on HW, a fallback implementation (int8 or float32 version) is emitted and the host (Cortex-M55) subsystem is used to execute it.

TFLite Operation Mapped On Comment
ABS HW
ADD HW
ARG_MAX SW_FLOAT
ARG_MIN SW_FLOAT
AVERAGE_POOL_2D HW Limited window. Software fallback available
BATCH_MATMUL HW
CAST HW
CEIL HW
CONCATENATION HW
CONV_2D HW
COS SW_FLOAT
COSH SW_FLOAT
DEPTHWISE_CONV_2D HW
DEQUANTIZE SW_FLOAT uint8/int8 -> float
DIV SW_INT
ELU SW_FLOAT
EQUAL HW
EXP SW_FLOAT
EXPAND_DIMS HW
FLOOR SW_FLOAT
FULLY_CONNECTED HW
GATHER SW_FLOAT
GREATER SW_FLOAT
GREATER_EQUAL SW_FLOAT
HARD_SWISH HW
L2_NORMALIZATION SW_FLOAT
LEAKY_RELU HW
LESS SW_FLOAT
LESS_EQUAL SW_FLOAT
LOCAL_RESPONSE_NORMALIZATION SW_FLOAT
LOG SW_FLOAT
LOGICAL_AND HW
LOGICAL_NOT HW
LOGICAL_OR HW
LOGISTIC HW
MAX_POOL_2D HW Limited window. Software fallback available
MAXIMUM HW
MEAN SW_FLOAT
MINIMUM HW
MUL HW SW fallback in case of broadcasting or more than 4 dimensions
NEG HW
PACK HW
PAD HW Partial – depending on the parameters, it can be HW-assisted with extra epochs.
POW SW_FLOAT
PRELU HW
QUANTIZE SW_FLOAT float -> uint8/int8
(RE) QUANTIZE HW uint8 <-> int8
REDUCE_MAX SW_FLOAT
REDUCE_MIN SW_FLOAT
REDUCE_PROD SW_FLOAT
RELU HW
RELU6 HW
RESHAPE HW
RESIZE_BILINEAR SW_FLOAT
RESIZE_NEAREST_NEIGHBOR HW HW if (coordinate_transformation_mode=‘asymmetric’ AND nearest_mode=‘floor’). SW_INT otherwise
ROUND SW_FLOAT
SHAPE HW
SIN SW_FLOAT
SLICE HW
SOFTMAX SW_INT
SPACE_TO_DEPTH HW with same input/output quantization
SPLIT HW HW or SW_INT depending on the axis
SPLIT_V HW HW or SW_INT depending on the axis
STRIDED_SLICE HW
SQUEEZE HW
SQRT SW_FLOAT
SUB HW
SUM SW_FLOAT
TANH HW
TRANSPOSE HW Partial – depending on the parameters, it can be HW-assisted with extra epochs.
TRANSPOSE_CONV HW Partial – depending on the parameters, it can be HW-assisted with extra epochs.
UNPACK HW

ONNX to Neural-ART™ Operation Mapping

The following table lists the operation mapping between ONNX operators and Neural-ART™ operators. Note that only quantized operations (int8 scale/offset format) can be directly mapped on the Neural-ART™ processing units. If an operator is not mapped on HW, a fallback implementation (int8 or float32 version) is emitted and the host (Cortex-M55) subsystem is used to execute it.

ONNX Operator PyTorch Name Mapped On Comment
Abs abs HW HW if Offset is 0, SW_FLOAT otherwise
Acos acos HW
Acosh HW
Add/Sum add HW SW mapping if the broadcast pattern is outside HW-supported broadcasting
And HW
ArgMax argmax SW_FLOAT
ArgMin argmin SW_FLOAT
Asin asin HW
Asinh HW
Atan atan HW
Atanh atanh HW
AveragePool avg_pool1d/avg_pool2d HW See AveragePool Limitations
BatchNormalization BatchNorm1d/BatchNorm2d SW_FLOAT Hardware support if placed after a CONV
Cast HW Software library mapped on DMAs or pure software
Ceil ceil SW_FLOAT
Clip clip/ReLU6 HW
Concat concat/concatenate HW Mapped on HW or SW depending on the batch and concatenation axis
Conv Conv1D/Conv2D HW see Optimal Kernel Size / Stride Size and Tips
ConvTranspose HW
Cos cos HW
Cosh cosh HW
DepthToSpace SW DMA is used with mode “DCR”
DequantizeLinear dequantize SW_FLOAT int8/uint8 -> float
Div div SW_INT HW if second operand is a constant
Elu elu SW_FLOAT
Equal eq HW
Erf erf HW
Exp exp HW
Flatten flatten HW
Floor floor SW_FLOAT
Gather gather SW_FLOAT
Gemm HW
GlobalAveragePool HW Limited window. Software fallback available
GlobalMaxPool HW Limited window. Software fallback available
Greater gt SW_FLOAT
GreaterOrEqual ge SW_FLOAT
Hardmax SW_FLOAT
HardSigmoid HW
HardSwish HW
Identity identity HW
InstanceNormalization InstanceNorm1d/InstanceNorm2d SW_FLOAT
LeakyRelu LeakyReLU HW
Less SW_FLOAT
LessOrEqual le SW_FLOAT
LRN LocalResponseNormalization SW_FLOAT
Log log SW_FLOAT
LpNormalization SW_FLOAT
MatMul HW
MaxPool HW Decomposition into multiple operations if height or width is greater than 3. Horizontal and vertical strides ranging from 1 to 15
Max HW
Min HW
Mod SW_FLOAT
Mul mul HW see broadcasting limitations
Neg neg HW
Not HW
Or HW
Pad ConstantPad1D/ConstantPad2D/ZeroPad2D HW Mapped on ConvAcc, DMA or SW
Pow SW_FLOAT
PRelu HW
QLinearAdd HW Same as Add
QLinearAveragePool HW Same as AveragePool
QLinearConcat HW Same as Concat
QLinearConv HW Same as Conv
QLinearGlobalAveragePool HW same as GlobalAveragePool
QLinearMatMul HW same as MatMul
QLinearMul HW same as Mul
QuantizeLinear quantize_per_tensor SW_FLOAT float -> int8/uint8. int8 requantization supported in HW
Reciprocal reciprocal SW_FLOAT
ReduceLogSumExp SW_FLOAT
ReduceMax HW HW if it can be converted to GlobalMaxPool. No overhead if the reduced axes are the right-most ones and input shape.size() = number of reduced_axes + 2; additional Transpositions and Reshapes might be required otherwise. SW fallback available
ReduceMean HW HW if it can be converted to GlobalAveragePool. No overhead if the reduced axes are the right-most ones and input shape.size() = number of reduced_axes + 2; additional Transpositions and Reshapes might be required otherwise. SW fallback available
ReduceMin SW_FLOAT
ReduceProd SW_FLOAT
ReduceSum SW_FLOAT
Relu relu HW
Reshape reshape HW
Resize upsample SW_INT Resize Nearest Neighbor on HW if coordinate_transformation_mode=‘asymmetric’ AND nearest_mode=‘floor’
Round round SW_FLOAT
Selu selu SW_FLOAT
Shape HW
Sin sin HW
Sinh sinh HW
Sigmoid sigmoid HW HW Acceleration available also for X*Sigmoid(X)
Slice HW DMA acceleration
Softmax softmax SW_INT
Softplus softplus SW_FLOAT
SoftSign softsign SW_FLOAT
Split split HW
Squeeze squeeze HW
Sqrt sqrt HW
Sub sub HW
Tan tan HW
Tanh tanh HW
ThresholdedRelu HW
Tile SW_FLOAT Optimized by a constant-folding pass in the front-end; SW_FLOAT fallback otherwise
Transpose transpose HW
Unsqueeze unsqueeze HW
Upsample upsample SW_FLOAT

STM32N6 Neural-ART™ Processor Unit Tips/Limitations

Optimal Kernel Size / Stride Size

The maximum kernel width supported in hardware is 6 for stride 1, and 12 for strides of 2 or more. The vertical kernel size is limited to a maximum of 3. Any larger kernel dimension leads to a decomposition, performed by the compiler, that has to be processed iteratively.

  • The best kernel height is 3 (using just 2 will not help)
  • The best kernel widths are 3, 6, and 12 (values in between, or larger, will not be better)
  • Avoid horizontal strides other than 1, 2, or 4 if possible (2 and 4 are valid but lead to less data reuse)
  • Avoid vertical strides other than 1, 2, or 4 if possible (vertical strides are in general inefficient)
  • Kernel dimensions of 1 (horizontal or vertical) are special cases (e.g., 1x1, 2x1, 1x2, 1x3, 1x6, 1x12, etc.) and can be handled efficiently. They will run a lot faster than, for example, 3x2 or 2x3
  • Kernels with height 1 have no restrictions on the feature width (theoretical max is 2^16-1).

Do not use strides larger than the kernel dimension (input data would be skipped); the compiler has to play tricks, as this is not natively supported in hardware. Note that layers with more input channels (ICH) than output channels (OCH) can often be mapped with better efficiency.
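To make these guidelines concrete, a hypothetical Keras sketch contrasting an NPU-friendly convolution with one that forces the compiler to decompose (the layer parameters are illustrative only):

    import tensorflow as tf

    # NPU-friendly: 3x3 kernel, stride 2, non-prime channel count.
    friendly = tf.keras.layers.Conv2D(filters=96, kernel_size=(3, 3),
                                      strides=(2, 2))

    # Forces decomposition: kernel height 5 > 3, width 7 > 6 at stride 1,
    # and a prime filter count (97) that limits channel-splitting heuristics.
    unfriendly = tf.keras.layers.Conv2D(filters=97, kernel_size=(5, 7),
                                        strides=(1, 1))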

Special Case: 1x1 Kernel

A 1x1 kernel can be handled very efficiently if the number of input channels (ICH) is equal to N × (72…128) and the number of output channels (OCH) is equal to M × (16…24) (best: M × 24), with N and M integers. If these numbers are larger than the number of available CONV processing units (4 CA for the STM32N6 Neural-ART™), the compiler translates this into multiple iterations.

To illustrate, three typical examples:

  • ICH=128 and OCH=96 would run at the same speed as ICH=32 and OCH=96, even though the second has 4 times fewer MACs than the first (4 CA in parallel)
  • ICH=512 and OCH=24 would run at the same speed as ICH=512 and OCH=4, even though the second has 6 times fewer MACs than the first (4 CA in series)
  • ICH=256 and OCH=48 would run at the same speed as ICH=129 and OCH=25, even though the second has almost 4 times fewer MACs than the first (2 CA in series, 2 chains in parallel)

Zeros in the Kernel or Feature Data

Zeros in the kernel or feature data save power (the corresponding MAC units are gated) but do not affect performance.

Optimal Feature Width

For 8-bit feature data, the feature width multiplied by the used batch depth (input channels) must be equal to or lower than 2048. Thus, for 512-wide features, the batch depth is limited to 4; wider feature sizes further reduce the number of input channels and may even cause the compiler to split the feature into multiple columns.
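A hypothetical helper codifying this limit (the function name is illustrative, not part of the toolchain):

    def fits_line_buffer(feature_width, input_channels, limit=2048):
        """8-bit line-buffer constraint: width x input_channels <= 2048.
        When violated, the compiler reduces the usable input channels or
        splits the feature into multiple columns."""
        return feature_width * input_channels <= limit

    assert fits_line_buffer(512, 4)        # 2048: exactly at the limit
    assert not fits_line_buffer(512, 8)    # 4096: the compiler must split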

Optimal Pool Operations with up to 3x3 Pooling Windows

  • Do not use pooling windows larger than 3; otherwise they will be decomposed by the compiler.
  • The number of input channels is limited. Check that the previous operation is not generating more input channels, else chaining will not be possible.
  • The line buffer of the pooling unit is limited. Avoid using data with [width x input_channels] larger than 2048, else the compiler has to split the operator into multiple columns.

Hardware-supported broadcasting

  • Only unidirectional broadcasting is supported in hardware
  • The best performance is achieved with a number of elements less than or equal to 512
  • Broadcasting is mapped on the Arithmetic Unit with the best performance in the following cases (see the sketch after this list):
    • Scalar broadcasting
    • Channel broadcasting (C <= 512, other dimensions = 1)
    • Height broadcasting (H <= 512, other dimensions = 1)
    • Width broadcasting (W <= 512, other dimensions = 1)
    • Height-Width broadcasting ((H*W) <= 512, other dimensions = 1)
  • DMA-supported broadcasting:
    • Channel broadcasting
    • Height broadcasting
    • Width broadcasting
    • Height-Width broadcasting
    • Channel-Width broadcasting
    • Channel-Height broadcasting
  • When the broadcast input is the second input of the node, it is supported only for commutative operations
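A hypothetical checker codifying the Arithmetic Unit cases above, for a broadcast input of shape (C, H, W) (the function name is illustrative):

    def arith_unit_broadcast_ok(c, h, w):
        """True if the (C, H, W) broadcast input matches one of the
        best-performance Arithmetic Unit cases listed above."""
        if (c, h, w) == (1, 1, 1):
            return True                # scalar broadcasting
        if h == 1 and w == 1:
            return c <= 512            # channel broadcasting
        if c == 1 and w == 1:
            return h <= 512            # height broadcasting
        if c == 1 and h == 1:
            return w <= 512            # width broadcasting
        if c == 1:
            return h * w <= 512        # height-width broadcasting
        return False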

Convolution tips

  • Avoid big dilation factors; they lead to non-optimal performance.
  • Avoid using a prime number of kernels/channels; it limits the heuristics for kernel/channel splitting.
  • When using a very large kernel, use a multiple of 3 for the width and height.
  • Horizontal strides lead to less data reuse; vertical strides are in general inefficient.
  • Pads <= 2 are supported without impacting performance.
  • Zeros in the kernel or feature data save power (gated MAC units) but do not affect performance. Values that are almost zero cannot be used to gate the MAC units, so they bring no power saving.

AveragePool limitations

  • Only 1D and 2D are supported
  • Kernel shape: height and width between 1 and 3 inclusive
  • Horizontal and vertical strides ranging from 1 to 15
  • Pad (see the sketch after this list)
    • Left, right, and top padding ranging from 0 to 7
    • The maximum supported bottom padding is the window height minus 1
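A hypothetical validity check for these constraints (the function name is illustrative, not part of the toolchain):

    def avgpool_hw_ok(kh, kw, sh, sw, pad_l, pad_r, pad_t, pad_b):
        """True if the AveragePool parameters fall within the HW limits above."""
        return (1 <= kh <= 3 and 1 <= kw <= 3          # kernel height/width
                and 1 <= sh <= 15 and 1 <= sw <= 15    # strides
                and all(0 <= p <= 7 for p in (pad_l, pad_r, pad_t))
                and 0 <= pad_b <= kh - 1)              # bottom pad < window height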