ST Neural-ART NPU - Supported operators and limitations

for STM32 target, based on ST Edge AI Core Technology 2.2.0

r1.1

Overview

This document describes the model format and the associated quantization scheme required to efficiently deploy a neural network model on the Neural-ART™ accelerator. It also lists the mapping of the different operators.

Quantized Model Format

The ST Neural-ART compiler supports two quantized model formats. Both are based on the same quantization scheme: 8b/8b, ss/sa, per-channel.

  • A quantized TensorFlow Lite model generated by a post-training or quantization-aware training process. The calibration is performed by the TensorFlow Lite framework, principally through the “TFLite converter” utility exporting a TensorFlow file.
  • A quantized ONNX model based on the Tensor-oriented (QDQ; Quantize and DeQuantize) format. The DeQuantizeLinear and QuantizeLinear operators are inserted between the original (float) operators to simulate the quantization and dequantization process. It can be generated with the ONNX Runtime quantization services.

Please refer to the Quantized models chapter for a detailed description of the quantization procedure using the TensorFlow Lite converter or onnxruntime.
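As a quick illustration, a minimal sketch of both flows is shown below; the file paths, input name, and input shape are placeholders for your own model, and the calibration data is random dummy data that should be replaced by representative samples.

    import numpy as np
    import tensorflow as tf
    from onnxruntime.quantization import (
        CalibrationDataReader, QuantFormat, QuantType, quantize_static,
    )

    # Dummy calibration set (replace with real, representative samples).
    calibration_samples = [np.random.rand(1, 224, 224, 3).astype(np.float32)
                           for _ in range(16)]

    # --- TensorFlow Lite post-training quantization (8b/8b, per-channel) ---
    def representative_dataset():
        for sample in calibration_samples:
            yield [sample]

    converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    open("model_int8.tflite", "wb").write(converter.convert())

    # --- ONNX Runtime static quantization, QDQ format (8b/8b, ss/sa) ---
    class Reader(CalibrationDataReader):
        def __init__(self, samples, input_name):
            self._it = iter(samples)
            self._name = input_name
        def get_next(self):
            sample = next(self._it, None)
            return None if sample is None else {self._name: sample}

    quantize_static(
        "model_fp32.onnx", "model_int8_qdq.onnx",
        Reader(calibration_samples, "input"),   # "input" is a placeholder name
        quant_format=QuantFormat.QDQ,
        activation_type=QuantType.QInt8,        # sa: signed activations
        weight_type=QuantType.QInt8,            # ss: signed weights
        per_channel=True,
    )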

If necessary, some layers of the model can stay in float, but they will then be executed on the CPU and not on the Neural-ART NPU.

Inference Model Constraints

The ST Neural-ART processing unit is a reconfigurable, inference-only engine capable of accelerating quantized inference models in hardware; no training mode is supported.

  • Input and output tensors must be static
  • A variable-length batch dimension (i.e. (None,)) is considered equal to 1 (see the sketch after this list)
  • Operators with unconnected outputs are not supported
  • Mixed-data operations (i.e. hybrid operators) are not supported; activations and weights must be quantized
  • The data type of the weights/activations tensors must be:
    • int8 (scale/offset format), ss/sa scheme (see Quantized models – per-channel)
    • if a float32 operation is requested, it is mapped on a SW operation
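For instance, a dynamic batch dimension in an ONNX model can be pinned to 1 before compilation. A minimal sketch, assuming the onnxruntime utility make_dim_param_fixed and a model whose batch dim-param is named "batch" (the path and dim-param name are placeholders):

    import onnx
    from onnxruntime.tools.onnx_model_utils import make_dim_param_fixed

    model = onnx.load("model_int8_qdq.onnx")
    # Pin the symbolic batch dimension to 1 so all tensor shapes are static.
    make_dim_param_fixed(model.graph, "batch", 1)
    onnx.save(model, "model_int8_qdq_static.onnx")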

Custom Layer Support - Post-Processing

Currently, there is no support for custom layers such as the “TFLite_Detection_PostProcess” operator used in object detection models. If possible, it is preferable to remove it before quantizing the model, keeping only the backbone and head parts. This includes the “NMS” (Non-Maximum Suppression) algorithm (generally based on floating-point operations), which is instead handled by an “external” library running on the host MCU processor.
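For reference, the host-side NMS step is simple enough to sketch; the following is an illustrative float implementation of greedy NMS, not the actual library shipped with the toolchain:

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression; boxes are (N, 4) as x1, y1, x2, y2."""
        order = np.argsort(scores)[::-1]   # candidates sorted by decreasing score
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the best remaining box with all the others.
            xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                     * (boxes[order[1:], 3] - boxes[order[1:], 1]))
            iou = inter / (area_i + areas - inter)
            # Drop boxes overlapping the kept one beyond the threshold.
            order = order[1:][iou <= iou_threshold]
        return keep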

TFLite to Neural-ART™ Operation Mapping

The following table lists the operation mapping between TensorFlow Lite (TFLite) and the Neural-ART™ stack. Note that only quantized operations (int8 scale/offset format) can be directly mapped on the Neural-ART™ processing units. If an operator is not mapped on HW, a fallback implementation (int8 or float32 version) is emitted and the host (Cortex-M55) subsystem is used to execute it.

TFLite Operation Mapped On Comment
ABS HW
ADD HW
ARG_MAX SW_FLOAT
ARG_MIN SW_FLOAT
AVERAGE_POOL_2D HW Limited window. Software fallback available
BATCH_MATMUL HW
CAST HW
CEIL HW
CONCATENATION HW
CONV_2D HW
COS SW_FLOAT
COSH SW_FLOAT
DEPTHWISE_CONV_2D HW
DEQUANTIZE SW_FLOAT uint8/int8 -> float
DIV SW_INT
ELU SW_FLOAT
EQUAL HW
EXP SW_FLOAT
EXPAND_DIMS HW
FLOOR SW_FLOAT
FULLY_CONNECTED HW
GATHER SW_FLOAT
GREATER SW_FLOAT
GREATER_EQUAL SW_FLOAT
HARD_SWISH HW
L2_NORMALIZATION SW_FLOAT
LEAKY_RELU HW
LESS SW_FLOAT
LESS_EQUAL SW_FLOAT
LOCAL_RESPONSE_NORMALIZATION SW_FLOAT
LOG SW_FLOAT
LOGICAL_AND HW
LOGICAL_NOT HW
LOGICAL_OR HW
LOGISTIC HW
MAX_POOL_2D HW Limited window. Software fallback available
MAXIMUM HW
MEAN SW_FLOAT
MINIMUM HW
MUL HW SW fallback in case of broadcasting or more than 4 dimensions
NEG HW
PACK HW
PAD HW Partial – depending on the parameters, it can be HW-assisted with extra epochs.
POW SW_FLOAT
PRELU HW
QUANTIZE SW_FLOAT float -> uint8/int8
(RE) QUANTIZE HW uint8 <-> int8
REDUCE_MAX SW_FLOAT
REDUCE_MIN SW_FLOAT
REDUCE_PROD SW_FLOAT
RELU HW
RELU6 HW
RESHAPE HW
RESIZE_BILINEAR SW_FLOAT
RESIZE_NEAREST_NEIGHBOR HW HW if (coordinate_transformation_mode=‘asymmetric’ AND nearest_mode=‘floor’). SW_INT otherwise
ROUND SW_FLOAT
SHAPE HW
SIN SW_FLOAT
SLICE HW
SOFTMAX SW_INT
SPACE_TO_DEPTH HW with same input/output quantization
SPLIT HW HW or SW_INT depending on the axis
SPLIT_V HW HW or SW_INT depending on the axis
STRIDED_SLICE HW
SQUEEZE HW
SQRT SW_FLOAT
SUB HW
SUM SW_FLOAT
TANH HW
TRANSPOSE HW Partial – depending on the parameters, it can be HW-assisted with extra epochs.
TRANSPOSE_CONV HW Partial – depending on the parameters, it can be HW-assisted with extra epochs.
UNPACK HW

ONNX to Neural-ART™ Operation Mapping

The following table lists the operation mapping between ONNX operators and Neural-ART™ operators. Note that only quantized operations (int8 scale/offset format) can be directly mapped on the Neural-ART™ processing units. If an operator is not mapped on HW, a fallback implementation (int8 or float32 version) is emitted and the host (Cortex-M55) subsystem is used to execute it.

ONNX Operator PyTorch Name Mapped On Comment
Abs abs HW HW if Offset is 0, SW_FLOAT otherwise
Acos acos HW
Acosh HW
Add/Sum add HW SW mapping if the broadcast pattern is outside HW-supported broadcasting
And HW
ArgMax argmax SW_FLOAT
ArgMin argmin SW_FLOAT
Asin asin HW
Asinh HW
Atan atan HW
Atanh atanh HW
AveragePool avg_pool1d/avg_pool2d HW See AveragePool Limitations
BatchNormalization BatchNorm1d/BatchNorm2d SW_FLOAT Hardware support if placed after a CONV
Cast HW Software library mapped on DMAs or pure software
Ceil ceil SW_FLOAT
Clip clip/ReLU6 HW
Concat concat/concatenate HW Mapped on HW or SW depending on the batch and concatenation axis
Conv Conv1D/Conv2D HW see Optimal Kernel Size / Stride Size and Tips
ConvTranspose HW
Cos cos HW
Cosh cosh HW
DepthToSpace SW DMA is used with mode “DCR”
DequantizeLinear dequantize SW_FLOAT int8/uint8 -> float
Div div SW_INT HW if second operand is a constant
Elu elu SW_FLOAT
Equal eq HW
Erf erf HW
Exp exp HW
Flatten flatten HW
Floor floor SW_FLOAT
Gather gather SW_FLOAT
Gemm HW
GlobalAveragePool HW Limited window. Software fallback available
GlobalMaxPool HW Limited window. Software fallback available
Greater gt SW_FLOAT
GreaterOrEqual ge SW_FLOAT
Hardmax SW_FLOAT
HardSigmoid HW
HardSwish HW
Identity identity HW
InstanceNormalization InstanceNorm1d/InstanceNorm2d SW_FLOAT
LeakyRelu LeakyReLU HW
Less SW_FLOAT
LessOrEqual le SW_FLOAT
LRN LocalResponseNormalization SW_FLOAT
Log log SW_FLOAT
LpNormalization SW_FLOAT
MatMul HW
MaxPool HW Decomposition into multiple operations if height or width is greater than 3. Horizontal and vertical strides ranging from 1 to 15
Max HW
Min HW
Mod SW_FLOAT
Mul mul HW see broadcasting limitations
Neg neg HW
Not HW
Or HW
Pad ConstantPad1D/ConstantPad2D/ZeroPad2D HW Mapped on ConvAcc, DMA or SW
Pow SW_FLOAT
PRelu HW
QLinearAdd HW Same as Add
QLinearAveragePool HW Same as AveragePool
QLinearConcat HW Same as Concat
QLinearConv HW Same as Conv
QLinearGlobalAveragePool HW same as GlobalAveragePool
QLinearMatMul HW same as MatMul
QLinearMul HW same as Mul
QuantizeLinear quantize_per_tensor SW_FLOAT float -> int8/uint8. int8 requantization supported in HW
Reciprocal reciprocal SW_FLOAT
ReduceLogSumExp SW_FLOAT
ReduceMax HW HW if it can be converted to GlobalMaxPool. No overhead if the reduced axes are the right-most ones and input shape.size() = number of reduced_axes + 2; additional Transpositions and Reshapes might be required otherwise. SW fallback available
ReduceMean HW HW if it can be converted to GlobalAveragePool. No overhead if the reduced axes are the right-most ones and input shape.size() = number of reduced_axes + 2; additional Transpositions and Reshapes might be required otherwise. SW fallback available
ReduceMin SW_FLOAT
ReduceProd SW_FLOAT
ReduceSum SW_FLOAT
Relu relu HW
Reshape reshape HW
Resize upsample SW_INT Resize Nearest Neighbor on HW if coordinate_transformation_mode=‘asymmetric’ AND nearest_mode=‘floor’
Round round SW_FLOAT
Selu selu SW_FLOAT
Shape HW
Sin sin HW
Sinh sinh HW
Sigmoid sigmoid HW HW Acceleration available also for X*Sigmoid(X)
Slice HW DMA acceleration
Softmax softmax SW_INT
Softplus softplus SW_FLOAT
SoftSign softsign SW_FLOAT
Split split HW
Squeeze squeeze HW
Sqrt sqrt HW
Sub sub HW
Tan tan HW
Tanh tanh HW
ThresholdedRelu HW
Tile SW_FLOAT Optimized by a constant-folding pass in the front-end; SW_FLOAT fallback otherwise
Transpose transpose HW
Unsqueeze unsqueeze HW
Upsample upsample SW_FLOAT

STM32N6 Neural-ART™ Processor Unit Tips/Limitations

Optimal Kernel Size / Stride Size

The maximum kernel width supported in hardware is 6 for stride 1, and 12 for strides of 2 or more. The vertical kernel size is limited to a maximum of 3. Any larger kernel dimension leads to a decomposition, performed by the compiler, that has to be processed iteratively.

  • The best kernel height is 3 (using just 2 will not help)
  • The best kernel widths are 3, 6, and 12 (values in between, or larger, will not be better)
  • Avoid horizontal strides other than 1, 2, or 4 if possible (2 and 4 are valid but lead to less data reuse)
  • Avoid vertical strides other than 1, 2, or 4 if possible (vertical strides are in general inefficient)
  • Kernel dimensions of 1 (horizontal or vertical) are special cases (e.g., 1x1, 2x1, 1x2, 1x3, 1x6, 1x12, etc.) and can be handled efficiently. They will run a lot faster than, for example, 3x2 or 2x3
  • Kernels with height 1 have no restrictions on the feature width (theoretical max is 2^16-1).

Do not use strides larger than the kernel dimension (input data would be skipped); the compiler has to play tricks, as this is not natively supported in hardware. Note that layers with more input channels (ICH) than output channels (OCH) can often be mapped with better efficiency.
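To make these guidelines concrete, a hypothetical Keras sketch contrasting an NPU-friendly convolution with one that forces the compiler to decompose (the layer parameters are illustrative only):

    import tensorflow as tf

    # NPU-friendly: 3x3 kernel, stride 2, non-prime channel count.
    friendly = tf.keras.layers.Conv2D(filters=96, kernel_size=(3, 3),
                                      strides=(2, 2))

    # Forces decomposition: kernel height 5 > 3, width 7 > 6 at stride 1,
    # and a prime filter count (97) that limits channel-splitting heuristics.
    unfriendly = tf.keras.layers.Conv2D(filters=97, kernel_size=(5, 7),
                                        strides=(1, 1))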

Special Case: 1x1 Kernel

A 1x1 kernel can be handled very efficiently if the number of input channels (ICH) is equal to N × (72…128) and the number of output channels (OCH) is equal to M × (16…24) (best: M × 24), with N and M integers. If these numbers are larger than the number of available CONV processing units (4 CA for the STM32N6 Neural-ART™), the compiler translates this into multiple iterations.

To illustrate, three typical examples:

  • ICH=128 and OCH=96 would run at the same speed as ICH=32 and OCH=96, even though the second has 4 times fewer MACs than the first (4 CA in parallel)
  • ICH=512 and OCH=24 would run at the same speed as ICH=512 and OCH=4, even though the second has 6 times fewer MACs than the first (4 CA in series)
  • ICH=256 and OCH=48 would run at the same speed as ICH=129 and OCH=25, even though the second has almost 4 times fewer MACs than the first (2 CA in series, 2 chains in parallel)

Zeros in the Kernel or Feature Data

Zeros in the kernel or feature data save power (the corresponding MAC units are gated) but do not affect performance.

Optimal Feature Width

For 8-bit feature data, the feature width multiplied by the used batch depth (input channels) must be equal to or lower than 2048. Thus, for 512-wide features, the batch depth is limited to 4; wider feature sizes further reduce the number of input channels and may even cause the compiler to split the feature into multiple columns.
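A hypothetical helper codifying this limit (the function name is illustrative, not part of the toolchain):

    def fits_line_buffer(feature_width, input_channels, limit=2048):
        """8-bit line-buffer constraint: width x input_channels <= 2048.
        When violated, the compiler reduces the usable input channels or
        splits the feature into multiple columns."""
        return feature_width * input_channels <= limit

    assert fits_line_buffer(512, 4)        # 2048: exactly at the limit
    assert not fits_line_buffer(512, 8)    # 4096: the compiler must split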

Optimal Pool Operations with up to 3x3 Pooling Windows

  • Do not use pooling windows larger than 3; otherwise they will be decomposed by the compiler.
  • The number of input channels is limited. Check that the previous operation is not generating more input channels, else chaining will not be possible.
  • The line buffer of the pooling unit is limited. Avoid using data with [width x input_channels] larger than 2048, else the compiler has to split the operator into multiple columns.

Hardware-supported broadcasting

  • Only unidirectional broadcasting is supported in hardware
  • The best performance is achieved with a number of elements less than or equal to 512
  • Broadcasting is mapped on the Arithmetic Unit with the best performance in the following cases (see the sketch after this list):
    • Scalar broadcasting
    • Channel broadcasting (C <= 512, other dimensions = 1)
    • Height broadcasting (H <= 512, other dimensions = 1)
    • Width broadcasting (W <= 512, other dimensions = 1)
    • Height-Width broadcasting ((H*W) <= 512, other dimensions = 1)
  • DMA-supported broadcasting:
    • Channel broadcasting
    • Height broadcasting
    • Width broadcasting
    • Height-Width broadcasting
    • Channel-Width broadcasting
    • Channel-Height broadcasting
  • When the broadcast input is the second input of the node, it is supported only for commutative operations
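A hypothetical checker codifying the Arithmetic Unit cases above, for a broadcast input of shape (C, H, W) (the function name is illustrative):

    def arith_unit_broadcast_ok(c, h, w):
        """True if the (C, H, W) broadcast input matches one of the
        best-performance Arithmetic Unit cases listed above."""
        if (c, h, w) == (1, 1, 1):
            return True                # scalar broadcasting
        if h == 1 and w == 1:
            return c <= 512            # channel broadcasting
        if c == 1 and w == 1:
            return h <= 512            # height broadcasting
        if c == 1 and h == 1:
            return w <= 512            # width broadcasting
        if c == 1:
            return h * w <= 512        # height-width broadcasting
        return False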

Convolution tips

  • Avoid big dilation factors; they lead to non-optimal performance.
  • Avoid using a prime number of kernels/channels; it limits the heuristics for kernel/channel splitting.
  • When using a very large kernel, use a multiple of 3 for the width and height.
  • Horizontal strides lead to less data reuse; vertical strides are in general inefficient.
  • Pads <= 2 are supported without impacting performance.
  • Zeros in the kernel or feature data save power (gated MAC units) but do not affect performance. Values that are almost zero cannot be used to gate the MAC units, so they bring no power saving.

AveragePool limitations

  • Only 1D and 2D are supported
  • Kernel shape: height and width between 1 and 3 inclusive
  • Horizontal and vertical strides ranging from 1 to 15
  • Pad (see the sketch after this list)
    • Left, right, and top padding ranging from 0 to 7
    • The maximum supported bottom padding is the window height minus 1
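A hypothetical validity check for these constraints (the function name is illustrative, not part of the toolchain):

    def avgpool_hw_ok(kh, kw, sh, sw, pad_l, pad_r, pad_t, pad_b):
        """True if the AveragePool parameters fall within the HW limits above."""
        return (1 <= kh <= 3 and 1 <= kw <= 3          # kernel height/width
                and 1 <= sh <= 15 and 1 <= sw <= 15    # strides
                and all(0 <= p <= 7 for p in (pad_l, pad_r, pad_t))
                and 0 <= pad_b <= kh - 1)              # bottom pad < window height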