Evaluation report and metrics
ST Edge AI Core Technology 2.2.0
r2.6
Purpose
This article describes the different metrics (and the associated computing flow) used to evaluate the accuracy of the generated C-files (or C-model), primarily through the validate command. The proposed metrics should be considered as generic indicators. They allow numerical comparison of the predictions of the C-model against the predictions of the original model. Only simple scalar values are computed; no specific thresholds or pre/post-processing are used. For postprocessing evaluation, all injected and predicted data (including the data from the original model) are saved. The end user can also execute the C-model locally for an efficient, advanced, or customized validation flow.
Warning
Be aware that the underlying validation engine is not designed and optimized, in terms of execution time and host resource usage, to validate a pretrained model as during a training/test phase. A small and representative subset of the whole training dataset is expected to test the accuracy of the generated C-model running in the desktop/host or on-target environment.
Evaluated metrics
metric | category | description | applicable to |
---|---|---|---|
MACC | complexity | computational complexity | all models |
ROM/RAM | memory | memory-related metrics | all models |
metric | category | description | applicable to |
---|---|---|---|
ACC | perf | accuracy (Classification accuracy) | only classifier models (float and integer format) |
RMSE | perf | Root Mean Square Error | all models |
MAE | perf | Mean Absolute Error | all models |
L2R | perf | L2 relative Error | all models |
MEAN | perf | Arithmetic mean of the error | all models |
STD | perf | Standard deviation of the error | all models |
NSE | perf | Nash-Sutcliffe efficiency criteria | all models |
COS | perf | Cosine Similarity | all models |
CM | perf | Confusion Matrix | only classifier models (float and integer format) |
Computation of the metrics
The MACC and ROM/RAM metrics are computed during the import of the model. The other metrics are evaluated during the validation process (that is, the validate command). By default, no user data are requested since the models are fed with random data. However, the user can also provide a representative preprocessed dataset (with or without the references). A dedicated file always saves the raw input and output data, which a postprocessing user script can use.
- [ I ] designates the list of the preprocessed samples (or inputs) which are used to feed the original model and the C-model. It can be provided by the user (see the “Input validation files” and “Specific attention on the provided data” sections) or randomly generated (see the “Random data generation” section). They are used as-is without preprocessing, with a potential exception for the quantized models.
- [ P ] designates the list of the predicted samples inferred by the C-model.
- [ R’ ] designates the list of the predicted samples inferred by the original model.
- {optional} [ R ] designates the list of the predicted output samples provided by the user. They are used as the ground truth or reference values.
At the end of the process, metrics are summarized in a simple table.
Evaluation report (summary)
-----------------------------------------------------------------------------------------------------------------------------------------
Output acc rmse mae l2r mean std nse cos tensor
-----------------------------------------------------------------------------------------------------------------------------------------
HOST c-model #1 100.00% 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 nl_3, (4,), m_id=[3]
original model #1 100.00% 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 nl_3, (4,), m_id=[3]
X-cross #1 100.00% 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 nl_3, (4,), m_id=[3]
-----------------------------------------------------------------------------------------------------------------------------------------
- X-cross #1 indicates the metrics which are evaluated with the [ P ] and [ R’ ] data for the first output #1 (there is one line per output). In this case, the predicted values [ R’ ] are considered as the references.
- HOST [or target] C-model (respectively original model) designates the case where [ R ] data are also provided by the user, allowing to compute the metrics against these references or ground truth values. If [ R ] is not provided, only the X-cross #1 metrics are computed.
When the generated c-model is executed on the device, the same built-in validation flow is applied.
Data collect
At the end of the validation flow, all input and generated data are saved, allowing specific postprocess analysis (see the “Postprocessing example” section). The stored data are the raw data [ I ] which are used to feed (w/o preprocessing) the C-model ('c_inputs_') and the original model ('m_inputs_'). The generated predictions [ P ] ('c_outputs_') and the generated references [ R’ ] ('m_outputs_') are also stored. The format of the generated files ('npz/csv') is described in the “Input validation files” section.
And for the quantized models?
No specific metrics are defined for the quantized models. The same metrics are processed whether the model is quantized or not, with or without integer inputs/outputs. Be aware that the metrics are always computed with the float32 data type. If the data type is int8 (or uint8), the data are dequantized beforehand. The same scale/zero-point values from the original model are used for all the data: [ P ], [ R’ ] and [ R ]. Symmetrically, if the float32 data type is provided, the data are quantized before feeding the model.
Note
If the uint8 data type is provided, and the requested quantized model expects the int8 or float32 type, an exception is raised. The data importer is not able to convert the data to the expected format.
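For reference, the following is a minimal sketch of the affine (de)quantization scheme implied above; the scale/zero-point values are placeholders and should be taken from the model's input/output tensors:

```python
import numpy as np

SCALE, ZERO_POINT = 0.00390625, -128   # placeholder values, read them from the model

def dequantize(q, scale=SCALE, zp=ZERO_POINT):
    """int8/uint8 -> float32, as done before computing the metrics."""
    return scale * (q.astype(np.float32) - zp)

def quantize(x, scale=SCALE, zp=ZERO_POINT, dtype=np.int8):
    """float32 -> int8/uint8, as done before feeding a quantized model."""
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zp
    return np.clip(q, info.min, info.max).astype(dtype)
```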
“no-exec-model” option
The '--no-exec-model' option allows performing the built-in validation flow without executing the original model. It can be used when a particular model is imported and generated but the associated inference engine embedded in the pack is not compatible or is older. The other advantage is a shorter execution time when only the data predicted by the c-model are requested.
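For example:

```
$ stedgeai validate -m <model_file> --target stm32 --no-exec-model
```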
Specific attention on the provided data
To have an accurate built-in validation process and more significant metrics, it is important to feed the original model and the generated c-model with data as close as possible to the validation dataset used to test the imported model. If the raw dataset has been preprocessed, the user should build a representative dataset with a subset of these preprocessed data.
This recommendation is also applicable to the output data, since the computation of the metrics is based on element-wise operations between the provided references and the predicted values. For a classifier (one or multiple classes), for example, one-hot encoded data can be provided, while integer encoding is not supported.
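For example, integer class labels can be converted to the expected one-hot encoding before being saved as reference outputs (a minimal numpy sketch; the label values are arbitrary):

```python
import numpy as np

labels = np.array([2, 0, 3, 1])                  # integer encoding (not supported)
one_hot = np.eye(4, dtype=np.float32)[labels]    # one-hot encoding (supported)
# one_hot[0] -> [0., 0., 1., 0.]
```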
The following example illustrates a typical case where particular attention is required. The STM32Cube function pack for computer vision (https://www.st.com/en/embedded-software/fp-ai-vision1.html) provides an advanced debug mode which allows injecting or dumping the preprocessed images. If the user wants to validate their pretrained model with the ST Edge AI Core validation engine and to compare it with the deployed model, they must ensure that the format and the applied preprocessing are similar in order to have the most accurate validation metrics. In this use case, to take into account the preprocessing done by the pack, it is recommended to create a set of representative preprocessed images to test different models offline or to refine an existing pretrained model.
Random data generation and “--range/--seed” options
When no user data are provided, random data are generated. By default, the values are uniformly distributed in the range [0.0, 1.0[ with a fixed seed to have a reproducible test. The user has the possibility to change the min and max values with the --range option. During the validation process, the min/max/mean and standard deviation values for the different inputs and outputs are reported. The --seed option can be used to change the initial seed.
For example, to generate a batch of 20 input samples with the data uniformly distributed between -10 and 5:
$ stedgeai validate -m <model_file> --target stm32 --range -10 5 -b 20
...
Setting validation data...
generating random data, size=20, seed=42, range=(-10.0, 5.0)
I[1]: (20, 1, 1, 99)/float32, min/max=[-9.952, 4.996], mean/std=[-2.513, 4.383], input_0
No output/reference samples are provided
...
If the model has integer inputs ('int8' or 'uint8'), the min and max values are automatically inferred from the associated scale and zero-point values. This ensures a uniform distribution between the min/max values of the input data type (that is, [-128, 127] for the int8 type).
$ stedgeai validate -m <quant_model_file> --target <target>
...
Setting validation data...
generating random data, size=10, seed=42, range=default
I[1]: (10, 49, 40, 1)/uint8, min/max=[0, 255], mean/std=[127.323, 73.590], scale=0.10196070 zp=0, Reshape_1
No output/reference samples are provided
...
Classifier and regressor models
No specific metrics are defined for the regressor models. The ACC and CM metrics are only evaluated if the predicted outputs [ R’ ] (or [ R ]) may represent probabilities within a given tolerance. Nevertheless, the '--classifier' option can be used to force the computation of the ACC and CM metrics.
$ stedgeai validate -m <regressor_model_file> --target <target>
...
Evaluation report (summary)
-------------------------------------------------------------------------------------------------------
Mode acc rmse mae l2r tensor
-------------------------------------------------------------------------------------------------------
X-cross #1 n.a. 0.000000065 0.000000048 0.000000127 dense_2, ai_float, [(1, 1, 1)], m_id=[2]
-------------------------------------------------------------------------------------------------------
Input validation files
The user can provide the inputs and the associated ground truth or reference values in a single file (npz file) or in separate files (npy or csv files). During the import of the data, each array is reshaped according to the shape of the input (respectively output) of the c-model.
file format | description |
---|---|
csv | Text files with a flattened version of the input (or output) tensors. One sample per line is expected. The comma ',' separator is used to separate the values. |
npy | A simple binary numpy file with a single array, (batch-size, -1) shape. |
pb | A simple binary TensorProto file with a single array, (batch-size, -1) shape. Can be created with the tf.make_tensor_proto helper function. |
npz | A binary numpy file with several arrays. |
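As noted in the table above, a pb file can be created with the tf.make_tensor_proto helper function. The following minimal sketch is only an illustration (the file name and tensor content are arbitrary):

```python
import numpy as np
import tensorflow as tf

samples = np.random.rand(10, 28 * 28).astype(np.float32)   # (batch-size, -1) layout
proto = tf.make_tensor_proto(samples)
with open('inputs_0.pb', 'wb') as f:
    f.write(proto.SerializeToString())                      # serialized TensorProto
```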
For the npz file, the following pairs of keys (dict entries) are supported (see the snippet code in the FAQ article to generate a npz file from an image dataset):
- x_test and y_test (simple I/O)
- inputs and outputs (simple I/O)
- in_0 and out_0 (simple I/O)
- m_inputs and m_outputs (simple I/O)
- m_inputs_<idx> and m_outputs_<idx> (multiple I/O, idx starting with 1)
- c_inputs_<idx> and c_outputs_<idx> (multiple I/O, if no m_inputs_<idx> keys are defined)
- else, the xxx keys are considered as the multiple or simple inputs; consequently, no outputs are considered
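For example, a simple-I/O npz validation file can be created with numpy using one of the supported key pairs (a minimal sketch; shapes and contents are placeholders):

```python
import numpy as np

x_test = np.random.rand(100, 28, 28, 1).astype(np.float32)              # preprocessed inputs
y_test = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, 100)]    # one-hot references

np.savez('my_validation_data.npz', x_test=x_test, y_test=y_test)
# $ stedgeai validate -m <model_file> --target stm32 -vi my_validation_data.npz
```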
For the csv files, a particular tag ('dtype=uint8' or 'dtype=int8') is defined inside one of the first five comment lines to indicate the type of the data; otherwise, the float32 data type is used.
# Example of csv file (32-b float number)
# comment line
-1.076007485389709473e+00,6.278980255126953125e+00,.. 3.949900865554809570e+00
1.160453605651855469e+01,1.707991600036621094e+01,.. 1.334794044494628906e+00
...
# Example of csv file (uint8 number)
# dtype=uint8
50, 65, 71, 71
4.800000000000000000e+01,6.700000000000000000e+01,7.300000000000000000e+01,6.700000000000000000e+01
...
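A csv input file such as the examples above can be generated from a numpy array, with one flattened sample per line (a minimal sketch; the file name is arbitrary):

```python
import numpy as np

samples = np.random.rand(20, 28, 28, 1).astype(np.float32)
np.savetxt('inputs.csv', samples.reshape(samples.shape[0], -1), delimiter=',')
```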
Lines in the 'csv' file are always parsed as 32-bit float, and then the values are converted to the int8/uint8 type if requested. In case of models with multiple I/Os, exactly one csv file should be provided per input (respectively per output). The association of files with inputs and outputs is done according to the order in which the files are provided to the CLI; the names of the files are ignored. In case of a npz file, the index <idx> in the key name is used.
```
$ stedgeai validate -m <model_file> --target stm32 -vi inputs_one.csv inputs_two.csv -vo outputs.csv
```
When the user dataset is loaded, types and shapes are reported:
...
Setting validation data...
loading file: <output-directory-path>\network_val_io.npz
I[1]: (10, 49, 40, 1)/uint8, min/max=[0, 255], mean/std=[127.323, 73.590],
scale=0.10196070 zp=0, Reshape_1
O[1]: (10, 1, 1, 4)/uint8, min/max=[0, 255], mean/std=[63.925, 90.967],
scale=0.00390625 zp=0, nl_2_fmt_conv
...
Tip
Generated output validation files (<network>_io_data.npz file) from a previous “validate” command can be used as input files.
Output validation files for postprocessing
Inputs and predicted values are saved in separate files without any modifications, preserving their types and shapes. This enables the use of a postprocessing script to compute user-defined metrics.
<output-directory-path>\<name>_val_io.npz
<output-directory-path>\<name>_val_m_inputs_1.npy
<output-directory-path>\<name>_val_m_outputs_1.npy
...
The npz/npy files are the standard numpy binary format storing one or several arrays. For the npz file, the following key entries are used to store the data: 'm_inputs_<idx>', 'c_inputs_<idx>', 'm_outputs_<idx>' and 'c_outputs_<idx>'. There is one npy file per input and output tensor.
For quick and easy debugging purposes, csv files (text files) are also created for the inputs and outputs. However, the number of samples and data per sample is intentionally limited to reduce the execution time. By default, only the first 128 samples with a size smaller than 512 items are saved. If this limit is exceeded, the file is created but without any data. The '--save-csv' option can be used to force the creation of the csv files with all the data. The created csv file follows the format of the input validation files, and the file name is defined as follows:
<network_name>_[m,c]_[inputs,outputs]_<idx>.csv
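For example, to force the creation of the complete csv files:

```
$ stedgeai validate -m <model_file> --target stm32 --save-csv
```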
Warning
In the case where the --inputs-ch-position and --outputs-ch-position options are used to modify the layout of the deployed model, the stored data for the model are also transposed.
Metrics
Computational complexity: MACC and cycles/MACC
During the analysis step of the model, the overall computational complexity is displayed and logged as 'MACC' (refer to the “Analyze command” section). This metric corresponds to the number of multiply-and-accumulate operations requested to perform an inference. It is computed independently of the data format (floating-point or integer) and the underlying C-implementation. As illustrated in the report (“graph” section, table form), the value of the MACC can be detailed layer-per-layer according to the applied optimizations (fusing and/or folding processes).
The validation firmware allows on-device profiling and the measurement of the average number of CPU cycles requested for the whole C-model. Combining this information with the 'MACC' allows computing the number of cycles requested per MACC. This indicator highlights the global efficiency of the underlying C-implementation, including the hardware platform setting aspects. With the validation on-device, the execution time per layer is also evaluated.
...
Results for 10 inference(s) - average per inference
device : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
art_prefetch,art_icache,art_dcache
duration : 0.351ms
CPU cycles : 35062
cycles/MACC : 8.74
c_nodes : 6
...
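As a cross-check of the report above, the indicator is simply the ratio between the measured CPU cycles and the MACC complexity (a minimal sketch; the MACC value of ~4012 is implied by the numbers of this log, not shown in it):

```python
cpu_cycles = 35062      # average CPU cycles per inference (from the report)
clock_hz = 100e6        # STM32F411 @ 100 MHz
macc = 4012             # complexity reported by the analyze step (implied here)

print('duration    : {:.3f} ms'.format(cpu_cycles / clock_hz * 1e3))   # ~0.351 ms
print('cycles/MACC : {:.2f}'.format(cpu_cycles / macc))                # ~8.74
```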
Be aware that no theoretical relation is defined between the reported complexity and the real performance of the implemented C-model. This is due to the variability of the targeted environments: the target toolchain, the target device, the underlying memory subsystem settings, and the NN topologies and layers. It is difficult to provide offline an accurate number of CPU cycles/MACC for a particular target. However, out-of-the-box, the following rough estimations can be used for a 32-bit floating-point C-model.
STM32 series based on | cycles/MACC |
---|---|
Arm® Cortex®-M4 | ~9 |
Arm® Cortex®-M7 | ~6 |
For the quantized models, this factor can be approximately divided by 2 on average.
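These factors can be used for a rough, offline estimate of the inference time (a sketch under the assumptions above; real figures depend on the device, memory configuration, and toolchain):

```python
def estimate_inference_ms(macc, cycles_per_macc=9.0, clock_mhz=100.0, quantized=False):
    """Rough estimate: ~9 cycles/MACC (Cortex-M4), ~6 (Cortex-M7), roughly /2 if quantized."""
    factor = cycles_per_macc / 2.0 if quantized else cycles_per_macc
    return macc * factor / (clock_mhz * 1e3)

print(estimate_inference_ms(4012))                   # float32 model, Cortex-M4 @ 100 MHz
print(estimate_inference_ms(4012, quantized=True))   # quantized model, same device
```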
Memory-related metrics
Main contributors
To deploy a model in a resource-constrained runtime environment, the requested ROM and RAM sizes are the key factors. Two main memory contributors are directly reported through the fields 'weights (ro)' and 'activations (rw)' (refer to the “Analyze command” section). The first (also called ROM or FLASH) designates the size in bytes requested to store the weights/bias or other constant params. They are generally placed in a read-only memory-mapped segment like the embedded STM32 flash memory (also defined as the .rodata section). The second (also called RAM or activations buffer) designates the size in bytes requested to store the intermediate results (including optionally the buffers for the input and/or output tensors of the model). It can be considered as a private heap (or scratch buffer) only used by the C-runtime engine during an inference. It should be placed in a read-write and low-latency memory-mapped segment.
Tip
“AI buffers and privileged placement” section provides more details about the integration aspects.
Including the kernel code/data sizes
To have a complete view, by refining all AI contributors, the 'analyze' or 'generate' command can be used. The generated report provides more details about the memory usage requested to execute the AI stack, including the size of the data and code sections for the used kernels (part of the network runtime library) and the size of the generated C-structures and parameters for a given model (part of the generated network.c file).
In the following table, only the AI contributors are considered for each code/data section.
Requested memory size per segment ("stm32h7" series)
----------------------------- -------- ----------- -------- -----------
module text rodata data bss
----------------------------- -------- ----------- -------- -----------
network.o 5,504 89,572 32,056 1,192
NetworkRuntime730_CM7_GCC.a 32,128 0 0 0
network_data.o 56 48 88 0
lib (toolchain)* 1,300 624 0 0
----------------------------- -------- ----------- -------- -----------
RT total** 38,988 90,244 32,144 1,192
----------------------------- -------- ----------- -------- -----------
...
Summary per memory device type
-------------------------------------------------
.\device FLASH % RAM %
-------------------------------------------------
RT total 161,376 4.4% 33,336 2.3%
-------------------------------------------------
TOTAL 3,703,360 1,427,841
-------------------------------------------------
System heap and stack size
By construction, the kernels are designed not to use the system heap (no explicit call to malloc/free-like functions). If a minimal stack is requested to use the runtime, the aiSystemPerformance test application has been designed to determine the requested stack size.
Classification accuracy (acc)
For a classifier type, accuracy implies classification accuracy. ACC is the ratio between the correct predictions and the total number of inputs and can be used to evaluate the performance of a classifier model. On the contrary, if a regressor model is passed, the ACC is NOT calculated and the n.a. value is reported.
Note
No threshold is applied to determine if a given class is detected or not. Only the maximum value along the axis is used.
import numpy as np
from sklearn.metrics import accuracy_score

def acc(ref, pred):
    """Classification accuracy (ACC)."""
    return accuracy_score(np.argmax(ref, axis=1), np.argmax(pred, axis=1))
Root Mean Square Error (rmse)
RMSE is the average of the square of the difference between the reference values and the predicted values. Since it considers the square of the error, the effect of larger errors becomes more relevant than that of smaller errors, hence focusing more on the largest errors. RMSE is computed for the flattened array (element-wise along the array), returning a scalar value.
import numpy as np

def rmse(ref, pred):
    """Return Root Mean Squared Error (RMSE)."""
    return np.sqrt(((ref - pred).astype(np.float64) ** 2).mean())
Mean Absolute Error (mae)
MAE is the average of the absolute difference between the reference values and the predicted values. It gives a measure of how far the predictions are from the expected output. However, it does not provide information about the direction of the error, that is, whether data are underpredicted or overpredicted. MAE is computed for the flattened array (element-wise along the array), returning a scalar value.
import numpy as np

def mae(ref, pred):
    """Return Mean Absolute Error (MAE)."""
    return (np.abs(ref - pred).astype(np.float64)).mean()
L2 relative error (l2r)
L2r is the relative 2-norm or Euclidean distance between the reference values and the predicted values.
import numpy as np

def l2r(ref, pred):
    """Compute L2 relative error"""
    def magnitude(v):
        return np.sqrt(np.sum(np.square(v).flatten()))
    mag = magnitude(pred) + np.finfo(np.float32).eps
    return magnitude(ref - pred) / mag
Arithmetic mean of the error (mean)
mean (also called bias) is the arithmetic mean between the reference values and the predicted values.
import numpy as np

def mean(ref, pred):
    """Return the Arithmetic Mean (MEAN)."""
    return np.mean(ref - pred)
Standard deviation of the error (std)
std is the standard deviation between the reference values and the predicted values. It provides a measure of the spread of a distribution of the error.
import numpy as np

def std(ref, pred):
    """Return the Standard Deviation of the error (STD)."""
    return np.std(ref - pred)
Nash-Sutcliffe efficiency criteria (nse)
nse is a normalized statistic that determines the relative magnitude of the residual variance (“noise”) compared to the measured data variance (“information”).
import numpy as np

def nse(ref, pred):
    """Return Nash-Sutcliffe efficiency criteria (NSE)"""
    _mse = np.mean((ref - pred) ** 2)  # Mean Squared Error (MSE)
    return 1 - _mse / ((np.std(ref) ** 2) + np.finfo(np.float32).eps)
Cosine Similarity (cos)
cos is the measure of similarity between two signals, that is, the reference values and the predicted values. As for the nse metric, a value close to 0.99 indicates a high level of similarity.
import numpy as np

def _cosine(ref, pred):
    """Return Cosine Similarity (COS)"""
    err = np.dot(ref.flatten(), pred.flatten())
    err /= (np.linalg.norm(ref.flatten()) * np.linalg.norm(pred.flatten()))
    return err
Confusion matrix (CM)
When the model is considered as a classifier, a confusion matrix is displayed and logged for reference values vs. predicted values. Note that if a regressor type is passed, the confusion matrix is NOT computed.
8 classes (50 samples)
------------------------------------------------
C0   4   .   .   .   .   .   .   .
C1   .   9   .   .   .   .   .   .
C2   .   .   6   .   .   .   .   .
C3   .   .   .   7   .   .   .   .
C4   .   .   .   .   5   .   .   .
C5   .   .   .   .   .   6   .   .
C6   .   .   .   .   .   .   7   .
C7   .   .   .   .   .   .   .   6
Note
The confusion matrix is only displayed when the number of classes is lower or equal to 20.
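An equivalent matrix can be reproduced offline from the saved data with sklearn (a minimal sketch; the maximum value along the axis is used, as described in the note of the ACC metric):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def cm(ref, pred):
    """Confusion matrix between reference and predicted class indices."""
    return confusion_matrix(np.argmax(ref, axis=1), np.argmax(pred, axis=1))
```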
Interpretation of the results
This section provides typical examples to interpret the reported metrics. It is important to keep in mind the following points:
- The proposed built-in validation flow is mainly based on the comparison of the output predictions generated by the c-model (x86 and/or on-target C-runtime) and the values predicted by the original inference engine (that is, ONNX runtime, TFLite interpreter, or Keras predict method). Consequently, the precision is highly dependent on the underlying implementation, in particular on how intermediate results are accumulated and rounded.
- For more accurate results, it is recommended to consider a validation flow using pre- and post-processing and a representative validation dataset on top of the ST Edge AI Core network runtime library (host and/or on-target C-runtime). Refer to the “How to run locally a c-model” or “How to use the AiRunner package” articles.
- Random data with a given range for the input samples is usually not representative of the real distribution used to train the original model, for example because the data distribution between the channels is always considered as uniform.
- It is recommended to provide representative preprocessed data to apply the built-in validation flow.
Floating-point model
With random data
This section illustrates a typical example where a 32b floating-point model is validated with random data. The default range [0.0, 1.0[ is used here, because the image dataset has been normalized between 0 and 1.
- X-cross #1 results highlight that the predicted values of both models are close.
- X-cross #1 l2r=0.000001551 illustrates that with random inputs, the predicted values of the c-model are very close to the outputs generated by the Keras model interpreter. The acc=100% metric indicates that a given input sample is classified in the same way in both cases.
import tensorflow as tf
import numpy as np
H, W, C = 28, 28, 1
IN_SHAPE = (H, W, C)
NB_CLASSES = 10

def load_data_set():
    """Load the data"""
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Normalize the input image so that each pixel value is between 0 to 1.
    x_train = x_train / 255.0
    x_test = x_test / 255.0

    x_train = x_train.reshape(x_train.shape[0], H, W, C).astype(np.float32)
    x_test = x_test.reshape(x_test.shape[0], H, W, C).astype(np.float32)

    # convert class vectors to binary class matrices
    y_train = tf.keras.utils.to_categorical(y_train, NB_CLASSES)
    y_test = tf.keras.utils.to_categorical(y_test, NB_CLASSES)

    return x_train, y_train, x_test, y_test

def build_model():
    """Define the model"""
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=IN_SHAPE),
        tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model
$ stedgeai validate -m <model_fp32> --target <target>
..
Setting validation data...
generating random data, size=10, seed=42, range=default
I[1]: (10, 28, 28, 1)/float32, min/max=[0.000, 1.000], mean/std=[0.495, 0.289], input_0
No output/reference samples are provided
Running the STM AI c-model (AI RUNNER)...(name=network, mode=HOST)
...
Running the Keras model...
Saving validation data...
output directory: <output-directory-path>
creating <output-directory-path>\network_val_io.npz
m_outputs_1: (10, 1, 1, 10)/float32, min/max=[0.000, 0.975], mean/std=[0.100, 0.253], dense_nl
c_outputs_1: (10, 1, 1, 10)/float32, min/max=[0.000, 0.975], mean/std=[0.100, 0.253], dense_nl
Computing the metrics...
Cross accuracy report #1 (reference vs C-model)
----------------------------------------------------------------------------------------------------
notes: - the output of the reference model is used as ground truth/reference value
- 10 samples (10 items per sample)
acc=100.00%, rmse=0.000000422, mae=0.000000162, l2r=0.000001551
10 classes (10 samples)
----------------------------------------------------------
C0 0 . . . . . . . . .
C1 . 0 . . . . . . . .
C2 . . 9 . . . . . . .
C3 . . . 0 . . . . . .
C4 . . . . 0 . . . . .
C5 . . . . . 1 . . . .
C6 . . . . . . 0 . . .
C7 . . . . . . . 0 . .
C8 . . . . . . . . 0 .
C9 . . . . . . . . . 0
Evaluation report (summary)
---------------------------------- ... ----------- ... -------------------------------
Output acc rmse l2r tensor
---------------------------------- ... ----------- ... -------------------------------
X-cross #1 100.00% 0.000000422 ... 0.000001551 ... dense_nl, ai_float,...
---------------------------------- ... ----------- ... -------------------------------
Creating report file <output-directory-path>\network_validate_report.txt
With a representative dataset
When a representative user dataset is used (including the ground truth values), the ACC/CM metrics are also evaluated with the original and generated c-models.
- X-cross #1 results highlight that the predicted values of both models are close.
- X-cross #1 l2r=0.000000249, for example, indicates that with the real inputs, the predicted values of the c-model are very close to the outputs generated by the Keras model interpreter.
- However, the x86 C-model #1 and original model #1 lines show relatively important rmse/mae/l2r errors. They are due here to the encoding of the provided ground truth values, which are one-hot encoded: [0.0 0.0 1.0 .. 0.0]. These errors could be larger if no softmax layer is present.
$ stedgeai validate -m <model_fp32> --target stm32 -vi <data_directory>/data_reduced_test.npz
..
Setting validation data...
loading file: <data_directory>\data_reduced_test.npz
- samples are reshaped: (128, 28, 28, 1) -> (128, 28, 28, 1)
- samples are reshaped: (128, 10) -> (128, 1, 1, 10)
I[1]: (128, 28, 28, 1)/float32, min/max=[0.000, 1.000], mean/std=[0.139, 0.318], input_0
O[1]: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.300], dense_nl
Running the STM AI c-model (AI RUNNER)...(name=network, mode=x86)
...
Running the Keras model...
Saving validation data...
output directory:<output-directory-path>
creating <output-directory-path>\network_val_io.npz
m_outputs_1: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.288], dense_nl
c_outputs_1: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.288], dense_nl
Computing the metrics...
Accuracy report #1 for the generated x86 C-model
---------------------------------------------------------------------------------------------
notes: - computed against the provided ground truth values
- 128 samples (10 items per sample)
acc=97.66%, rmse=0.054929093, mae=0.010250041, l2r=0.180345476
10 classes (128 samples)
----------------------------------------------------------
C0 11 . . . 1 . . . . .
C1 . 19 . . . . . . . .
C2 . . 16 . . . . . . .
C3 . . . 11 . . . . . .
C4 . . . . 15 . . . . .
C5 . . . . . 7 . . . .
C6 . . . . . . 10 . . .
C7 . . . . . . . 9 . .
C8 . . . 1 . . . . 17 .
C9 1 . . . . . . . . 10
Accuracy report #1 for the reference model
---------------------------------------------------------------------------------------------
notes: - computed against the provided ground truth values
- 128 samples (10 items per sample)
acc=97.66%, rmse=0.054929089, mae=0.010250039, l2r=0.180345476
10 classes (128 samples)
----------------------------------------------------------
C0 11 . . . 1 . . . . .
C1 . 19 . . . . . . . .
C2 . . 16 . . . . . . .
C3 . . . 11 . . . . . .
C4 . . . . 15 . . . . .
C5 . . . . . 7 . . . .
C6 . . . . . . 10 . . .
C7 . . . . . . . 9 . .
C8 . . . 1 . . . . 17 .
C9 1 . . . . . . . . 10
Cross accuracy report #1 (reference vs C-model)
---------------------------------------------------------------------------------------------
notes: - the output of the reference model is used as ground truth/reference value
- 128 samples (10 items per sample)
acc=100.00%, rmse=0.000000076, mae=0.000000016, l2r=0.000000249
10 classes (128 samples)
----------------------------------------------------------
C0 12 . . . . . . . . .
C1 . 19 . . . . . . . .
C2 . . 16 . . . . . . .
C3 . . . 12 . . . . . .
C4 . . . . 16 . . . . .
C5 . . . . . 7 . . . .
C6 . . . . . . 10 . . .
C7 . . . . . . . 9 . .
C8 . . . . . . . . 17 .
C9 . . . . . . . . . 10
Evaluation report (summary)
----------------------------------------- ... ----------- ... ------------------------
Output acc rmse l2r tensor
----------------------------------------- ... ----------- ... ------------------------
x86 c-model #1 97.66% 0.054929093 ... 0.180345476 ... nl_4_fmt_conv, ai_i8,...
original model #1 97.66% 0.054929089 ... 0.180345476 ... nl_4_fmt_conv, ai_i8,...
X-cross #1 100.00% 0.000000076 ... 0.000000249 ... nl_4_fmt_conv, ai_i8,...
----------------------------------------- ... ----------- ... ------------------------
Creating report file <output-directory-path>\network_validate_report.txt
Without softmax layer
In the case where the last layer is not a softmax operator, the rmse/mae/l2r errors of the x86 C-model #1 and of the original model #1 are larger (for example, ~6.6+ here with another MNIST model). These errors are fully dependent on the values of the provided data. In this example, they are one-hot encoded ([0.0 0.0 1.0 .. 0.0]) but are directly compared to the predicted values, accumulating the significant differences. In this context, the acc and X-cross results are the more significant metrics. Note that the --classifier option is used to force the computation of the ACC/CM metrics.
$ stedgeai validate -m <float_model_wo_softmax> --target stm32 -vi mnist_reduced_test.npz --classifier
...
Evaluation report (summary)
---------------------------------------------------------------------------------------------------
Mode acc rmse mae l2r tensor
---------------------------------------------------------------------------------------------------
x86 C-model #1 97.66% 6.643473148 5.626701355 0.992283940 dense_2_dense, ai_float,..
original model #1 97.66% 6.643473148 5.626701355 0.992283940 dense_2_dense, ai_float,..
X-cross #1 100.00% 0.000003622 0.000002494 0.000000541 dense_2_dense, ai_float,..
---------------------------------------------------------------------------------------------------
...
To avoid this situation, it is preferable for a regressor model to provide the “real” predicted values (as generated during the test of the trained model). In this case, the ACC/CM metrics are not significant.
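Such a reference file with the “real” predicted values can be produced with the original framework, for example with Keras (a minimal sketch; the model and input file names are hypothetical, and the npz keys follow the “Input validation files” section):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model_wo_softmax.h5')   # hypothetical file name
x_test = np.load('x_test.npy')                               # preprocessed inputs

np.savez('mnist_reduced_predicted_test_.npz',
         m_inputs_1=x_test,
         m_outputs_1=model.predict(x_test))
```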
$ stedgeai validate -m <float_model_wo_softmax> --target stm32 -vi mnist_reduced_predicted_test_.npz
..
Evaluation report (summary)
---------------------------------------------------------------------------------------------------
Mode acc rmse mae l2r tensor
---------------------------------------------------------------------------------------------------
x86 C-model #1 n.a. 0.000003622 0.000002494 0.000000541 dense_2_dense, ai_float,..
original model #1 n.a. 0.000000000 0.000000000 0.000000000 dense_2_dense, ai_float,..
X-cross #1 n.a. 0.000003622 0.000002494 0.000000541 dense_2_dense, ai_float,..
---------------------------------------------------------------------------------------------------
Quantized model
The previous floating-point model has been quantized with a representative subset of the training dataset. As the tf.int8 option has been defined for the conversion of the inputs and the outputs, the provided user data are automatically quantized before feeding the models. The predicted values are dequantized before comparing them to the provided ground truth values, which are one-hot encoded. As for the validation of the floating-point model, the same comments about the provided metrics are applicable.
def tflite_convert(keras_model, data):
    """Quantize a Keras model (post-training quantization)"""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    shape_in = (1,) + IN_SHAPE

    def rep_data_gen():
        for i in data[0:100]:
            f = np.reshape(i, shape_in)
            tensor = tf.convert_to_tensor(f, tf.float32)
            yield [tensor]

    converter.representative_dataset = rep_data_gen
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    return converter.convert()
$ stedgeai validate -m <quantized_model> --target stm32 -vi <data_directory>/mnist_reduced_test.npz
..
Setting validation data...
loading file: <data_directory>\mnist_reduced_test.npz
- samples are reshaped: (128, 28, 28, 1) -> (128, 28, 28, 1)
- samples are reshaped: (128, 10) -> (128, 1, 1, 10)
I[1]: (128, 28, 28, 1)/int8, min/max=[-128, 127], mean/std=[-92.544, 81.165],
scale=0.00392157 zp=-128, input_1
O[1]: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.300],
nl_4_fmt_conv
Running the STM AI c-model (AI RUNNER)...(name=network, mode=x86)
...
Running the TFlite model...
Saving validation data...
output directory: <output-directory-path>
creating <output-directory-path>\network_val_io.npz
m_outputs_1: (128, 1, 1, 10)/int8, min/max=[-128, 127], mean/std=[-102.458, 73.581],
nl_4_fmt_conv
c_outputs_1: (128, 1, 1, 10)/int8, min/max=[-128, 127], mean/std=[-102.458, 73.581],
scale=0.00390625 zp=-128, nl_4_fmt_conv
Computing the metrics...
Accuracy report #1 for the generated x86 C-model
------------------------------------------------------------------------------------
notes: - data type is different: r/float32 instead p/int8
- p/int8 data are dequantized with s=0.003906 zp=-128
- computed against the provided ground truth values
- 128 samples (10 items per sample)
acc=97.66%, rmse=0.054981638, mae=0.010229493, l2r=0.180712447
10 classes (128 samples)
----------------------------------------------------------
C0 11 . . . 1 . . . . .
C1 . 19 . . . . . . . .
C2 . . 16 . . . . . . .
C3 . . . 11 . . . . . .
C4 . . . . 15 . . . . .
C5 . . . . . 7 . . . .
C6 . . . . . . 10 . . .
C7 . . . . . . . 9 . .
C8 . . . 1 . . . . 17 .
C9 1 . . . . . . . . 10
Accuracy report #1 for the reference model
------------------------------------------------------------------------------------
notes: - data type is different: r/float32 instead p/int8
- p/int8 data are dequantized with s=0.003906 zp=-128
- computed against the provided ground truth values
- 128 samples (10 items per sample)
acc=97.66%, rmse=0.054981638, mae=0.010229493, l2r=0.180712447
10 classes (128 samples)
----------------------------------------------------------
C0 11 . . . 1 . . . . .
C1 . 19 . . . . . . . .
C2 . . 16 . . . . . . .
C3 . . . 11 . . . . . .
C4 . . . . 15 . . . . .
C5 . . . . . 7 . . . .
C6 . . . . . . 10 . . .
C7 . . . . . . . 9 . .
C8 . . . 1 . . . . 17 .
C9 1 . . . . . . . . 10
Cross accuracy report #1 (reference vs C-model)
-------------------------------------------------------------------------------------
notes: - r/int8 data are dequantized with s=0.003906 zp=-128
- p/int8 data are dequantized with s=0.003906 zp=-128
- the output of the reference model is used as ground truth/reference value
- 128 samples (10 items per sample)
acc=100.00%, rmse=0.000000000, mae=0.000000000, l2r=0.000000000
10 classes (128 samples)
----------------------------------------------------------
C0 12 . . . . . . . . .
C1 . 19 . . . . . . . .
C2 . . 16 . . . . . . .
C3 . . . 12 . . . . . .
C4 . . . . 16 . . . . .
C5 . . . . . 7 . . . .
C6 . . . . . . 10 . . .
C7 . . . . . . . 9 . .
C8 . . . . . . . . 17 .
C9 . . . . . . . . . 10
Evaluation report (summary)
----------------------------------------- ... ----------- ... ------------------------
Output acc rmse l2r tensor
----------------------------------------- ... ----------- ... ------------------------
x86 c-model #1 97.66% 0.054981638 ... 0.180712447 ... nl_4_fmt_conv, ai_i8,...
original model #1 97.66% 0.054981638 ... 0.180712447 ... nl_4_fmt_conv, ai_i8,...
X-cross #1 100.00% 0.000000000 ... 0.000000000 ... nl_4_fmt_conv, ai_i8,...
----------------------------------------- ... ----------- ... ------------------------
Creating report file <output-directory-path>\network_validate_report.txt
Model with multiple I/O
No specific process is applied. All metrics are calculated independently for each output.
$ stedgeai validate -m <model_with_2_outputs> --target stm32 -vi test_multiple_io.npz
...
Evaluation report (summary)
-------------------------------------------------------------------------------------------------
Mode acc rmse mae l2r tensor
-------------------------------------------------------------------------------------------------
x86 C-model #1 n.a. 0.000000000 0.000000000 0.000000000 add_1, ai_float, ..
x86 C-model #2 n.a. 0.000000000 0.000000000 0.000000000 multiply_1, ai_float, ..
original model #1 n.a. 0.000000000 0.000000000 0.000000000 add_1, ai_float, ..
original model #2 n.a. 0.000000000 0.000000000 0.000000000 multiply_1, ai_float, ..
X-cross #1 n.a. 0.000000000 0.000000000 0.000000000 add_1, ai_float, ..
X-cross #2 n.a. 0.000000000 0.000000000 0.000000000 multiply_1, ai_float, ..
-------------------------------------------------------------------------------------------------
X-cross (l2r) #1 error : 0.00000000e+00 (expected to be < 0.01)
Postprocessing example
This section provides a typical custom Python script example to read the generated data and to build new element-wise metrics: variance, f1_score, and so on (thanks to the numpy and sklearn.metrics Python modules). acc, rmse, mae, and l2r are also provided to illustrate how these metrics are computed.
$ python custom_metrics.py
Read reference NPZ file "mnist_test.npz"...
Read generated NPZ file "./st_ai_output\network_val_io.npz"...
Evaluation report
--------------------------------------------------------------------------------
acc rmse mae var f1_score l2r
--------------------------------------------------------------------------------
C-model 98.4% 0.064472 0.013081 0.004160 0.984033 0.21363260
original model 98.4% 0.064472 0.013081 0.004160 0.984033 0.21363260
X-cross 100.0% 0.000000 0.000000 0.000000 1.000000 0.00000000
--------------------------------------------------------------------------------
precision recall f1-score support
c0 0.92 1.00 0.96 12
c1 1.00 1.00 1.00 19
c2 1.00 1.00 1.00 16
c3 0.92 1.00 0.96 11
c4 1.00 1.00 1.00 15
c5 1.00 1.00 1.00 7
c6 1.00 1.00 1.00 10
c7 1.00 1.00 1.00 9
c8 1.00 0.94 0.97 18
c9 1.00 0.91 0.95 11
accuracy 0.98 128
macro avg 0.98 0.99 0.98 128
weighted avg 0.99 0.98 0.98 128
Full code of the custom Python script.
# -*- coding: utf-8 -*-
"""
Implement custom metrics
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
OUTPUT_DIR = './st_ai_output'
NETWORK_NAME = 'network'
REFERENCE_NPZ = 'mnist_test.npz'

# metrics
def mse(ref, pred):
    """Return Mean Squared Error (MSE)."""
    return ((ref - pred).astype(np.float64) ** 2).mean()

def rmse(ref, pred):
    """Return Root Mean Squared Error (RMSE)."""
    return np.sqrt(((ref - pred).astype(np.float64) ** 2).mean())

def mae(ref, pred):
    """Return Mean Absolute Error (MAE)."""
    return (np.abs(ref - pred).astype(np.float64)).mean()

def var(ref, pred):
    """Return Variance"""
    return np.var((ref - pred), dtype=np.float64, ddof=1)

def acc(ref, pred):
    """Classification accuracy (ACC)."""
    return accuracy_score(np.argmax(ref, axis=1), np.argmax(pred, axis=1))

def f1_s(ref, pred, average='macro'):
    """Compute the F1 score, also known as balanced F-score or F-measure (F1)"""
    return f1_score(np.argmax(ref, axis=1), np.argmax(pred, axis=1), average=average)

def l2r(ref, pred):
    """Compute L2 relative error"""
    def magnitude(v):
        return np.sqrt(np.sum(np.square(v).flatten()))
    mag = magnitude(pred) + np.finfo(np.float32).eps
    return magnitude(ref - pred) / mag

# Read reference values
fname = REFERENCE_NPZ
print('Read reference NPZ file "{}"...'.format(fname))
arrays = np.load(fname)

i_ref = arrays['x_test']
r_ref = arrays['y_test']

# Read the generated inputs and predicted samples (original & C models)
fname = os.path.join(OUTPUT_DIR, NETWORK_NAME + '_val_io.npz')
print('Read generated NPZ file "{}"...'.format(fname))
arrays = np.load(fname)

i_ = arrays['m_inputs_1']
ic_ = arrays['c_inputs_1']
p_ = arrays['c_outputs_1']
pm_ = arrays['m_outputs_1']

# calculate metrics
def build_metrics(ref, pred):
    res = {}
    res['acc'] = acc(ref, pred)
    res['var'] = var(ref, pred)
    res['f1_score'] = f1_s(ref, pred)
    res['rmse'] = rmse(ref, pred)
    res['mae'] = mae(ref, pred)
    res['mse'] = mse(ref, pred)
    res['l2r'] = l2r(ref, pred)
    return res

def print_metrics(name, ref, pred):
    res = build_metrics(ref, pred)
    msg = '{:15s}'.format(name)
    _acc = '{:.1f}%'.format(res['acc'] * 100.0)
    msg += '{:10s}'.format(_acc)
    msg += '{:.6f}   '.format(res['rmse'])
    msg += '{:.6f}   '.format(res['mae'])
    msg += '{:.6f}   '.format(res['var'])
    msg += '{:.6f}   '.format(res['f1_score'])
    msg += '{:.8f} '.format(res['l2r'])
    print(msg)

# Reshape the outputs to be aligned
p_ = p_.reshape(r_ref.shape)
pm_ = pm_.reshape(r_ref.shape)

# Log the results
print('\nEvaluation report')
print('-' * 80)
print('                {:8s}  {:8s}  {:8s}  {:8s}  {:8s}  {:8s}'.format('acc', 'rmse',
      'mae', 'var', 'f1_score', 'l2r'))
print('-' * 80)
print_metrics('C-model', r_ref, p_)
print_metrics('original model', r_ref, pm_)
print_metrics('X-cross', pm_, p_)
print('-' * 80)
print('')

target_names = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9']
print(classification_report(np.argmax(r_ref, axis=1), np.argmax(p_, axis=1),
                            target_names=target_names))