Evaluation report and metrics

ST Edge AI Core Technology 2.2.0

r2.6

Purpose

This article describes the different metrics (and the associated computing flow) used to evaluate the accuracy of the generated C-files (or C-model), primarily through the validate command. The proposed metrics should be considered as generic indicators: they allow a numerical comparison of the predictions of the C-model against the predictions of the original model. Only simple scalar values are computed; no specific thresholds or pre/post-processing are used. For postprocess evaluation, all injected and predicted data (including the data from the original model) are saved. The end user can also execute the C-model locally for an efficient, advanced, or customized validation flow.

Warning

Be aware that the underlying validation engine is not designed or optimized, in terms of execution time and host resource usage, to validate a pretrained model as done during a training/test phase. A small but representative subset of the whole training dataset is expected to test the accuracy of the generated C-model running on the desktop/host or in the on-target environment.

Evaluated metrics

 metric    category     description                          applicable to
 --------- ------------ ------------------------------------ --------------------------------------------------
 MACC      complexity   computational complexity             all models
 ROM/RAM   memory       memory-related metrics               all models
 ACC       perf         Classification accuracy              only classifier models (float and integer format)
 RMSE      perf         Root Mean Square Error               all models
 MAE       perf         Mean Absolute Error                  all models
 L2R       perf         L2 Relative Error                    all models
 MEAN      perf         Arithmetic mean of the error         all models
 STD       perf         Standard deviation of the error      all models
 NSE       perf         Nash-Sutcliffe efficiency criteria   all models
 COS       perf         Cosine Similarity                    all models
 CM        perf         Confusion Matrix                     only classifier models (float and integer format)

Computation of the metrics

The MACC and ROM/RAM metrics are computed during the import of the model. The other metrics are evaluated during the validation process (that is, the validate command). By default, no user data are requested since the models are fed with random data. However, the user can also provide a representative preprocessed dataset (with or without references). A dedicated file always saves the raw input and output data, which a postprocess user script can use.

Validation flow - Computation of the metrics (default mode with random data)
Validation flow - Computation of the metrics (with user data)
  • [ I ] designates the list of the preprocessed samples (or inputs) which are used to feed the original model and the C-model. It can be provided by the user (see the “Input validation files” and “Specific attention on the provided data” sections) or randomly generated (see the “Random data generation” section). They are used as-is, without preprocessing, with a potential exception for quantized models.
  • [ P ] designates the list of the predicted samples inferred by the C-model.
  • [ R’ ] designates the list of the predicted samples inferred by the original model.
  • {optional} [ R ] designates the list of the predicted output samples provided by the user. They are used as the ground truth or reference values.

At the end of the process, metrics are summarized in a simple table.

 Evaluation report (summary)
 -----------------------------------------------------------------------------------------------------------------------------------------
 Output              acc       rmse        mae         l2r         mean        std         nse         cos         tensor
 -----------------------------------------------------------------------------------------------------------------------------------------
 HOST c-model #1     100.00%   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   1.0000000   1.0000000   nl_3, (4,), m_id=[3]
 original model #1   100.00%   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   1.0000000   1.0000000   nl_3, (4,), m_id=[3]
 X-cross #1          100.00%   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   1.0000000   1.0000000   nl_3, (4,), m_id=[3]
 -----------------------------------------------------------------------------------------------------------------------------------------
  • X-cross #1 indicates the metrics which are evaluated with the [ P ] and [ R’ ] data for the first output #1 (there is one line per output). In this case, the predicted values [ R’ ] are considered as the references.
  • HOST [or target] C-model (respectively original model) designates the case where [ R ] data are also provided by the user, allowing the metrics to be computed against these references or ground truth values. If [ R ] is not provided, only the X-cross #1 metrics are computed.
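
As an illustration, the report lines above correspond to the same metrics applied to different pairs of data. A minimal numpy sketch (using the rmse helper defined later in this article, with dummy arrays standing in for the collected data):

import numpy as np

def rmse(ref, pred):
  """Root Mean Squared Error between two flattened arrays."""
  return np.sqrt(((ref - pred).astype(np.float64) ** 2).mean())

# dummy data standing in for the collected arrays (shapes/values are illustrative only)
R = np.eye(4, dtype=np.float32)    # [ R ]  user-provided references (ground truth)
Rp = R + 1e-4                      # [ R' ] predictions of the original model
P = R + 2e-4                       # [ P ]  predictions of the C-model

print('X-cross        :', rmse(Rp, P))   # "X-cross" line: [ R' ] is used as the reference
print('HOST c-model   :', rmse(R, P))    # only computed when [ R ] is provided
print('original model :', rmse(R, Rp))   # only computed when [ R ] is provided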

When the generated c-model is executed on the device, the same built-in validation flow is applied.

Validation flow - Computation of the metric (with target)

Data collect

At the end of the validation flow, all input and generated data are saved, allowing specific postprocess analysis (see the “Postprocessing example” section).

Validation flow - Data collect

The stored data are the raw data [ I ] used to feed (without preprocessing) the C-model ('c_inputs_') and the original model ('m_inputs_'). The generated predictions [ P ] ('c_outputs_') and the generated references [ R’ ] ('m_outputs_') are also stored. The format of the generated files ('npz/csv') is described in the “Input validation files” section.

And for the quantized models?

No specific metrics are defined for quantized models: the same metrics are computed whether the model is quantized or not, and whether the inputs/outputs are in integer format or not. Be aware that the metrics are always computed with the float32 data type. If the data type is int8 (or uint8), the data are dequantized beforehand; the same scale/zero-point values from the original model are used for all the data: [ P ], [ R’ ] and [ R ]. Symmetrically, if float32 data are provided for an integer input, they are quantized before feeding the model.

Validation flow - Computation of the metrics (with quantized model)
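
For illustration, a minimal sketch of the affine (de)quantization applied before the metrics are computed, assuming the usual r = scale * (q - zero_point) convention used by the quantized models:

import numpy as np

def dequantize(q, scale, zero_point):
  """Map int8/uint8 values back to float32: r = scale * (q - zero_point)."""
  return scale * (q.astype(np.float32) - zero_point)

def quantize(r, scale, zero_point, dtype=np.int8):
  """Map float32 values to the integer domain of the model input."""
  info = np.iinfo(dtype)
  q = np.round(r / scale) + zero_point
  return np.clip(q, info.min, info.max).astype(dtype)

# example with the scale/zero-point of a typical int8 output tensor
scale, zp = 0.00390625, -128
q_out = np.array([-128, 0, 127], dtype=np.int8)
print(dequantize(q_out, scale, zp))   # -> [0.0, 0.5, 0.99609375]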

Note

If uint8 data are provided while the quantized model expects the int8 or float32 type, an exception is raised: the data importer is not able to convert the data to the expected format.

“no-exec-model” option

The '--no-exec-model' option allows performing the built-in validation flow without executing the original model. It can be used when a particular model is imported and generated but the associated inference engine embedded in the pack is older or not compatible. The other advantage is a shorter execution time when only the data predicted by the c-model are requested.

Validation flow - Computation of the metrics (--no-exec-model)

Specific attention on the provided data

To have an accurate built-in validation process and more significant metrics, it is important to feed the original model and the generated c-model with data as close as possible to the validation dataset used to test the imported model. If the raw dataset has been preprocessed, the user should build a representative dataset from a subset of these preprocessed data.

Representative dataset creation

This recommendation also applies to the output data, since the computation of the metrics is based on element-wise operations between the provided references and the predicted values. For a classifier (one or multiple classes), for example, one-hot encoded data can be provided, while integer encoding is not supported (see the sketch below).
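
As a minimal sketch, integer-encoded labels can be converted to the expected one-hot representation with numpy before being stored in the reference file:

import numpy as np

NB_CLASSES = 10

# integer-encoded labels (not supported as-is by the validation engine)
y_int = np.array([3, 0, 7, 1])

# one-hot encoded references expected by the element-wise metrics
y_one_hot = np.eye(NB_CLASSES, dtype=np.float32)[y_int]
print(y_one_hot[0])   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]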

The following figure illustrates a typical example where particular attention is required. The STM32Cube function pack for computer vision (https://www.st.com/en/embedded-software/fp-ai-vision1.html) provides an advanced debug mode which allows injecting or dumping the preprocessed images. If the user wants to validate their pretrained model with the ST Edge AI Core validation engine and compare it with the deployed model, they must ensure that the format and the applied preprocessing are similar in order to obtain the most accurate validation metrics. In this use case, to account for the preprocessing done by the pack, it is recommended to create a set of representative preprocessed images to test different models offline or to refine an existing pretrained model.

Validation against the FP-Vision

Random data generation and “--range/--seed” options

When no user data are provided, random data are generated. By default, the values are uniformly distributed with a fixed seed in the range [0.0, 1.0[ to provide a reproducible test. The user can change the min and max values with the --range option. During the validation process, the min/max/mean and standard deviation values for the different inputs and outputs are reported. The --seed option can be used to change the initial seed.

For example, to generate a batch of 20 input samples with data uniformly distributed between -10 and 5:

$ stedgeai validate -m <model_file> --target stm32 --range -10 5 -b 20
...
Setting validation data...
 generating random data, size=20, seed=42, range=(-10.0, 5.0)
 I[1]: (20, 1, 1, 99)/float32, min/max=[-9.952, 4.996], mean/std=[-2.513, 4.383], input_0
 No output/reference samples are provided
...
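
Conceptually, the generated batch is equivalent to something like the following numpy sketch (the exact internal generator is not documented here; this is only an illustrative assumption):

import numpy as np

batch, in_shape = 20, (1, 1, 99)        # illustrative shapes, matching the log above
rng = np.random.default_rng(42)         # fixed seed for a reproducible test
data = rng.uniform(-10.0, 5.0, size=(batch,) + in_shape).astype(np.float32)
print(data.min(), data.max(), data.mean(), data.std())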

If the model has integer inputs ('int8' or 'uint8'), the min and max values are automatically inferred from the associated scale and zero-point values. This ensures a uniform distribution across the full range of the input data type (that is, [-128, 127] for the int8 type).

$ stedgeai validate -m <quant_model_file> --target <target>
...
Setting validation data...
 generating random data, size=10, seed=42, range=default
 I[1]: (10, 49, 40, 1)/uint8, min/max=[0, 255], mean/std=[127.323, 73.590], scale=0.10196070 zp=0, Reshape_1
 No output/reference samples are provided
...
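
For an integer input, the reported min/max simply span the full range of the data type; the corresponding float range can be recovered from the scale/zero-point values, as in this small sketch:

import numpy as np

def float_range(dtype, scale, zero_point):
  """Float interval covered by the full integer range of an input tensor."""
  info = np.iinfo(dtype)
  return scale * (info.min - zero_point), scale * (info.max - zero_point)

# values from the log above: uint8 input, scale=0.10196070, zp=0
print(float_range(np.uint8, 0.10196070, 0))   # ~(0.0, 26.0)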

Classifier and regressor models

No specific metrics are defined for regressor models. The ACC and CM metrics are only evaluated if the predicted outputs [ R’ ] (or [ R ]) can represent probabilities within a given tolerance. Nevertheless, the '--classifier' option can be used to force the computation of the ACC and CM metrics.

$ stedgeai validate -m <regressor_model_file> --target <target>
...
Evaluation report (summary)
-------------------------------------------------------------------------------------------------------
Mode         acc    rmse          mae           l2r           tensor
-------------------------------------------------------------------------------------------------------
X-cross #1   n.a.   0.000000065   0.000000048   0.000000127   dense_2, ai_float, [(1, 1, 1)], m_id=[2]
-------------------------------------------------------------------------------------------------------

Input validation files

The user can provide the inputs and the associated ground truth or reference values in a single file (npz file) or in separate files (npy or csv files). During the import of the data, each array is reshaped according to the shape of the input (respectively output) of the c-model.

 file format   description
 ------------- --------------------------------------------------------------------------------------------------
 csv           Text file with a flattened version of the input (or output) tensors. One sample per line is
               expected; the comma ',' separator is used to separate the values.
 npy           Simple binary numpy file with a single array of (batch-size, -1) shape.
 pb            Simple binary TensorProto file with a single array of (batch-size, -1) shape. Can be created with
               the tf.make_tensor_proto helper function.
 npz           Binary numpy file with several arrays.

For the npz file, the following pairs of keys (dict entries) are supported (see the code snippet in the FAQ article to generate an npz file from an image dataset; a minimal example is also given after this list):

  • x_test and y_test (simple I/O)
  • inputs and outputs (simple I/O)
  • in_0 and out_0 (simple I/O)
  • m_inputs and m_outputs (simple I/O)
  • m_inputs_<idx> and m_outputs_<idx> (multiple I/O, idx starting with 1)
  • c_inputs_<idx> and c_outputs_<idx> (multiple I/O, if no m_inputs_<idx> keys are defined)
  • otherwise, the remaining keys are considered as the simple or multiple inputs; consequently, no outputs are considered
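
A minimal sketch showing how such an npz file could be created with numpy (file names, shapes, and array contents are illustrative only):

import numpy as np

# illustrative preprocessed inputs and one-hot references (shapes are examples only)
x_test = np.random.rand(10, 28, 28, 1).astype(np.float32)
y_test = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, size=10)]

# simple I/O case: any of the supported key pairs can be used
np.savez('my_val_data.npz', x_test=x_test, y_test=y_test)

# multiple I/O case: one key per input/output, index starting at 1, for example:
# np.savez('my_val_data_2io.npz', m_inputs_1=in_1, m_inputs_2=in_2,
#          m_outputs_1=out_1, m_outputs_2=out_2)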

For the csv files, a specific tag ('dtype=uint8' or 'dtype=int8') can be defined inside one of the first five comment lines to indicate the data type; otherwise, the float32 data type is used.

# Example of csv file (32-b float number)
# comment line
-1.076007485389709473e+00,6.278980255126953125e+00,.. 3.949900865554809570e+00
1.160453605651855469e+01,1.707991600036621094e+01,.. 1.334794044494628906e+00
...
# Example of csv file (uint8 number)
# dtype=uint8
50, 65, 71, 71
4.800000000000000000e+01,6.700000000000000000e+01,7.300000000000000000e+01,6.700000000000000000e+01
...

Lines in the 'csv' file are always parsed as 32-bit float and the values are then converted to the int8/uint8 type if requested. For models with multiple I/Os, exactly one csv file must be provided per input (respectively per output). Files are associated with the inputs and outputs according to the order in which they are provided to the CLI; the file names themselves are ignored. For an npz file, the index <idx> in the key name is used.

$ stedgeai validate -m <model_file> --target stm32 -vi inputs_one.csv inputs_two.csv -vo outputs.csv
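
A flattened csv input file following this format could, for example, be generated with numpy (file names and shapes are illustrative and match the command above):

import numpy as np

# 20 float32 samples of 99 values each, flattened to one sample per line
data = np.random.rand(20, 99).astype(np.float32)
np.savetxt('inputs_one.csv', data, delimiter=',')

# uint8 samples: the dtype tag is placed in one of the first comment lines
q_data = np.random.randint(0, 256, size=(20, 99))
np.savetxt('inputs_two.csv', q_data, fmt='%d', delimiter=',',
           header='dtype=uint8', comments='# ')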

When the user dataset is loaded, types and shapes are reported:

...
Setting validation data...
  loading file: <output-directory-path>\network_val_io.npz
  I[1]: (10, 49, 40, 1)/uint8, min/max=[0, 255], mean/std=[127.323, 73.590],
              scale=0.10196070 zp=0, Reshape_1
  O[1]: (10, 1, 1, 4)/uint8, min/max=[0, 255], mean/std=[63.925, 90.967],
              scale=0.00390625 zp=0, nl_2_fmt_conv
...

Tip

Generated output validation files (<network>_val_io.npz file) from a previous “validate” command can be used as input files.

Output validation files for postprocessing

Inputs and predicted values are saved in separate files without any modifications, preserving their types and shapes. This enables the use of a postprocessing script to compute user-defined metrics.

<output-directory-path>\<name>_val_io.npz
<output-directory-path>\<name>_val_m_inputs_1.npy
<output-directory-path>\<name>_val_m_outputs_1.npy
...

The npz/npy files are the standard numpy binary format storing one or several arrays. For the npz file, the following key entries are used to store the data: 'm_inputs_<idx>', 'c_inputs_<idx>', 'm_outputs_<idx>' and 'c_outputs_<idx>'. There is one npy file per input and output tensor.

For quick and easy debugging purposes, csv files (text files) are also created for input and output. However, the number of samples and data per sample is intentionally limited to reduce execution time. By default, only the first 128 samples with a size smaller than 512 items are saved. If this limit is exceeded, the file is created but without any data. The '--save-csv' option can be used to force the creation of the csv files with all the data. The created csv file follows the format of the input validation files, and the file name is defined as follows:

<network_name>_[m,c]_[inputs,outputs]_<idx>.csv

Warning

In the case where the --inputs-ch-position and --outputs-ch-position options are used to modify the layout of the deployed model, the stored data for the model are also transposed.

Metrics

Computational complexity: MACC and cycles/MACC

During the analysis step of the model, the overall computational complexity is displayed and logged as 'MACC' (refer to the “Analyze command” section). This metric corresponds to the number of multiply-and-accumulate operations requested to perform an inference. It is computed independently of the data format (floating-point or integer) or the underlying C-implementation. As illustrated in the report (“graph” section, table form), the MACC value can be detailed layer by layer according to the applied optimizations (fusing and/or folding processes).

The validation firmware allows on-device profiling and measuring the average number of CPU cycles requested for the whole C-model. Combining this information with the 'MACC' allows computing the number of cycles requested per MACC. This indicator highlights the global efficiency of the underlying C-implementation, including the hardware platform setting aspects. With the on-device validation, the execution time per layer is also evaluated.

...
Results for 10 inference(s) - average per inference
 device              : 0x431 - STM32F411xC/E @100/100MHz fpu,art_lat=3,
                       art_prefetch,art_icache,art_dcache
 duration            : 0.351ms
 CPU cycles          : 35062
 cycles/MACC         : 8.74
 c_nodes             : 6
...

Be aware that no theoretical relation is defined between the reported complexity and the real performance of the implemented C-model. This is due to the variability of the targeted environments: target toolchains, target device, underlying memory subsystem settings, NN topologies and layers. It is therefore difficult to provide offline an accurate number of CPU cycles/MACC for a particular target. However, out of the box, the following rough estimations can be used for a 32-bit floating-point C-model.

 STM32 series based on   cycles/MACC
 ----------------------- -----------
 Arm® Cortex®-M4         ~9
 Arm® Cortex®-M7         ~6

For the quantized models, this factor can be approximately divided by 2 on average.
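
These rough figures can be used for a quick offline estimate of the inference time; a small sketch, assuming a float32 C-model and example values for the complexity and clock frequency:

# rough estimate: duration ~ MACC x cycles/MACC / CPU frequency
macc = 100_000          # complexity reported by the analyze command (example value)
cycles_per_macc = 9     # ~9 for an Arm Cortex-M4, float32 C-model (see table above)
cpu_freq_hz = 80e6      # 80 MHz core clock (example value)

duration_ms = macc * cycles_per_macc / cpu_freq_hz * 1e3
print('~{:.2f} ms per inference'.format(duration_ms))   # ~11.25 ms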

Memory-related metrics

Main contributors

To deploy a model in a resource-constrained runtime environment, the requested ROM and RAM sizes are the key factors. Two main memory contributors are directly reported through the 'weights (ro)' and 'activations (rw)' fields (refer to the “Analyze command” section).

  • 'weights (ro)' (also called ROM or FLASH) designates the size in bytes requested to store the weights/bias or other constant parameters. They are generally placed in a read-only memory-mapped segment, such as the embedded STM32 flash memory (also defined as the .rodata section).
  • 'activations (rw)' (also called RAM or activations buffer) designates the size in bytes requested to store the intermediate results, optionally including the buffers for the input and/or output tensors of the model. It can be considered as a private heap (or scratch buffer) only used by the C-runtime engine during an inference. It should be placed in a read-write and low-latency memory-mapped segment.

Tip

“AI buffers and privileged placement” section provides more details about the integration aspects.

Including the kernel code/data sizes

To have a complete view refining all AI contributors, the 'analyze' or 'generate' command can be used. The generated report provides more details about the memory required to execute the AI stack, including the size of the data and code sections for the used kernels (part of the network runtime library) and the size of the generated C-structures and parameters for a given model (part of the generated network.c file).

In the following table, only the AI contributors are considered for each code/data section.

 Requested memory size per segment ("stm32h7" series)
 ----------------------------- -------- ----------- -------- -----------
 module                            text      rodata     data         bss
 ----------------------------- -------- ----------- -------- -----------
 network.o                        5,504      89,572   32,056       1,192
 NetworkRuntime730_CM7_GCC.a     32,128           0        0           0
 network_data.o                      56          48       88           0
 lib (toolchain)*                 1,300         624        0           0
 ----------------------------- -------- ----------- -------- -----------
 RT total**                      38,988      90,244   32,144       1,192
 ----------------------------- -------- ----------- -------- -----------
...

  Summary per memory device type
  -------------------------------------------------
  .\device         FLASH      %         RAM      %
  -------------------------------------------------
  RT total       161,376   4.4%      33,336   2.3%
  -------------------------------------------------
  TOTAL        3,703,360          1,427,841
  -------------------------------------------------

System heap and stack size

By construction, the kernels are designed not to use the system heap (no explicit call to malloc/free-like functions). While a minimal stack is required to use the runtime, the aiSystemPerformance test application has been designed to measure the requested stack size.

Classification accuracy (acc)

For a classifier, accuracy means classification accuracy. ACC is the ratio of correct predictions to the total number of inputs and can be used to evaluate the performance of the classifier model. Conversely, if a regressor model is passed, the ACC is NOT calculated and the n.a. value is reported.

Note

No threshold is applied to determine if a given class is detected or not. Only the maximum value along the axis is used.

import numpy as np
from sklearn.metrics import accuracy_score

def acc(ref, pred):
  """Classification accuracy (ACC)."""
  return accuracy_score(np.argmax(ref, axis=1), np.argmax(pred, axis=1))

Root Mean Square Error (rmse)

RMSE is the square root of the average of the squared differences between the reference values and the predicted values. Since it considers the square of the error, larger errors weigh more than smaller ones, hence focusing more on the largest errors. RMSE is computed on the flattened array (element-wise along the array), returning a scalar value.

import numpy as np

def rmse(ref, pred):
  """Return Root Mean Squared Error (RMSE)."""
  return np.sqrt(((ref - pred).astype(np.float64) ** 2).mean())

Mean Absolute Error (mae)

MAE is the average of the absolute differences between the reference values and the predicted values. It gives a measure of how far the predictions are from the expected output. However, it does not provide information about the direction of the error, that is, whether the data are under- or over-predicted. MAE is computed on the flattened array (element-wise along the array), returning a scalar value.

import numpy as np

def mae(ref, pred):
  """Return Mean Absolute Error (MAE)."""
  return (np.abs(ref - pred).astype(np.float64)).mean()

L2 relative error (l2r)

L2r is the relative 2-norm or Euclidean distance between the reference values and the predicted values.

import numpy as np

def l2r(ref, pred):
  """Compute L2 relative error"""
  def magnitude(v):
    return np.sqrt(np.sum(np.square(v).flatten()))
  mag = magnitude(pred) + np.finfo(np.float32).eps
  return magnitude(ref - pred) / mag

Arithmetic mean of the error (mean)

mean (also called bias) is the arithmetic mean of the error between the reference values and the predicted values.

import numpy as np

def mean(ref, pred):
  """Return the Arithmetic Mean (MEAN)."""
  return np.mean(ref - pred)

Standard deviation of the error (std)

std is the standard deviation of the error between the reference values and the predicted values. It provides a measure of the spread of the error distribution.

import numpy as np

def std(ref, pred):
  """Return the Standard Deviation of the error (STD)."""
  return np.std(ref - pred)

Nash-Sutcliffe efficiency criteria (nse)

nse is a normalized statistic that determines the relative magnitude of the residual variance (“noise”) compared to the measured data variance (“information”).

import numpy as np

def nse(ref, pred):
  """Return Nash-Sutcliffe efficiency criteria (NSE)."""
  _mse = np.mean((ref - pred) ** 2)  # Mean Squared Error (MSE)
  return 1 - _mse / ((np.std(ref) ** 2) + np.finfo(np.float32).eps)

Cosine Similarity (cos)

cos is the measure of similarity between the two signals, that is, the reference values and the predicted values. As for the nse metric, a value close to 1.0 (for example, 0.99 or above) indicates a high level of similarity.

import numpy as np

def _cosine(ref, pred):
  """Return Cosine Similarity (COS)"""
  err = np.dot(ref.flatten(), pred.flatten())
  err /= (np.linalg.norm(ref.flatten()) * np.linalg.norm(pred.flatten()))
  return err

Confusion matrix (CM)

When the model is considered as a classifier, a confusion matrix is displayed and logged for the reference values vs. the predicted values. Note that if a regressor type is passed, the confusion matrix is NOT computed.

8 classes (50 samples)
------------------------------------------------
C0         4    .    .    .    .    .    .    .
C1         .    9    .    .    .    .    .    .
C2         .    .    6    .    .    .    .    .
C3         .    .    .    7    .    .    .    .
C4         .    .    .    .    5    .    .    .
C5         .    .    .    .    .    6    .    .
C6         .    .    .    .    .    .    7    .
C7         .    .    .    .    .    .    .    6

Note

The confusion matrix is only displayed when the number of classes is less than or equal to 20.
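
A hedged sketch of how such a confusion matrix can be reproduced offline from the saved data, using sklearn (the built-in implementation may differ):

import numpy as np
from sklearn.metrics import confusion_matrix

def cm(ref, pred):
  """Confusion matrix between the reference and predicted class indexes."""
  return confusion_matrix(np.argmax(ref, axis=1), np.argmax(pred, axis=1))

# illustrative one-hot references and predicted probabilities (4 classes, 6 samples)
ref = np.eye(4, dtype=np.float32)[[0, 1, 2, 3, 1, 2]]
pred = ref + 0.01 * np.random.rand(6, 4).astype(np.float32)
print(cm(ref, pred))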

Interpretation of the results

This section provides typical examples showing how to interpret the reported metrics. The points highlighted in each example should be kept in mind.

Floating-point model

With random data

This section illustrates a typical example where a 32-bit floating-point model is validated with random data. The default range [0.0, 1.0[ is used here because the image dataset has been normalized between 0 and 1.

  • X-cross #1 results highlight that the predicted values of both models are close.
  • X-cross #1 l2r=0.000001551 illustrates that with random inputs, the predicted values of the c-model are very close to the outputs generated by the Keras model interpreter.
  • acc=100% metric indicates that a given input sample is classified in the same way in both cases.
import tensorflow as tf
import numpy as np

H, W, C = 28, 28, 1
IN_SHAPE = (H, W, C)
NB_CLASSES = 10

def load_data_set():
  """Load the data"""

  mnist = tf.keras.datasets.mnist
  (x_train, y_train), (x_test, y_test) = mnist.load_data()

  # Normalize the input image so that each pixel value is between 0 to 1.
  x_train = x_train / 255.0
  x_test = x_test / 255.0

  x_train = x_train.reshape(x_train.shape[0], H, W, C).astype(np.float32)
  x_test = x_test.reshape(x_test.shape[0], H, W, C).astype(np.float32)

  # convert class vectors to binary class matrices
  y_train = tf.keras.utils.to_categorical(y_train, NB_CLASSES)
  y_test = tf.keras.utils.to_categorical(y_test, NB_CLASSES)

  return x_train, y_train, x_test, y_test

def build_model():
  """Define the model"""

  model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=IN_SHAPE),
    tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
  ])

  return model
$ stedgeai validate -m <model_fp32> --target <target>
..
Setting validation data...
 generating random data, size=10, seed=42, range=default
 I[1]: (10, 28, 28, 1)/float32, min/max=[0.000, 1.000], mean/std=[0.495, 0.289], input_0
 No output/reference samples are provided

Running the STM AI c-model (AI RUNNER)...(name=network, mode=HOST)
...

Running the Keras model...

Saving validation data...
 output directory: <output-directory-path>
 creating <output-directory-path>\network_val_io.npz
 m_outputs_1: (10, 1, 1, 10)/float32, min/max=[0.000, 0.975], mean/std=[0.100, 0.253], dense_nl
 c_outputs_1: (10, 1, 1, 10)/float32, min/max=[0.000, 0.975], mean/std=[0.100, 0.253], dense_nl

Computing the metrics...

 Cross accuracy report #1 (reference vs C-model)
 ----------------------------------------------------------------------------------------------------
 notes: - the output of the reference model is used as ground truth/reference value
        - 10 samples (10 items per sample)

  acc=100.00%, rmse=0.000000422, mae=0.000000162, l2r=0.000001551

  10 classes (10 samples)
  ----------------------------------------------------------
  C0        0    .    .    .    .    .    .    .    .    .
  C1        .    0    .    .    .    .    .    .    .    .
  C2        .    .    9    .    .    .    .    .    .    .
  C3        .    .    .    0    .    .    .    .    .    .
  C4        .    .    .    .    0    .    .    .    .    .
  C5        .    .    .    .    .    1    .    .    .    .
  C6        .    .    .    .    .    .    0    .    .    .
  C7        .    .    .    .    .    .    .    0    .    .
  C8        .    .    .    .    .    .    .    .    0    .
  C9        .    .    .    .    .    .    .    .    .    0


Evaluation report (summary)
---------------------------------- ... ----------- ... -------------------------------
Output       acc       rmse            l2r             tensor
---------------------------------- ... ----------- ... -------------------------------
X-cross #1   100.00%   0.000000422 ... 0.000001551 ...  dense_nl, ai_float,...
---------------------------------- ... ----------- ... -------------------------------

Creating report file <output-directory-path>\network_validate_report.txt

With a representative dataset

When a representative user dataset is used (including the ground truth values), the ACC/CM metrics are also evaluated with the original and generated c-models.

  • X-cross #1 results highlight that the predicted values of both models are close.
  • X-cross #1 l2r=0.000000249 for example, indicates that with the real inputs, the predicted values of the c-model are very close to the outputs generated by the Keras model interpreter.
  • However, the x86 C-model #1 and original model #1 lines show relatively large rmse/mae/l2r errors. They are due here to the encoding of the provided ground truth values, which are one-hot encoded: [0.0 0.0 1.0 .. 0.0]. These errors can be larger if no softmax layer is present.
$ stedgeai validate -m <model_fp32> --target stm32 -vi <data_directory>/data_reduced_test.npz
..
Setting validation data...
 loading file: <data_directory>\data_reduced_test.npz
 - samples are reshaped: (128, 28, 28, 1) -> (128, 28, 28, 1)
 - samples are reshaped: (128, 10) -> (128, 1, 1, 10)
 I[1]: (128, 28, 28, 1)/float32, min/max=[0.000, 1.000], mean/std=[0.139, 0.318], input_0
 O[1]: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.300], dense_nl

Running the STM AI c-model (AI RUNNER)...(name=network, mode=x86)
...

Running the Keras model...

Saving validation data...
 output directory:<output-directory-path>
 creating <output-directory-path>\network_val_io.npz
 m_outputs_1: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.288], dense_nl
 c_outputs_1: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.288], dense_nl


Computing the metrics...

 Accuracy report #1 for the generated x86 C-model
 ---------------------------------------------------------------------------------------------
 notes: - computed against the provided ground truth values
        - 128 samples (10 items per sample)

  acc=97.66%, rmse=0.054929093, mae=0.010250041, l2r=0.180345476

  10 classes (128 samples)
  ----------------------------------------------------------
  C0       11    .    .    .    1    .    .    .    .    .
  C1        .   19    .    .    .    .    .    .    .    .
  C2        .    .   16    .    .    .    .    .    .    .
  C3        .    .    .   11    .    .    .    .    .    .
  C4        .    .    .    .   15    .    .    .    .    .
  C5        .    .    .    .    .    7    .    .    .    .
  C6        .    .    .    .    .    .   10    .    .    .
  C7        .    .    .    .    .    .    .    9    .    .
  C8        .    .    .    1    .    .    .    .   17    .
  C9        1    .    .    .    .    .    .    .    .   10



 Accuracy report #1 for the reference model
 ---------------------------------------------------------------------------------------------
 notes: - computed against the provided ground truth values
        - 128 samples (10 items per sample)

  acc=97.66%, rmse=0.054929089, mae=0.010250039, l2r=0.180345476

  10 classes (128 samples)
  ----------------------------------------------------------
  C0       11    .    .    .    1    .    .    .    .    .
  C1        .   19    .    .    .    .    .    .    .    .
  C2        .    .   16    .    .    .    .    .    .    .
  C3        .    .    .   11    .    .    .    .    .    .
  C4        .    .    .    .   15    .    .    .    .    .
  C5        .    .    .    .    .    7    .    .    .    .
  C6        .    .    .    .    .    .   10    .    .    .
  C7        .    .    .    .    .    .    .    9    .    .
  C8        .    .    .    1    .    .    .    .   17    .
  C9        1    .    .    .    .    .    .    .    .   10


 Cross accuracy report #1 (reference vs C-model)
 ---------------------------------------------------------------------------------------------
 notes: - the output of the reference model is used as ground truth/reference value
        - 128 samples (10 items per sample)

  acc=100.00%, rmse=0.000000076, mae=0.000000016, l2r=0.000000249

  10 classes (128 samples)
  ----------------------------------------------------------
  C0       12    .    .    .    .    .    .    .    .    .
  C1        .   19    .    .    .    .    .    .    .    .
  C2        .    .   16    .    .    .    .    .    .    .
  C3        .    .    .   12    .    .    .    .    .    .
  C4        .    .    .    .   16    .    .    .    .    .
  C5        .    .    .    .    .    7    .    .    .    .
  C6        .    .    .    .    .    .   10    .    .    .
  C7        .    .    .    .    .    .    .    9    .    .
  C8        .    .    .    .    .    .    .    .   17    .
  C9        .    .    .    .    .    .    .    .    .   10


Evaluation report (summary)
----------------------------------------- ... ----------- ... ------------------------
Output              acc       rmse             l2r            tensor
----------------------------------------- ... ----------- ... ------------------------
x86 c-model #1      97.66%    0.054929093 ... 0.180345476 ... nl_4_fmt_conv, ai_i8,...
original model #1   97.66%    0.054929089 ... 0.180345476 ... nl_4_fmt_conv, ai_i8,...
X-cross #1          100.00%   0.000000076 ... 0.000000249 ... nl_4_fmt_conv, ai_i8,...
----------------------------------------- ... ----------- ... ------------------------

Creating report file <output-directory-path>\network_validate_report.txt

Without softmax layer

In the case where the last layer is not a softmax operator, the rmse/mae/l2r errors of the x86 C-model #1 and original model #1 lines are larger (for example, ~6.6 here with another MNIST model). These errors fully depend on the values of the provided data. In this example, they are one-hot encoded ([0.0 0.0 1.0 .. 0.0]) but are directly compared to the predicted values, accumulating significant differences. In this context, the acc and X-cross results are the more significant metrics. Note that the --classifier option is used to force the computation of the ACC/CM metrics.

$ stedgeai validate -m <float_model_wo_softmax> --target stm32 -vi mnist_reduced_test.npz --classifier
...
Evaluation report (summary)
---------------------------------------------------------------------------------------------------
Mode                acc       rmse          mae           l2r           tensor
---------------------------------------------------------------------------------------------------
x86 C-model #1      97.66%    6.643473148   5.626701355   0.992283940   dense_2_dense, ai_float,..
original model #1   97.66%    6.643473148   5.626701355   0.992283940   dense_2_dense, ai_float,..
X-cross #1          100.00%   0.000003622   0.000002494   0.000000541   dense_2_dense, ai_float,..
---------------------------------------------------------------------------------------------------
...

To avoid this situation, for a regressor model it is preferable to provide the “real” predicted values (as generated during the test of the trained model), as in the sketch below. In this case, the ACC/CM metrics are not significant.
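
A minimal sketch showing how such a reference file could be built from the trained model, reusing the load_data_set and build_model helpers of the floating-point example above (the model is assumed to be trained; names and the number of samples are illustrative):

import numpy as np

# reuse the helpers of the floating-point example above
# (the model is assumed to be trained before this step)
_, _, x_test, _ = load_data_set()
model = build_model()
y_pred = model.predict(x_test[:128])     # "real" predicted values of the trained model

np.savez('mnist_reduced_predicted_test_.npz',
         x_test=x_test[:128].astype(np.float32),
         y_test=y_pred.astype(np.float32))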

$ stedgeai validate -m <float_model_wo_softmax> --target stm32 -vi mnist_reduced_predicted_test_.npz
..
Evaluation report (summary)
---------------------------------------------------------------------------------------------------
Mode                acc       rmse          mae           l2r           tensor
---------------------------------------------------------------------------------------------------
x86 C-model #1      n.a.      0.000003622   0.000002494   0.000000541   dense_2_dense, ai_float,..
original model #1   n.a.      0.000000000   0.000000000   0.000000000   dense_2_dense, ai_float,..
X-cross #1          n.a.      0.000003622   0.000002494   0.000000541   dense_2_dense, ai_float,..
---------------------------------------------------------------------------------------------------

Quantized model

The previous floating-point model has been quantized with a representative subset of the training dataset. As the tf.int8 option has been defined for the conversion of the inputs and the outputs, the provided user data are automatically quantized before feeding the models. The predicted values are dequantized before being compared to the provided ground truth values, which are one-hot encoded. As for the validation of the floating-point model, the same comments about the provided metrics apply.

def tflite_convert(keras_model, data):
  """Quantize a Keras model (post-training quantization)"""

  converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
  shape_in = (1,) + IN_SHAPE

  def rep_data_gen():
    for i in data[0:100]:
        f = np.reshape(i, shape_in)
        tensor = tf.convert_to_tensor(f, tf.float32)
        yield [tensor]

  converter.representative_dataset = rep_data_gen
  converter.optimizations = [tf.lite.Optimize.DEFAULT]
  converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

  converter.inference_input_type = tf.int8
  converter.inference_output_type = tf.int8

  return converter.convert()
$ stedgeai validate -m <quantized_model> --target stm32 -vi <data_directory>/mnist_reduced_test.npz
..
Setting validation data...
 loading file: <data_directory>\mnist_reduced_test.npz
 - samples are reshaped: (128, 28, 28, 1) -> (128, 28, 28, 1)
 - samples are reshaped: (128, 10) -> (128, 1, 1, 10)
 I[1]: (128, 28, 28, 1)/int8, min/max=[-128, 127], mean/std=[-92.544, 81.165],
       scale=0.00392157 zp=-128, input_1
 O[1]: (128, 1, 1, 10)/float32, min/max=[0.000, 1.000], mean/std=[0.100, 0.300],
       nl_4_fmt_conv

Running the STM AI c-model (AI RUNNER)...(name=network, mode=x86)
...

Running the TFlite model...

Saving validation data...
 output directory: <output-directory-path>
 creating <output-directory-path>\network_val_io.npz
 m_outputs_1: (128, 1, 1, 10)/int8, min/max=[-128, 127], mean/std=[-102.458, 73.581],
              nl_4_fmt_conv
 c_outputs_1: (128, 1, 1, 10)/int8, min/max=[-128, 127], mean/std=[-102.458, 73.581],
              scale=0.00390625 zp=-128, nl_4_fmt_conv

Computing the metrics...

 Accuracy report #1 for the generated x86 C-model
 ------------------------------------------------------------------------------------
 notes: - data type is different: r/float32 instead p/int8
        - p/int8 data are dequantized with s=0.003906 zp=-128
        - computed against the provided ground truth values
        - 128 samples (10 items per sample)

  acc=97.66%, rmse=0.054981638, mae=0.010229493, l2r=0.180712447

  10 classes (128 samples)
  ----------------------------------------------------------
  C0       11    .    .    .    1    .    .    .    .    .
  C1        .   19    .    .    .    .    .    .    .    .
  C2        .    .   16    .    .    .    .    .    .    .
  C3        .    .    .   11    .    .    .    .    .    .
  C4        .    .    .    .   15    .    .    .    .    .
  C5        .    .    .    .    .    7    .    .    .    .
  C6        .    .    .    .    .    .   10    .    .    .
  C7        .    .    .    .    .    .    .    9    .    .
  C8        .    .    .    1    .    .    .    .   17    .
  C9        1    .    .    .    .    .    .    .    .   10


 Accuracy report #1 for the reference model
 ------------------------------------------------------------------------------------
 notes: - data type is different: r/float32 instead p/int8
        - p/int8 data are dequantized with s=0.003906 zp=-128
        - computed against the provided ground truth values
        - 128 samples (10 items per sample)

  acc=97.66%, rmse=0.054981638, mae=0.010229493, l2r=0.180712447

  10 classes (128 samples)
  ----------------------------------------------------------
  C0       11    .    .    .    1    .    .    .    .    .
  C1        .   19    .    .    .    .    .    .    .    .
  C2        .    .   16    .    .    .    .    .    .    .
  C3        .    .    .   11    .    .    .    .    .    .
  C4        .    .    .    .   15    .    .    .    .    .
  C5        .    .    .    .    .    7    .    .    .    .
  C6        .    .    .    .    .    .   10    .    .    .
  C7        .    .    .    .    .    .    .    9    .    .
  C8        .    .    .    1    .    .    .    .   17    .
  C9        1    .    .    .    .    .    .    .    .   10


 Cross accuracy report #1 (reference vs C-model)
 -------------------------------------------------------------------------------------
 notes: - r/int8 data are dequantized with s=0.003906 zp=-128
        - p/int8 data are dequantized with s=0.003906 zp=-128
        - the output of the reference model is used as ground truth/reference value
        - 128 samples (10 items per sample)

  acc=100.00%, rmse=0.000000000, mae=0.000000000, l2r=0.000000000

  10 classes (128 samples)
  ----------------------------------------------------------
  C0       12    .    .    .    .    .    .    .    .    .
  C1        .   19    .    .    .    .    .    .    .    .
  C2        .    .   16    .    .    .    .    .    .    .
  C3        .    .    .   12    .    .    .    .    .    .
  C4        .    .    .    .   16    .    .    .    .    .
  C5        .    .    .    .    .    7    .    .    .    .
  C6        .    .    .    .    .    .   10    .    .    .
  C7        .    .    .    .    .    .    .    9    .    .
  C8        .    .    .    .    .    .    .    .   17    .
  C9        .    .    .    .    .    .    .    .    .   10


Evaluation report (summary)
----------------------------------------- ... ----------- ... ------------------------
Output              acc       rmse             l2r            tensor
----------------------------------------- ... ----------- ... ------------------------
x86 c-model #1      97.66%    0.054981638 ... 0.180712447 ... nl_4_fmt_conv, ai_i8,...
original model #1   97.66%    0.054981638 ... 0.180712447 ... nl_4_fmt_conv, ai_i8,...
X-cross #1          100.00%   0.000000000 ... 0.000000000 ... nl_4_fmt_conv, ai_i8,...
----------------------------------------- ... ----------- ... ------------------------

Creating report file <output-directory-path>\network_validate_report.txt

Model with multiple I/O

No specific process is applied. All metrics are calculated independently for each output.

$ stedgeai validate -m <model_with_2_outputs> --target stm32 -vi test_multiple_io.npz
...
Evaluation report (summary)
-------------------------------------------------------------------------------------------------
Mode                acc    rmse          mae           l2r           tensor
-------------------------------------------------------------------------------------------------
x86 C-model #1      n.a.   0.000000000   0.000000000   0.000000000   add_1, ai_float, ..
x86 C-model #2      n.a.   0.000000000   0.000000000   0.000000000   multiply_1, ai_float, ..
original model #1   n.a.   0.000000000   0.000000000   0.000000000   add_1, ai_float, ..
original model #2   n.a.   0.000000000   0.000000000   0.000000000   multiply_1, ai_float, ..
X-cross #1          n.a.   0.000000000   0.000000000   0.000000000   add_1, ai_float, ..
X-cross #2          n.a.   0.000000000   0.000000000   0.000000000   multiply_1, ai_float, ..
-------------------------------------------------------------------------------------------------

X-cross (l2r) #1 error : 0.00000000e+00 (expected to be < 0.01)

Postprocessing example

This section provides a typical custom Python script example to read the generated data and build new element-wise metrics: variance, f1_score, and so on (thanks to the numpy and sklearn.metrics Python modules). The acc, rmse, mae, and l2r metrics are also provided to illustrate how they are computed.

$ python custom_metrics.py
Read reference NPZ file "mnist_test.npz"...
Read generated NPZ file "./st_ai_output\network_val_io.npz"...

Evaluation report
--------------------------------------------------------------------------------
               acc       rmse      mae       var       f1_score  l2r
--------------------------------------------------------------------------------
C-model        98.4%     0.064472  0.013081  0.004160  0.984033  0.21363260
original model 98.4%     0.064472  0.013081  0.004160  0.984033  0.21363260
X-cross        100.0%    0.000000  0.000000  0.000000  1.000000  0.00000000
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

          c0       0.92      1.00      0.96        12
          c1       1.00      1.00      1.00        19
          c2       1.00      1.00      1.00        16
          c3       0.92      1.00      0.96        11
          c4       1.00      1.00      1.00        15
          c5       1.00      1.00      1.00         7
          c6       1.00      1.00      1.00        10
          c7       1.00      1.00      1.00         9
          c8       1.00      0.94      0.97        18
          c9       1.00      0.91      0.95        11

    accuracy                           0.98       128
   macro avg       0.98      0.99      0.98       128
weighted avg       0.99      0.98      0.98       128

Full code of the custom Python script.

# -*- coding: utf-8 -*-
"""
Implement custom metrics
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

OUTPUT_DIR = './st_ai_output'
NETWORK_NAME = 'network'
REFERENCE_NPZ = 'mnist_test.npz'

# metrics

def mse(ref, pred):
  """Return Mean Squared Error (MSE)."""
  return ((ref - pred).astype(np.float64) ** 2).mean()

def rmse(ref, pred):
  """Return Root Mean Squared Error (RMSE)."""
  return np.sqrt(((ref - pred).astype(np.float64) ** 2).mean())

def mae(ref, pred):
  """Return Mean Absolute Error (MAE)."""
  return (np.abs(ref - pred).astype(np.float64)).mean()

def var(ref, pred):
  """Return Variance"""
  return np.var((ref - pred), dtype=np.float64, ddof=1)

def acc(ref, pred):
  """Classification accuracy (ACC)."""
  return accuracy_score(np.argmax(ref, axis=1), np.argmax(pred, axis=1))

def f1_s(ref, pred, average='macro'):
  """Compute the F1 score, also known as balanced F-score or F-measure (F1)"""
  return f1_score(np.argmax(ref, axis=1), np.argmax(pred, axis=1), average=average)

def l2r(ref, pred):
  """Compute L2 relative error"""
  def magnitude(v):
    return np.sqrt(np.sum(np.square(v).flatten()))
  mag = magnitude(pred) + np.finfo(np.float32).eps
  return magnitude(ref - pred) / mag


# Read reference values

fname = REFERENCE_NPZ
print('Read reference NPZ file "{}"...'.format(fname))
arrays = np.load(fname)

i_ref = arrays['x_test']
r_ref = arrays['y_test']

# Read the generated inputs and predicted samples (original & C models)

fname = os.path.join(OUTPUT_DIR, NETWORK_NAME + '_val_io.npz')
print('Read generated NPZ file "{}"...'.format(fname))
arrays = np.load(fname)

i_  = arrays['m_inputs_1']
ic_  = arrays['c_inputs_1']
p_  = arrays['c_outputs_1']
pm_ = arrays['m_outputs_1']

# calculate metrics

def build_metrics(ref, pred):
  res = {}
  res['acc'] = acc(ref, pred)
  res['var'] = var(ref, pred)
  res['f1_score'] = f1_s(ref, pred)
  res['rmse'] = rmse(ref, pred)
  res['mae'] = mae(ref, pred)
  res['mse'] = mse(ref, pred)
  res['l2r'] = l2r(ref, pred)
  return res

def print_metrics(name, ref, pred):
  res = build_metrics(ref, pred)
  str = '{:15s}'.format(name)
  _acc = '{:.1f}%'.format(res['acc'] * 100.0)
  str += '{:10s}'.format(_acc)
  str += '{:.6f}  '.format(res['rmse'])
  str += '{:.6f}  '.format(res['mae'])
  str += '{:.6f}  '.format(res['var'])
  str += '{:.6f}  '.format(res['f1_score'])
  str += '{:.8f}  '.format(res['l2r'])
  print(str)

# Reshape the outputs to be aligned

p_ = p_.reshape(r_ref.shape)
pm_ = pm_.reshape(r_ref.shape)

# Log the results

print('\nEvaluation report')
print('-'*80)
print('               {:8s}  {:8s}  {:8s}  {:8s}  {:8s}  {:8s}'.format('acc', 'rmse',
          'mae', 'var', 'f1_score', 'l2r'))
print('-'*80)
print_metrics('C-model', r_ref, p_)
print_metrics('original model', r_ref, pm_)
print_metrics('X-cross', pm_, p_)
print('-'*80)

print('')
target_names = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9']
print(classification_report(np.argmax(r_ref, axis=1), np.argmax(p_, axis=1),
                            target_names=target_names))