MediaTek NPU
Introduction
The MediaTek AI Processing Unit (APU) is a high-performance hardware engine designed for deep learning, optimized for both bandwidth and power efficiency. Its architecture features a combination of big, small, and tiny cores, making it ideal for a wide range of modern applications, including AI-powered camera functions, virtual assistants, and OS or in-app enhancements.
The latest APU 5.0 introduces a cluster AI architecture, delivering up to 3.6 TOPS.
Overview
On the Genio 700/510 platforms, various software solutions are available to accelerate AI computing on the GPU and APU.
GPU Neural Network Acceleration
TensorFlow Lite with hardware acceleration is available to facilitate the development and deployment of machine learning models. By utilizing TensorFlow Lite Delegates, hardware acceleration for TFLite models can be enabled through on-device accelerators such as the GPU and Digital Signal Processor (DSP).
IoT Yocto comes pre-integrated with two key delegates:
- GPU Delegate: This uses OpenGL ES compute shaders on the device to perform inference on TFLite models; a quick smoke test follows this list.
- Arm NN Delegate: Arm NN is an open-source software library designed to support machine learning on Arm hardware. It acts as a bridge between neural network frameworks and Cortex-A CPUs or Arm Mali GPUs.
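As a minimal smoke test of GPU-delegate inference, the benchmark_model tool shipped with the image can be pointed at a TFLite model with the --use_gpu flag (a sketch only; the model path assumes the label_image sample data used in the benchmarks later in this article):
# benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=true --num_runs=10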
APU Neural Network Acceleration
MediaTek's proprietary machine learning solution, NeuroPilot, is available on IoT Yocto for Genio 510/700 platforms.
NeuroPilot is a suite of software tools and APIs that forms the core of MediaTek’s AI ecosystem. It allows developers to efficiently build and deploy AI applications on edge devices, speeding up AI processes while ensuring data privacy.
On the Genio 510/700 platform, both online and offline inference paths are supported through the NeuroPilot suite:
- Neuron Stable Delegate: This is MediaTek's Neuron Delegate, implemented using the TensorFlow Lite Stable Delegate interface.
- Neuron SDK: A tool suite that includes the Neuron compiler (ncc-tflite), which converts TFLite models into MediaTek-proprietary binaries (DLA, Deep Learning Archive) for deployment on MediaTek platforms. This results in optimized models with reduced latency and memory usage. The Neuron SDK also provides the Neuron Runtime API, which can be called from C/C++ programs to create a runtime environment, parse compiled models, and perform on-device inference.
Clea OS
Prerequisites
- One of the following SECO SoMs:
- E58 Genio 700 (CPU/GPU/APU support).
- E58 Genio 510 (CPU/GPU/APU support).
- A compatible Carrier Board.
- Read the Build a Reference Image with Yocto Project article.
Adding APU recipes to Reference Images for Yocto Project
Cloning the Clea OS BSP repository
In an empty directory, use git-repo to obtain the latest version of the BSP.
Create a folder for the project:
$ mkdir -p ~/projects/clea-os
$ cd ~/projects/clea-os
Initialize the manifest environment and synchronize the repositories:
$ repo init -u https://git.seco.com/clea-os/seco-manifest.git -b kirkstone
$ repo sync -j$(nproc) --fetch-submodules --no-clone-bundle
Define the configuration that you need to build (e.g. 'seco_smarc_e58_genio700_clea-os_embedded_wayland', which refers to the Clea OS Embedded image for the E58 Genio 700 board):
$ . ./seco-setup.sh -d seco_smarc_e58_genio700_clea-os_embedded_wayland
$ . ./seco-setup.sh -c
Getting APU
APU support is provided by the meta-tensorflow and meta-nn Yocto layers.
The next steps expect the current directory to be ~/projects/clea-os/layers.
Clone the meta-tensorflow and meta-nn Git repositories into your project layers directory:
$ git clone https://git.yoctoproject.org/meta-tensorflow/ && cd meta-tensorflow
$ git checkout ec20e19319f9eb89ceadc04923cc5bc75e865692 && cd ..
$ git clone -b rity-kirkstone-v23.2 https://gitlab.com/mediatek/aiot/rity/meta-nn
Adding the recipes to the distribution
Add meta-tensorflow and meta-nn to the set of layers included in the project:
$ cd ~/projects/clea-os/build_e58_genio700_embedded_wayland
$ echo 'BBLAYERS:append = " ${BSPDIR}/layers/meta-tensorflow "' >> conf/bblayers.conf
$ echo 'BBLAYERS:append = " ${BSPDIR}/layers/meta-nn "' >> conf/bblayers.conf
Add the meta-tensorflow and meta-nn recipes and some image processing libraries to your image:
$ echo 'IMAGE_INSTALL:append = " packagegroup-rity-mtk-neuropilot packagegroup-rity-ai-ml "' >> conf/local.conf
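Optionally, verify that both layers are now part of the build configuration (bitbake-layers is part of the standard Yocto tooling; run it from the build directory):
$ bitbake-layers show-layers | grep -E 'meta-tensorflow|meta-nn'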
Building
Build the seco-image-clea-os-full image for your target SoM:
$ bitbake seco-image-clea-os-full
Flashing the image
To flash your image to the board, see the Installation Guide for your SoM.
To enable the APU, you additionally need to load the apusys.dtbo overlay.
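If you flash with MediaTek's genio-tools, device tree overlays are typically selected at flash time. The invocation below is a sketch assuming the genio-flash tool is used; the exact flow for your SoM is described in its Installation Guide:
$ genio-flash --load-dtbo apusys.dtbo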
Executing Benchmarks
TensorFlow Lite and Delegates
MediaTek provides an inference example that supports the CPU, the GPU (via the Arm NN delegate), and the APU.
To run the benchmark:
- APU
# benchmark_model --stable_delegate_settings_file=/usr/share/label_image/stable_delegate_settings.json --use_nnapi=false --use_xnnpack=false --use_gpu=false --min_secs=20 --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite
- GPU with Arm NN delegate
# benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.29 --external_delegate_options="backends:GpuAcc,CpuAcc" --num_runs=10
- CPU
# benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=8 --num_runs=10
The table below compares inference times for this benchmark (in Performance mode):
SOM                                 Inference Time   FPS (1/Inference Time)
----------------------------------- ---------------- ------------------------
SECO E58 Genio700 - CPU only        7.12 ms          140.45 fps
SECO E58 Genio700 - GPU w/ Arm NN   11.08 ms         90.25 fps
SECO E58 Genio700 - APU             1.21 ms          826.45 fps
Neuron SDK
On the Genio 510/700, IoT Yocto includes support for the Neuron SDK, part of the MediaTek NeuroPilot software suite. The Neuron SDK features the Neuron compiler (ncc-tflite), which converts TFLite models into MediaTek-proprietary binaries (DLA, Deep Learning Archive) for deployment on MediaTek platforms. It also offers the Neuron Runtime API, providing a set of C/C++ APIs for creating a runtime environment, parsing compiled model files, and performing on-device network inference. For detailed information, refer to the Neuron SDK chapter of MediaTek's Documentation.
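As a minimal sketch of this flow, the commands below compile the MobileNet model used earlier and run a single inference on the device. The flag spellings follow MediaTek's Neuron SDK documentation and may differ between releases, and input.bin is a hypothetical raw input tensor you must provide yourself:
# ncc-tflite --arch=mdla3.0 /usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite -o mobilenet_v1.dla
# neuronrt -a mobilenet_v1.dla -i input.bin -o output.bin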
A Python benchmark application for image recognition is also pre-installed in the /usr/share/benchmark_dla directory.
Run it with python3 benchmark.py --auto. This automatically finds all TFLite models in the /usr/share/benchmark_dla directory, compiles them into DLA files, and performs inference on the APU. The benchmark results are saved in /usr/share/benchmark_dla/benchmark.log.
You can also copy the model from the TensorFlow Lite example into this directory:
# cp /usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite /usr/share/benchmark_dla
Then run the benchmark:
# cd /usr/share/benchmark_dla
# python3 benchmark.py --auto
Afterwards, check the inference time of each model:
# cat benchmark.log
/usr/share/benchmark_dla/ResNet50V2_224_1.0_quant.tflite, mdla3.0, avg inference time: 14.77
/usr/share/benchmark_dla/ResNet50V2_224_1.0_quant.tflite, vpu, avg inference time: 59.46
/usr/share/benchmark_dla/mobilenet_v1_1.0_224_quant.tflite, mdla3.0, avg inference time: 1.83
/usr/share/benchmark_dla/mobilenet_v1_1.0_224_quant.tflite, vpu, avg inference time: 17.46
/usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.tflite, mdla3.0, avg inference time: 1.85
/usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.tflite, vpu, avg inference time: 16.79
/usr/share/benchmark_dla/inception_v3_quant.tflite, mdla3.0, avg inference time: 15.14
/usr/share/benchmark_dla/inception_v3_quant.tflite, vpu, avg inference time: 74.64
/usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite, mdla3.0, avg inference time: 11.62
/usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite, vpu, avg inference time: 24.46
See below a comparison of inference times executing the mobilenet_v1_1.0_224_quant.tflite model (in Performance mode):
SOM                                 Inference Time   FPS (1/Inference Time)
----------------------------------- ---------------- ------------------------
SECO E58 Genio700 - mdla3.0         1.83 ms          546.45 fps
SECO E58 Genio700 - vpu             17.46 ms         57.27 fps
Performance Mode
To force the CPU, GPU, and APU to run at maximum frequency, you can follow these commands for each component:
CPU at Maximum Frequency
Set Performance Mode for CPU Governor
# echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
# echo performance > /sys/devices/system/cpu/cpufreq/policy6/scaling_governor
Disable CPU Idle
On Genio 510
# for j in {2..0}; do for i in {5..0} ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
On Genio 700
# for j in {2..0}; do for i in {7..0} ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
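You can confirm the resulting frequencies through the standard cpufreq sysfs attributes:
# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq
# cat /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq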
GPU at Maximum Frequency
Set Performance Mode for GPU Governor
# echo performance > /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/governor
Alternatively, you can refer to the Adjust GPU Frequency guide from MediaTek for more details on manually setting the frequency.
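To confirm the GPU clock, read the current and maximum rates from the same devfreq node (standard devfreq attributes):
# cat /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/cur_freq
# cat /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/max_freq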
APU at Maximum Frequency
Set APU to Maximum Frequency
# echo dvfs_debug 0 > /sys/kernel/debug/apusys/power
You can also refer to the QoS Tuning Flow and set qos.boostValue to NEURONRUNTIME_BOOSTVALUE_MAX for maximum performance.
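When benchmarking with the Neuron runtime CLI, a boost value can also be passed on the command line. The -b option below is an assumption based on MediaTek's neuronrt documentation; verify it against the tool version on your image:
# neuronrt -a mobilenet_v1.dla -b 100 -i input.bin -o output.bin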
Disable Thermal Throttling
Raising the trip points to 115 °C prevents thermal throttling from reducing clock speeds during benchmarking:
# echo 115000 > /sys/class/thermal/thermal_zone0/trip_point_0_temp
# echo 115000 > /sys/class/thermal/thermal_zone0/trip_point_1_temp
# echo 115000 > /sys/class/thermal/thermal_zone0/trip_point_2_temp
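Since this effectively disables thermal protection, it is worth monitoring the SoC temperature (reported in millidegrees Celsius) during long benchmark runs:
# cat /sys/class/thermal/thermal_zone0/temp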
These steps ensure that the CPU, GPU, and APU operate at their highest performance levels.
Additional Resources
Refer to MediaTek's documentation for demo applications and additional resources about the APU.