MediaTek NPU
Introduction
The MediaTek AI Processing Unit (APU) is a high-performance hardware engine designed for deep learning, optimized for both bandwidth and power efficiency. Its architecture features a combination of big, small, and tiny cores, making it ideal for a wide range of modern applications, including AI-powered camera functions, virtual assistants, and OS or in-app enhancements.
The latest APU 5.0 introduces a cluster AI architecture, delivering up to 3.6 TOPS.
Overview
On the Genio 700/510 platforms, various software solutions are available to accelerate AI computing on the GPU and APU.
GPU Neural Network Acceleration
TensorFlow Lite with hardware acceleration is available to facilitate the development and deployment of machine learning models. By utilizing TensorFlow Lite Delegates, hardware acceleration for TFLite models can be enabled through on-device accelerators such as the GPU and Digital Signal Processor (DSP).
IoT Yocto comes pre-integrated with two key delegates:
- GPU Delegate: This uses OpenGL ES compute shaders on the device to perform inference on TFLite models; a quick smoke test follows this list.
- Arm NN Delegate: Arm NN is an open-source software library designed to support machine learning on Arm hardware. It acts as a bridge between neural network frameworks and Cortex-A CPUs or Arm Mali GPUs.
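As a minimal smoke test of GPU-delegate inference, the benchmark_model tool shipped with the image can be pointed at a TFLite model with the --use_gpu flag (a sketch only; the model path assumes the label_image sample data used in the benchmarks later in this article):
# benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=true --num_runs=10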
APU Neural Network Acceleration
MediaTek's proprietary machine learning solution, NeuroPilot, is available on IoT Yocto for Genio 510/700 platforms.
NeuroPilot is a suite of software tools and APIs that forms the core of MediaTek’s AI ecosystem. It allows developers to efficiently build and deploy AI applications on edge devices, speeding up AI processes while ensuring data privacy.
On the Genio 510/700 platform, both online and offline inference paths are supported through the NeuroPilot suite:
- Neuron Stable Delegate: This is MediaTek's Neuron Delegate, implemented using the TensorFlow Lite Stable Delegate interface.
- Neuron SDK: A tool suite that includes the Neuron compiler (ncc-tflite), which converts TFLite models into MediaTek-proprietary binaries (DLA, Deep Learning Archive) for deployment on MediaTek platforms. This results in optimized models with reduced latency and memory usage. The Neuron SDK also provides the Neuron Runtime API, which can be called from C/C++ programs to create a runtime environment, parse compiled models, and perform on-device inference.
Clea OS
Prerequisites
- One of the following SECO SoMs:
- E58 Genio 700 (CPU/GPU/APU support).
- E58 Genio 510 (CPU/GPU/APU support).
- A compatible Carrier Board.
- Read the Build a Reference Image with Yocto Project article.
Adding APU recipes to Reference Images for Yocto Project
Cloning the Clea OS BSP repository
In an empty directory, use git-repo to obtain the latest version of the BSP.
Create a folder for the project:
$ mkdir -p ~/projects/clea-os
$ cd ~/projects/clea-os
Initialize the manifest environment and synchronize the repositories:
$ repo init -u https://git.seco.com/clea-os/seco-manifest.git -b kirkstone
$ repo sync -j$(nproc) --fetch-submodules --no-clone-bundle
Define the configuration that you need to build (e.g. 'seco_smarc_e58_genio700_clea-os_embedded_wayland', which refers to the Clea OS Embedded image for the E58 Genio 700 board):
$ . ./seco-setup.sh -d seco_smarc_e58_genio700_clea-os_embedded_wayland
$ . ./seco-setup.sh -c
Getting APU
APU support is provided by the meta-tensorflow and meta-nn Yocto layers.
The next steps expect the current directory to be ~/projects/clea-os/layers.
Clone the meta-tensorflow and meta-nn Git repositories into your project layers directory:
$ git clone https://git.yoctoproject.org/meta-tensorflow/ && cd meta-tensorflow
$ git checkout ec20e19319f9eb89ceadc04923cc5bc75e865692 && cd ..
$ git clone -b rity-kirkstone-v23.2 https://gitlab.com/mediatek/aiot/rity/meta-nn
Adding the recipes to the distribution
Add meta-tensorflow and meta-nn to the set of layers included in the project:
$ cd ~/projects/clea-os/build_e58_genio700_embedded_wayland
$ echo 'BBLAYERS:append = " ${BSPDIR}/layers/meta-tensorflow "' >> conf/bblayers.conf
$ echo 'BBLAYERS:append = " ${BSPDIR}/layers/meta-nn "' >> conf/bblayers.conf
Add the meta-tensorflow and meta-nn recipes and some image processing libraries to your image:
$ echo 'IMAGE_INSTALL:append = " packagegroup-rity-mtk-neuropilot packagegroup-rity-ai-ml "' >> conf/local.conf
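Optionally, verify that both layers are now part of the build configuration (bitbake-layers is part of the standard Yocto tooling; run it from the build directory):
$ bitbake-layers show-layers | grep -E 'meta-tensorflow|meta-nn'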
Building
Build the seco-image-clea-os-full image for your target SoM:
$ bitbake seco-image-clea-os-full
Flashing the image
To flash your image to the board, see the Installation Guide for your SoM.
To enable the APU, you additionally need to load the apusys.dtbo overlay.
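If you flash with MediaTek's genio-tools, device tree overlays are typically selected at flash time. The invocation below is a sketch assuming the genio-flash tool is used; the exact flow for your SoM is described in its Installation Guide:
$ genio-flash --load-dtbo apusys.dtbo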
Executing Benchmarks
TensorFlow Lite and Delegates
MediaTek provides an inference example that supports the CPU, the GPU (via the Arm NN delegate), and the APU.
To run the benchmark:
- APU
# benchmark_model --stable_delegate_settings_file=/usr/share/label_image/stable_delegate_settings.json --use_nnapi=false --use_xnnpack=false --use_gpu=false --min_secs=20 --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite
- GPU with Arm NN delegate
# benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.29 --external_delegate_options="backends:GpuAcc,CpuAcc" --num_runs=10
- CPU
# benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=8 --num_runs=10
The table below compares inference times for this benchmark (in Performance mode):
SOM                                 Inference Time   FPS (1/Inference Time)
----------------------------------- ---------------- ------------------------
SECO E58 Genio700 - CPU only        7.12 ms          140.45 fps
SECO E58 Genio700 - GPU w/ Arm NN   11.08 ms         90.25 fps
SECO E58 Genio700 - APU             1.21 ms          826.45 fps
Neuron SDK
On the Genio 510/700, IoT Yocto includes support for the Neuron SDK, part of the MediaTek NeuroPilot software suite. The Neuron SDK features the Neuron compiler (ncc-tflite), which converts TFLite models into MediaTek-proprietary binaries (DLA, Deep Learning Archive) for deployment on MediaTek platforms. It also offers the Neuron Runtime API, providing a set of C/C++ APIs for creating a runtime environment, parsing compiled model files, and performing on-device network inference. For detailed information, refer to the Neuron SDK chapter of MediaTek's Documentation.
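As a minimal sketch of this flow, the commands below compile the MobileNet model used earlier and run a single inference on the device. The flag spellings follow MediaTek's Neuron SDK documentation and may differ between releases, and input.bin is a hypothetical raw input tensor you must provide yourself:
# ncc-tflite --arch=mdla3.0 /usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite -o mobilenet_v1.dla
# neuronrt -a mobilenet_v1.dla -i input.bin -o output.bin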
A Python benchmark application for image recognition is also pre-installed in the /usr/share/benchmark_dla directory.
Run it with python3 benchmark.py --auto. This automatically finds all TFLite models in the /usr/share/benchmark_dla directory, compiles them into DLA files, and performs inference on the APU. The benchmark results are saved in /usr/share/benchmark_dla/benchmark.log.
You can also copy the model from the TensorFlow Lite example into this directory:
# cp /usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite /usr/share/benchmark_dla
Then run the benchmark:
# cd /usr/share/benchmark_dla
# python3 benchmark.py --auto
Afterwards, check the inference time of each model:
# cat benchmark.log
/usr/share/benchmark_dla/ResNet50V2_224_1.0_quant.tflite, mdla3.0, avg inference time: 14.77
/usr/share/benchmark_dla/ResNet50V2_224_1.0_quant.tflite, vpu, avg inference time: 59.46
/usr/share/benchmark_dla/mobilenet_v1_1.0_224_quant.tflite, mdla3.0, avg inference time: 1.83
/usr/share/benchmark_dla/mobilenet_v1_1.0_224_quant.tflite, vpu, avg inference time: 17.46
/usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.tflite, mdla3.0, avg inference time: 1.85
/usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.tflite, vpu, avg inference time: 16.79
/usr/share/benchmark_dla/inception_v3_quant.tflite, mdla3.0, avg inference time: 15.14
/usr/share/benchmark_dla/inception_v3_quant.tflite, vpu, avg inference time: 74.64
/usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite, mdla3.0, avg inference time: 11.62
/usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite, vpu, avg inference time: 24.46
See below a comparison of inference times executing the mobilenet_v1_1.0_224_quant.tflite model (in Performance mode):
SOM                                 Inference Time   FPS (1/Inference Time)
----------------------------------- ---------------- ------------------------
SECO E58 Genio700 - mdla3.0         1.83 ms          546.45 fps
SECO E58 Genio700 - vpu             17.46 ms         57.27 fps
Performance Mode
To force the CPU, GPU, and APU to run at maximum frequency, you can follow these commands for each component:
CPU at Maximum Frequency
Set Performance Mode for CPU Governor
# echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
# echo performance > /sys/devices/system/cpu/cpufreq/policy6/scaling_governor
Disable CPU Idle
On Genio 510
# for j in {2..0}; do for i in {5..0} ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
On Genio 700
# for j in {2..0}; do for i in {7..0} ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
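You can confirm the resulting frequencies through the standard cpufreq sysfs attributes:
# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq
# cat /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq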
GPU at Maximum Frequency
Set Performance Mode for GPU Governor
# echo performance > /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/governor
Alternatively, you can refer to the Adjust GPU Frequency guide from MediaTek for more details on manually setting the frequency.
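To confirm the GPU clock, read the current and maximum rates from the same devfreq node (standard devfreq attributes):
# cat /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/cur_freq
# cat /sys/devices/platform/soc/13000000.mali/devfreq/13000000.mali/max_freq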
APU at Maximum Frequency
Set APU to Maximum Frequency
# echo dvfs_debug 0 > /sys/kernel/debug/apusys/power
You can also refer to the QoS Tuning Flow and set qos.boostValue to NEURONRUNTIME_BOOSTVALUE_MAX for maximum performance.
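When benchmarking with the Neuron runtime CLI, a boost value can also be passed on the command line. The -b option below is an assumption based on MediaTek's neuronrt documentation; verify it against the tool version on your image:
# neuronrt -a mobilenet_v1.dla -b 100 -i input.bin -o output.bin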
Disable Thermal Throttling
Raising the trip points to 115 °C prevents thermal throttling from reducing clock speeds during benchmarking:
# echo 115000 > /sys/class/thermal/thermal_zone0/trip_point_0_temp
# echo 115000 > /sys/class/thermal/thermal_zone0/trip_point_1_temp
# echo 115000 > /sys/class/thermal/thermal_zone0/trip_point_2_temp
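Since this effectively disables thermal protection, it is worth monitoring the SoC temperature (reported in millidegrees Celsius) during long benchmark runs:
# cat /sys/class/thermal/thermal_zone0/temp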
These steps ensure that the CPU, GPU, and APU operate at their highest performance levels.
Additional Resources
Refer to MediaTek's documentation for demo applications and additional resources about the APU.