tflm_kws
Overview
Keyword spotting example based on Keyword spotting for Microcontrollers [1].
Input data preprocessing
Raw audio data is pre-processed first: a spectrogram is calculated. A 40 ms window slides over a one-second audio sample with a 20 ms stride. For each window, audio frequency strengths are computed using an FFT and turned into a set of Mel-Frequency Cepstral Coefficients (MFCC). Only the first 10 coefficients are taken into account. The window slides over the sample 49 times, so a matrix with 49 rows and 10 columns is created. This matrix is called a spectrogram.
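A minimal Python sketch of this preprocessing, assuming the python_speech_features package (an illustrative choice; it approximates, but does not exactly match, the TensorFlow MFCC operation used by the example):
  from scipy.io import wavfile
  from python_speech_features import mfcc

  rate, data = wavfile.read('off.wav')              # one-second 16 kHz sample
  features = mfcc(data, samplerate=rate,
                  winlen=0.040, winstep=0.020,      # 40 ms window, 20 ms stride
                  numcep=10, nfilt=40, nfft=1024)   # keep the first 10 coefficients
  print(features.shape)                             # (49, 10)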
In the example, static audio samples (“off”, “right”) are evaluated first, regardless of whether a microphone is connected. Afterwards, audio is processed directly from the microphone.
Classification
The spectrogram is fed into a neural network. The neural network is a depthwise separable convolutional neural network based on MobileNet described in [2]. The model produces a probability vector for the following classes: “Silence”, “Unknown”, “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop” and “go”.
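A minimal sketch (an assumption, not the example's get_top_n/output_postproc code) of turning the quantized 12-element output vector into a label and a confidence value:
  import numpy as np

  LABELS = ["Silence", "Unknown", "yes", "no", "up", "down",
            "left", "right", "on", "off", "stop", "go"]

  def top_category(output_u8):
      scores = np.asarray(output_u8, dtype=np.float32)
      best = int(np.argmax(scores))                # index of the strongest class
      percent = 100.0 * scores[best] / 255.0       # quantized softmax sums to 255
      return LABELS[best], percent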
Quantization
The NN model is quantized to run faster on MCUs; it takes in a quantized input and produces a quantized output. An input spectrogram needs to be scaled from the range [-247, 30] to the range [0, 255] and rounded to integers. Values lower than zero are set to zero and values exceeding 255 are set to 255. The output of the softmax function is a vector with components in the interval [0, 255] that add up to 255.
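A minimal NumPy sketch of the input scaling described above (the range limits come from the text; the implementation itself is an illustrative assumption, not the example's C++ code):
  import numpy as np

  def quantize_input(spectrogram, in_min=-247.0, in_max=30.0):
      # Map [-247, 30] linearly onto [0, 255], then round and clip to uint8.
      scaled = (spectrogram - in_min) * 255.0 / (in_max - in_min)
      return np.clip(np.round(scaled), 0, 255).astype(np.uint8)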
HOW TO USE THE APPLICATION: Say different keywords so that the microphone can catch them. The voice recorded from the microphone can be heard through headphones connected to the audio jack. Note that the semihosting implementation causes a slower or discontinuous audio experience. Select UART in ‘Project Options’ during project import to use an external debug console via UART (virtual COM port).
[1] https://github.com/ARM-software/ML-KWS-for-MCU
[2] https://arxiv.org/abs/1704.04861
Files:
main.cpp - example main function
ds_cnn_s.tflite - pre-trained TensorFlow Lite model converted from DS_CNN_S.pb
(source: https://github.com/ARM-software/ML-KWS-for-MCU/blob/master/Pretrained_models/DS_CNN/DS_CNN_S.pb)
(for details on how to quantize and convert a model see the eIQ TensorFlow Lite
User’s Guide, which can be downloaded with the MCUXpresso SDK package)
off.wav - waveform audio file of the word to recognize
(source: Speech Commands Dataset available at
https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz,
file speech_commands_test_set_v0.02/off/0ba018fc_nohash_2.wav)
right.wav - waveform audio file of the word to recognize
(source: Speech Commands Dataset available at
https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz,
file speech_commands_test_set_v0.02/right/0a2b400e_nohash_1.wav)
audio_data.h - waveform audio files (“off”, “right”) converted into C language arrays
of audio signal values using Python with the Scipy package:
from scipy.io import wavfile
# Read one of the sample files and dump its samples as a C macro.
rate, data = wavfile.read('off.wav')
with open('wav_data.h', 'w') as fout:
    print('#define WAVE_DATA {', file=fout)
    data.tofile(fout, ',', '%d')   # comma-separated integer samples
    print('}\n', file=fout)
train.py - model training script based on https://www.tensorflow.org/tutorials/audio/simple_audio
timer.c - timer source code
audio/* - audio capture and pre-processing code
audio/mfcc.cpp - MFCC feature extraction matching the TensorFlow MFCC operation
audio/kws_mfcc.cpp - audio buffer handling for MFCC feature extraction
model/get_top_n.cpp - top results retrieval
model/model_data.h - model data from the ds_cnn_s.tflite file converted to
a C language array using the xxd tool (distributed with the Vim editor
at www.vim.org)
model/model.cpp - model initialization and inference code
model/model_ds_cnn_ops.cpp - model operations registration
model/output_postproc.cpp - model output processing
Running the demo
The log below shows the output of the demo in the terminal window:
Keyword spotting example using a TensorFlow Lite model. Detection threshold: 25
Static data processing: Expected category: off
Inference time: 32 ms
Detected: off (100%)
Expected category: right
Inference time: 32 ms
Detected: right (98%)
Microphone data processing:
Inference time: 32 ms
Detected: No label detected (0%)
Inference time: 32 ms
Detected: up (85%)
Inference time: 32 ms
Detected: left (97%)
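The “No label detected” lines suggest that results below the detection threshold are not reported; a minimal sketch of such gating (an assumption, not the actual model/output_postproc.cpp logic):
  DETECTION_THRESHOLD = 25  # percent, as printed at the start of the log

  def format_result(label, percent):
      if percent < DETECTION_THRESHOLD:
          return "No label detected (0%)"
      return "%s (%d%%)" % (label, percent)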