Speech-to-Text: Data Preprocessing on Speech Recognition Task
A couple of weeks ago, I posted an article about input-output data alignments for the Automatic Speech Recognition (ASR) task. However, I assumed the input data was the same across the models. This article discusses several techniques for preprocessing the speech data itself.
The first technique is Handcrafted Data, or Feature Extraction. In this technique, we decompose the audio into a Spectrogram. This extraction can help capture patterns in the audio by exposing its frequency components. Later, we can apply Mel Filtering to bring the representation closer to human hearing. The second technique is to learn the feature representation, which can be done by placing a neural-network (NN) layer directly on top of the raw data.
A. Handcrafted Data: Spectrogram
The first technique is to transform the sound wave of speech into a spectrogram. There are several steps for doing this (a code sketch of the whole pipeline follows the list):
- Dithering: Add low-amplitude Gaussian noise to the data to avoid exact zero values.
- Removing the DC offset (centering): Subtract the mean from the signal so that it is zero-centered.
- Pre-emphasis: Boost the high-frequency components, which have lower amplitude than the low-frequency components and are therefore more easily masked by noise.
- Discrete Fourier Transform (DFT): Decompose the signal into its frequency components with the Fourier Transform.
- Short-Time Fourier Transform (STFT): Extract the spectra with a 25ms window size and a 10ms hop.
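Below is a minimal sketch of this pipeline using NumPy and SciPy's scipy.signal.spectrogram (see the SciPy link below). The dithering level and the pre-emphasis coefficient of 0.97 are common defaults assumed for illustration, not values from the course material.

```python
import numpy as np
from scipy.signal import spectrogram

def waveform_to_spectrogram(signal, sample_rate=16000,
                            dither_std=1e-5, preemph_coeff=0.97):
    signal = np.asarray(signal, dtype=np.float64)
    # Dithering: add low-amplitude Gaussian noise to avoid exact zeros.
    signal = signal + np.random.randn(signal.shape[0]) * dither_std
    # Remove the DC offset: subtract the mean so the signal is zero-centered.
    signal = signal - signal.mean()
    # Pre-emphasis: first-order high-pass filter y[t] = x[t] - a * x[t-1].
    signal = np.append(signal[0], signal[1:] - preemph_coeff * signal[:-1])
    # STFT: apply the DFT over 25 ms windows with a 10 ms hop.
    win_length = int(0.025 * sample_rate)   # 400 samples at 16 kHz
    hop_length = int(0.010 * sample_rate)   # 160 samples at 16 kHz
    freqs, times, spec = spectrogram(signal, fs=sample_rate, window="hann",
                                     nperseg=win_length,
                                     noverlap=win_length - hop_length,
                                     mode="magnitude")
    # Power spectrogram: squared magnitude per frequency bin per frame.
    return freqs, times, spec ** 2
```

The result has one row per frequency bin and one column per 10 ms frame; this power spectrogram is also the starting point for the MFCC steps in the next section.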
B. Handcrafted Data: Mel Frequency Cepstral Coefficients (MFCCs)
MFCCs provide a data representation that better matches human hearing. Unlike the Spectrogram, whose main objective is to decompose the audio into its frequency components, MFCC uses Mel Filters to warp the frequencies onto a scale that is close to human perception of sound. Here are the steps in more detail (a code sketch follows the list):
- Generate Spectrogram data.
- Extract the Mel Spectrogram: This is the dot product between the Spectrogram data and the n Mel Filters. First we define the number of Mel filters, then we perform the dot product with every Mel filter.
- Log Transform: Use a log transformation to get the log-Mel spectrum.
- Discrete Cosine Transform (DCT): Apply DCT to the log transform result to obtain the MFCCs.
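Continuing from the spectrogram sketch above, the remaining steps can be written roughly as follows. The triangular Mel filterbank construction is the standard textbook recipe, and the choice of 40 filters and 13 coefficients is an assumption for illustration, not a value taken from the course material.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale.
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bin_points[m - 1], bin_points[m], bin_points[m + 1]
        for k in range(left, centre):          # rising slope of the triangle
            filters[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope of the triangle
            filters[m - 1, k] = (right - k) / max(right - centre, 1)
    return filters

def mfcc(power_spec, sample_rate=16000, n_fft=400, n_mels=40, n_mfcc=13):
    # power_spec: (n_fft // 2 + 1, n_frames), e.g. from waveform_to_spectrogram above.
    # Mel Spectrogram: dot product of every frame with each of the n_mels filters.
    mel_spec = mel_filterbank(n_mels, n_fft, sample_rate) @ power_spec
    # Log transform to obtain the log-Mel spectrum.
    log_mel = np.log(mel_spec + 1e-10)
    # DCT along the Mel axis; keep the first n_mfcc coefficients as the MFCCs.
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]
```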
Learn more about Spectrograms and MFCCs: https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9
Generate Spectrogram using Scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html
C. Feature Learning: Convolutional Neural Network (CNN)
Essentially, we could apply an RNN layer such as an LSTM or GRU directly on top of the raw audio input. However, the variability of the high-dimensional raw audio makes model learning inefficient. Introducing a Convolutional Neural Network (CNN) layer on the speech data is therefore preferable: its convolution filters reduce the variability by transforming the waveform into a lower-dimensional feature representation. The main objective is to learn a feature representation of the sound wave and later use this representation in place of the handcrafted data (A & B) in the ASR system.
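As a rough illustration of this idea (the layer sizes and strides below are assumptions made for the sketch, not the architecture of any of the referenced models), a stack of strided 1-D convolutions in PyTorch can turn the raw waveform into a lower-rate sequence of feature vectors, which the rest of the ASR model then consumes instead of spectrograms or MFCCs:

```python
import torch
import torch.nn as nn

class RawAudioEncoder(nn.Module):
    """Strided 1-D CNN that maps raw waveform samples to frame-level features."""
    def __init__(self, feature_dim=256):
        super().__init__()
        # Each Conv1d downsamples in time; the overall stride is
        # 5 * 4 * 2 * 2 = 80 samples, i.e. one feature vector per 5 ms at 16 kHz.
        self.conv = nn.Sequential(
            nn.Conv1d(1, feature_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(feature_dim, feature_dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(feature_dim, feature_dim, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(feature_dim, feature_dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, waveform):
        # waveform: (batch, n_samples) -> features: (batch, n_frames, feature_dim)
        features = self.conv(waveform.unsqueeze(1))
        return features.transpose(1, 2)

encoder = RawAudioEncoder()
frames = encoder(torch.randn(8, 16000))  # a batch of one-second 16 kHz waveforms
print(frames.shape)                      # (8, n_frames, 256)
```

The resulting frame-level features play the same role as the spectrogram or MFCC frames above, but they are learned jointly with (or pre-trained for) the rest of the ASR model.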
Read More on Representation Learning with Contrastive Predictive Coding (CPC): https://arxiv.org/abs/1807.03748
Comparison
At the time of writing this article, Oord et al. (2018) had compared handcrafted features (MFCC) with learned features under fully supervised learning. The result showed that the handcrafted features outperformed the feature learning (CNN layer). However, producing MFCCs can be computationally expensive during inference because of the chain of transformations involved: DFT/STFT, Mel filtering, the logarithm, and the DCT. Handcrafted features are also rigid, so they may only suit certain use cases; for example, a model for a language without a written form, a model for music, or a model for multi-task learning may not be well served by them. Hence, feature learning can be a good alternative.
Reference:
- Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Automatic Speech Recognition (ASR) class, School of Informatics, University of Edinburgh
- [accessed from ASR Slide] Chapter 1–5, Oppenheim, Willsky, and Nawab, “Signals and Systems,” 1997
- [accessed from ASR Slide] Chapter 2, O’Shaughnessy, “Speech Communications: Human and Machine,” 2000