Speech-to-Text: Automatic Speech Recognition (ASR) Models for input-output alignment
If two people record themselves reading exactly the same paragraph at different speeds, the audio files will have different lengths. Yet a speech-to-text model can map both recordings to the exact same transcription. How do you think that’s possible?
# Illustration
XA = [xa_1, xa_2, ..., xa_n]  # input A
XB = [xb_1, xb_2, ..., xb_m]  # input B
Y = [y_1, y_2, ..., y_o]      # output/label
where n != m != o
Several follow-up questions came to mind:
- How do they construct the label? Do they have a human labeler annotating the speech audio second by second? That would be a rip-off!
- Even if we could construct a label for every second, how does the model handle different audio lengths for the same label?
Those questions arose and led me to learn Automatic Speech Recognition (ASR) this year. The task is to build an ML model that maps speech audio files to human-readable text. This Medium post covers various ASR models across generations, particularly the time alignment between input and output.
For simplicity, assume we slice the audio data (1 file = 1 sentence) into small parts with a window size of 25 ms per frame and a hop of 10 ms to the next frame. Let’s call each part an acoustic frame. The transcription is also provided. In reality, some models use spectrogram data with further transformations (not covered here).
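As a rough sketch (assuming a 16 kHz sample rate, NumPy, and a hypothetical frame_audio helper), the framing step could look like this:

# Minimal framing sketch: 25 ms windows with a 10 ms hop (assumed 16 kHz audio)
import numpy as np

def frame_audio(signal, sample_rate=16_000, frame_ms=25.0, hop_ms=10.0):
    """Return a (num_frames, frame_length) array of overlapping acoustic frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(num_frames)])

# Two recordings of the same sentence at different speeds give different
# numbers of frames (n != m) but share the same transcription Y.
fast = np.random.randn(16_000 * 2)   # ~2 s of audio
slow = np.random.randn(16_000 * 3)   # ~3 s of audio
print(frame_audio(fast).shape, frame_audio(slow).shape)   # (198, 400) (298, 400)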
1. Hidden Markov Model (HMM) based models
The first approach uses an HMM to produce the output. This approach usually consists of several transducers, such as:
- HMM transducer (H): the HMM can emit several observations (acoustic frames) from one state thanks to its self-loop probability. However, one observation cannot belong to two states (we pick the most probable state for each observation at a particular time). This transducer translates HMM states into context-dependent (CD) phones.
- Context-Dependency (C): translates context-dependent phones into phones.
- Pronunciation Lexicon (L): phone-to-word transducer.
- Word Grammar (G): word-to-sentence transducer, i.e., the language model (LM) used to construct the sentence.
The idea is simply that the model maps each acoustic frame to the most probable HMM state, which maps to a CD phone; sets of CD phones then map to phones, sets of phones to words, and sets of words to sentences. Read more on the decoding graph in Kaldi (https://kaldi-asr.org/doc/graph.html).
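As a toy, non-WFST illustration (the lexicon entries and bigram scores below are made up for this post), the L and G stages behave conceptually like lookup tables:

# Toy (non-WFST) sketch of the L and G stages: phones -> words -> sentence score
lexicon = {("p", "ey"): "pay", ("b", "ih", "l"): "bill"}   # L: phones -> word
bigram_lm = {("pay", "bill"): 0.9, ("bill", "pay"): 0.1}   # G: toy word-pair scores

def words_from_phones(phone_groups):
    """Map groups of phones to words via the pronunciation lexicon."""
    return [lexicon[tuple(group)] for group in phone_groups]

words = words_from_phones([["p", "ey"], ["b", "ih", "l"]])
print(words, bigram_lm[tuple(words)])   # ['pay', 'bill'] 0.9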
Here’s the thing: each state in the HMM has a self-loop probability, so different speech lengths can lead to the same output.
Example: two speakers produce 5 and 9 frames while saying “bill”. Due to the self-loop probability, both sets of observations remain in the HMM states labeled “bill”.
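Conceptually, both alignments collapse to the same state sequence once the self-loops are merged (a simplification with one hypothetical state per phone):

# Toy illustration (not Kaldi): two alignments of different lengths map to the
# same state sequence for "bill" because states can repeat via self-loops.
align_fast = ["b", "b", "ih", "l", "l"]                        # 5 frames
align_slow = ["b", "b", "b", "ih", "ih", "ih", "l", "l", "l"]  # 9 frames

def collapse_self_loops(state_per_frame):
    """Merge consecutive repeats, i.e., the residue of self-loop transitions."""
    collapsed = []
    for state in state_per_frame:
        if not collapsed or collapsed[-1] != state:
            collapsed.append(state)
    return collapsed

assert collapse_self_loops(align_fast) == collapse_self_loops(align_slow) == ["b", "ih", "l"]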
An HMM is parameterized by transition probabilities (including self-loops) and observation (emission) probabilities. The model employs the Viterbi algorithm as a decoder to produce the outputs. The training objective is to find the best transition and emission probabilities, either with the EM algorithm (maximizing the likelihood of the data given those parameters) or by using a neural network in tandem with the HMM.
Read More on HMM, Viterbi Algorithm, and Forward-Backward Training: https://web.stanford.edu/~jurafsky/slp3/A.pdf
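For intuition, here is a minimal Viterbi decoder sketch in NumPy. The probabilities are made-up toy values; a real ASR decoder works over the composed HCLG graph with pruning:

# Minimal Viterbi sketch for a discrete-observation HMM (toy values, not ASR-scale)
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence.

    obs:     (T,) observation indices
    start_p: (S,) initial state probabilities
    trans_p: (S, S) transition probabilities (diagonal entries are self-loops)
    emit_p:  (S, O) observation (emission) probabilities
    """
    T, S = len(obs), len(start_p)
    log_delta = np.zeros((T, S))
    backptr = np.zeros((T, S), dtype=int)
    log_delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        scores = log_delta[t - 1][:, None] + np.log(trans_p)   # (S, S)
        backptr[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    path = [int(log_delta[-1].argmax())]        # backtrace the best path
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy usage: 2 states, 2 observation symbols, made-up probabilities.
print(viterbi(obs=np.array([0, 0, 1]),
              start_p=np.array([0.6, 0.4]),
              trans_p=np.array([[0.7, 0.3], [0.4, 0.6]]),
              emit_p=np.array([[0.9, 0.1], [0.2, 0.8]])))   # [0, 0, 1]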
2. CTC Model
The HMM-based model is quite complex because it combines several components. The next approach is the Connectionist Temporal Classification (CTC) model, which maps the sequence of acoustic features X to the sequence of outputs Y directly. It replaces the HMM alignment system with CTC compression (it is alignment-free). For simplicity, let’s treat the task as a multi-class classification problem with characters as the labels.
Here’s the fundamental concept of this system:
- This approach generates all possible valid alignments of the output during training. Given the input-output pair X = [x1, x2, x3, x4, x5, x6, x7] and Y = “try”, there are plenty of valid alignment candidates (e.g., “ttrryyy”, “ttrrryy”, “tttttry”).
- Introduce a blank symbol “ε” as the separator between outputs, e.g., “tεrεyyy” or “ttεrεyy” for the word “try” with 7 frames. With blanks interleaved, the expanded target sequence can grow to roughly double its original length.
- The training objective is to minimize the negative log-likelihood, where the likelihood is computed by marginalizing over the set of valid alignments.
- Perform CTC compression: remove repeated values, then remove the blank tokens (see the sketch after this list).
- Train against the transcription as the ground truth (using the loss above) and evaluate with the Word Error Rate (WER) metric.
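Here is a minimal sketch of the CTC compression step described above (the function name is illustrative):

# CTC compression sketch: merge repeated symbols, then drop the blank token.
BLANK = "ε"

def ctc_collapse(alignment):
    output, prev = [], None
    for symbol in alignment:
        if symbol != prev and symbol != BLANK:
            output.append(symbol)
        prev = symbol
    return "".join(output)

# Different valid alignments of 7 frames all collapse to "try".
for alignment in ["ttrryyy", "ttεrεyy", "tεrεyyy", "tttttry"]:
    assert ctc_collapse(alignment) == "try"

Note that the blank is what lets CTC keep genuinely repeated characters: “heεllεllo” collapses to “hello”, while “hellllo” without blanks would collapse to “helo”.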
This training process allows the model to learn how to align the input to the output. As an extension, Viterbi-style decoding with pruning also reduces the number of valid alignments by filtering low-probability paths out of the computation. In addition, this model still relies on a language model (LM) to construct the whole sentence (the counterpart of the G transducer in the HMM-based model).
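If you happen to use PyTorch, its built-in nn.CTCLoss computes exactly this marginalized negative log-likelihood; the shapes, vocabulary size, and random tensors below are illustrative assumptions:

# Hedged sketch: CTC loss with PyTorch's nn.CTCLoss (illustrative shapes only)
import torch
import torch.nn as nn

T, N, C = 7, 1, 5      # frames, batch size, classes (blank + 4 characters)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.tensor([[1, 2, 3]])          # e.g. ids for "t", "r", "y"
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3])

ctc = nn.CTCLoss(blank=0)                    # index 0 is reserved for the blank ε
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # gradients marginalize over all valid alignments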
Read more: CTC by Hannun https://distill.pub/2017/ctc/
3. Encoder-Decoder Layers
The final approach in this article is the encoder-decoder model with next-token prediction as the objective. The encoder is a sequence model (e.g., an LSTM or a Transformer encoder layer) that learns the correlations between acoustic frames. The decoder layer (in (c) and (d); it is similar to the prediction network in the (b) RNN Transducer) is analogous to the language model (LM) in the CTC approach. Even better, with this decoder we do not have to worry about alignment, because each predicted label is conditioned on the previously predicted values (y_{i-1}, y_{i-2}, …). It directly learns to avoid repeating characters or words, so no further post-processing such as CTC compression is needed.
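To make the idea concrete, here is a minimal, hedged PyTorch sketch of an LSTM encoder with a dot-product-attention decoder that predicts the next token conditioned on the previous ones. The class name TinySeq2SeqASR and all sizes (80 filterbank features, a vocabulary of 32, hidden size 256) are illustrative assumptions, not a production recipe:

# Hedged sketch of an attention-based encoder-decoder for speech-to-text.
import torch
import torch.nn as nn

class TinySeq2SeqASR(nn.Module):
    def __init__(self, n_feats=80, vocab_size=32, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True)  # reads acoustic frames
        self.embed = nn.Embedding(vocab_size, hidden)              # embeds the previous token y_{i-1}
        self.decoder = nn.LSTMCell(hidden * 2, hidden)             # conditions on [y_{i-1}; context]
        self.attn = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, prev_tokens):
        """frames: (B, T, n_feats); prev_tokens: (B, U) shifted targets (teacher forcing)."""
        enc, _ = self.encoder(frames)                              # (B, T, hidden)
        B, U = prev_tokens.shape
        h = c = frames.new_zeros(B, enc.size(-1))
        logits = []
        for u in range(U):
            # Dot-product attention over encoder states, queried by the decoder state.
            scores = torch.bmm(enc, self.attn(h).unsqueeze(-1)).squeeze(-1)       # (B, T)
            context = torch.bmm(scores.softmax(-1).unsqueeze(1), enc).squeeze(1)  # (B, hidden)
            h, c = self.decoder(torch.cat([self.embed(prev_tokens[:, u]), context], dim=-1), (h, c))
            logits.append(self.out(h))                             # predicts the next token
        return torch.stack(logits, dim=1)                          # (B, U, vocab_size)

model = TinySeq2SeqASR()
frames = torch.randn(2, 120, 80)              # two utterances of 120 frames each
prev_tokens = torch.randint(0, 32, (2, 10))   # previous tokens fed in during training
print(model(frames, prev_tokens).shape)       # torch.Size([2, 10, 32])

In a real setup, the logits would be scored with cross-entropy against the target sequence shifted by one position (next-token prediction), and at inference time the decoder would feed back its own previous prediction instead of the ground-truth token.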
The RNN Transducer and attention-based models above can be a one-stop solution for the speech-to-text problem without worrying about alignment. However, their vast number of parameters requires substantial computational resources. With the rise of Large Language Models (LLMs), some researchers have also taken a similar approach to build pre-trained models (via Self-Supervised Learning) for multiple downstream tasks beyond speech-to-text.
Remark
All in all, there are three approaches a Data Scientist could use to build a model for this problem: the HMM-based model, the CTC approach, and the encoder-decoder model. The choice of strategy should be based on user needs and the computational budget. For a feature that recognizes a particular phrase such as “Hello Jarvis”, an HMM-DNN model with the Viterbi algorithm would be sufficient, as it is lighter than the CTC and encoder-decoder approaches.
References
- HMM and WFST lectures by Kaldi: https://danielpovey.com/files/Lecture4.pdf
- Mohri, M., Pereira, F., Riley, M. (2008). Speech Recognition with Weighted Finite-State Transducers. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_28
- Hannun, “Sequence Modeling with CTC”, Distill, 2017. https://distill.pub/2017/ctc/
- Prabhavalkar, R., Rao, K., Sainath, T.N., Li, B., Johnson, L., Jaitly, N. (2017). A Comparison of Sequence-to-Sequence Models for Speech Recognition. Proc. Interspeech 2017, 939-943. doi: 10.21437/Interspeech.2017-233
- Automatic Speech Recognition (ASR) course by the School of Informatics, The University of Edinburgh.