In the early 2010s, automatic speech recognition was largely based on classical models such as Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs). Around 2013–2015, the focus shifted strongly towards deep learning, mainly through the use of deep neural networks (DNNs) for acoustic modelling. This quickly led to significant leaps in accuracy in both laboratory environments and production systems.
In the mid-2010s, we saw the emergence of end-to-end systems that use approaches such as Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention to convert speech directly to text, without traditional components such as separate acoustic and language models.
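To make the CTC idea concrete, here is a minimal, illustrative PyTorch training sketch. The toy LSTM encoder, vocabulary size, feature dimension, and batch shapes are assumptions for the example, not a description of any specific system; the key point is that the loss marginalises over all alignments between frame-level predictions and a shorter character sequence, so no frame-level labels are needed.

```python
# Minimal sketch of CTC-based end-to-end training in PyTorch (illustrative only;
# the encoder, vocabulary, and shapes below are assumptions, not a real system).
import torch
import torch.nn as nn

VOCAB_SIZE = 29      # assumed: 26 letters + space + apostrophe + CTC blank (index 0)
NUM_MEL_BINS = 80    # assumed log-mel feature dimension

# Toy acoustic encoder: in practice this would be a deep CNN/RNN/transformer stack.
encoder = nn.LSTM(input_size=NUM_MEL_BINS, hidden_size=256, num_layers=2,
                  batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * 256, VOCAB_SIZE)   # per-frame label scores
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: 4 utterances of 200 feature frames, each with 30 character targets.
features = torch.randn(4, 200, NUM_MEL_BINS)
input_lengths = torch.full((4,), 200, dtype=torch.long)
targets = torch.randint(low=1, high=VOCAB_SIZE, size=(4, 30))  # blank excluded
target_lengths = torch.full((4,), 30, dtype=torch.long)

hidden, _ = encoder(features)                         # (batch, time, 2*hidden)
log_probs = classifier(hidden).log_softmax(dim=-1)
# CTCLoss expects (time, batch, vocab) and sums over all valid alignments
# between the frame-level predictions and the shorter target sequence.
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```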
Around 2018–2020, the introduction of transformers and self-supervised representation learning played a key role. Methods such as Facebook/Meta's Wav2Vec 2.0 learned representations directly from raw, unlabelled audio, so that only small amounts of labelled data were needed for fine-tuning; this greatly improved performance, especially for low-resource languages and domains.
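As a hedged illustration of how such a pretrained model can be used in practice, the sketch below loads a publicly available Wav2Vec 2.0 checkpoint fine-tuned with CTC via the Hugging Face transformers library and performs greedy decoding. The audio file name is a placeholder; the model assumes 16 kHz mono input.

```python
# Sketch: transcription with a pretrained Wav2Vec 2.0 + CTC checkpoint.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("example.wav")            # placeholder 16 kHz mono file
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits          # per-frame character scores

predicted_ids = torch.argmax(logits, dim=-1)            # greedy CTC decoding
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```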
In the early 2020s, the trend towards large, general-purpose models continued. ‘Conformer’ architectures combined convolution modules with transformer self-attention to effectively model both local patterns and long-range dependencies in the audio signal.
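The following is a simplified, illustrative PyTorch sketch of a single Conformer block under stated simplifications (it omits relative positional encoding and other training details from the published architecture): self-attention handles long-range context, while the depthwise convolution module handles local patterns between neighbouring frames.

```python
# Simplified Conformer block (illustrative sketch, not the reference implementation).
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, conv_kernel=31, dropout=0.1):
        super().__init__()
        # Two "macaron" feed-forward modules, each added with a 0.5 residual weight.
        self.ff1 = self._feed_forward(dim, dropout)
        self.ff2 = self._feed_forward(dim, dropout)
        # Self-attention captures long-range dependencies across the utterance.
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # The convolution module captures local patterns between neighbouring frames.
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, kernel_size=1),            # pointwise expansion
            nn.GLU(dim=1),                                     # gated linear unit
            nn.Conv1d(dim, dim, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=dim),   # depthwise convolution
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=1),                # pointwise projection
            nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(dim)

    @staticmethod
    def _feed_forward(dim, dropout):
        return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(),
                             nn.Dropout(dropout), nn.Linear(4 * dim, dim), nn.Dropout(dropout))

    def forward(self, x):                                      # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)
        attn_in = self.attn_norm(x)
        x = x + self.attn(attn_in, attn_in, attn_in, need_weights=False)[0]
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

block = ConformerBlock()
frames = torch.randn(2, 100, 256)   # (batch, time, feature) dummy input
print(block(frames).shape)          # torch.Size([2, 100, 256])
```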
Finally, in 2022, OpenAI released Whisper, an end-to-end speech recognition model trained on vast amounts of audio-text pairs and robust to noise, accents and different languages. Whisper delivers high-quality transcriptions without intensive pre-processing and is a representative example of the current state of ASR: a single universal model that is widely applicable and relatively easy to use.
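To show how little code such a universal model requires, here is a short example using the open-source openai-whisper package (installed with pip install openai-whisper); the audio file name and checkpoint size are placeholders, and larger checkpoints trade speed for accuracy.

```python
# Transcribing an audio file with the open-source Whisper package.
import whisper

model = whisper.load_model("base")                  # downloads the checkpoint on first use
result = model.transcribe("meeting_recording.mp3")  # language detection and decoding are automatic
print(result["text"])
```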