Minimum effort adaptation of automatic speech recognition system in air traffic management

Mrinmoy Bhattacharjee; Petr Motlicek; Srikanth Madikeri; Hartmut Helmke; Oliver Ohneiser; Matthias Kleinert; Heiko Ehr

doi:10.59490/ejtir.2024.24.4.7531

Authors

Mrinmoy Bhattacharjee (Idiap Research Institute, Switzerland); Petr Motlicek (Idiap Research Institute); Srikanth Madikeri (Idiap Research Institute); Hartmut Helmke (Institute of Flight Guidance, German Aerospace Center (DLR) Braunschweig, Germany); Oliver Ohneiser (Institute of Flight Guidance, German Aerospace Center (DLR) Braunschweig, Germany); Matthias Kleinert (Institute of Flight Guidance, German Aerospace Center (DLR) Braunschweig, Germany); Heiko Ehr (Institute of Flight Guidance, German Aerospace Center (DLR) Braunschweig, Germany)

Keywords:

Speech Recognition, Model Adaptation, Integration of prior knowledge, Customization of model, Rare-word integration

Abstract

Advancements in Automatic Speech Recognition (ASR) technology are exemplified by ubiquitous voice assistants such as Siri and Alexa. Researchers have been exploring the application of ASR to Air Traffic Management (ATM) systems. Initial prototypes used ASR to pre-fill aircraft radar labels and reached the technology readiness level that precedes industrialization (TRL6). However, accurately recognizing infrequently used but highly informative domain-specific vocabulary remains an issue. This vocabulary includes waypoint names specific to each airspace region and unique airline designators, e.g., “dexon” or “pobeda”. Traditionally, open-source ASR toolkits or large pre-trained models require substantial domain-specific transcribed speech data to adapt to specialized vocabularies. Typically, however, a “universal” ASR engine capable of reliably recognizing a core dictionary of several hundred frequently used words suffices for ATM applications. The challenge lies in dynamically integrating the additional region-specific words that are used less frequently. These uncommon words are crucial for maintaining clear communication within the ATM environment. This paper proposes a novel approach that facilitates the dynamic integration of these new and specific word entities into the existing universal ASR system. This paves the way for “plug-and-play” customization with minimal expert intervention and eliminates the need for extensive fine-tuning of the universal ASR model. The proposed approach improves the recognition accuracy of these region-specific words by a factor of ≈7 (from 10% F1-score to 70%) for all rare words and ≈5 (from 13% F1-score to 64%) for waypoints.
