Abstract
This study proposes a model designed to promptly detect and report emergency situations that may occur in single-person or elderly households. To this end, we modified the architecture proposed in Whisper Audio Tagging (Whisper-AT), which builds on the Whisper model, so that it can both classify emergency situations and predict when they occur. In addition, Whisper and the classification model were jointly fine-tuned on emergency-situation data for Automatic Speech Recognition (ASR). As a result, the proposed method achieved an accuracy of 97.70 % in classifying 16 types of emergency situations. Furthermore, compared to fine-tuning Whisper alone, incorporating emergency-situation classification into training improved ASR performance, reducing the Character Error Rate (CER) from 12.03 to 10.11. The proposed model can detect emergency situations with a low latency of only 4.2 s.
| Translated title of the contribution | Emergency situation detection and speech recognition enhancement utilizing Whisper |
|---|---|
| Original language | Korean |
| Pages (from-to) | 132-143 |
| Number of pages | 12 |
| Journal | Journal of the Acoustical Society of Korea |
| Volume | 44 |
| Issue number | 2 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Acoustic detection
- Deep learning
- Emergency situation
- Speech recognition