As cars become an indispensable part of daily life, a secure and comfortable driving environment is increasingly desirable. Traditional touch-based interaction in the cockpit easily distracts the driver's attention, leading to inefficient operation and potential safety risks. Therefore, speech-based interaction, as a natural form of human-computer interaction, has attracted growing attention.
In-car speech-based interaction aims to provide a seamless driving and cabin experience for drivers and passengers through various speech processing applications, such as speech recognition for command control, entertainment, or navigation. Unlike automatic speech recognition (ASR) systems commonly deployed in home or meeting scenarios, systems in the driving scene face unique challenges. First, the acoustic environment of the cockpit is complex. Since the cockpit is a closed and irregular space, it has a distinctive room impulse response (RIR), resulting in unusual reverberation conditions. In addition, various kinds of noise arise during driving from both inside and outside the vehicle, such as wind, engine, and wheel noise, background music, and interfering speakers. Finally, different driving situations also affect system performance, such as parking, low-speed or high-speed driving, whether the driver-side window and sunroof are open, and daytime versus nighttime driving. Therefore, how to leverage recent advances in speech front-end processing and robust speech recognition to improve the robustness of in-car ASR systems is an essential research question worth investigating.
In the last decade, several robust ASR challenges have been held to explore potential solutions for speech recognition in real noisy environments. The CHiME challenge series targets distant-microphone conversational speech recognition in dinner-party scenarios. The MISP challenge introduced an additional video modality to develop speech applications with better environmental and speaker robustness in the home TV scenario. The recent ICASSP2022 M2MeT challenge focused on the meeting transcription scenario, covering diverse meeting rooms, various numbers of participants, different overlap ratios, and noises. The ICSR challenge provided a 20-hour collection of single-channel audio without speaker overlap, recorded in a hybrid electric vehicle with a Hi-Fi microphone placed on the display screen, offering a novel evaluation set for in-car speech recognition. Nevertheless, there is still a lack of a public testbed and sizable open data, collected in a pure electric new energy vehicle with multiple microphones placed at different positions within the vehicle and speakers wearing headphones, to benchmark in-car speech processing technologies, considering the unique characteristics of in-car scenes mentioned above as well as speaker overlap.
To benefit the research community and accelerate research on speech processing in the driving scenario, we launch the ICASSP2024 In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), dedicated to speech recognition under complex driving conditions. Unlike previous challenges, the ICMC-ASR dataset comprises an extensive collection of 1000 hours of real-world recorded, multi-channel, multi-speaker, in-car conversational Mandarin speech. We carefully designed the data to cover conditions as diverse as possible and provide two competition tracks targeting in-car multi-speaker chatting scenarios. Additionally, over 400 hours of in-car recorded far-field noise audio will be available for participants to explore data simulation technologies. Researchers from both academia and industry are warmly welcomed to participate and jointly investigate solid solutions to such challenging scenarios.
The challenge organizers will invite the best five submissions to submit a 2-page paper and present it at ICASSP2024 (accepted papers will appear in the ICASSP proceedings; the review process is coordinated by the challenge organizers and the SPGC chairs). All 2-page proceedings papers should be covered by an ICASSP registration and presented in person at the conference. Teams that present their work at ICASSP in person are also invited to submit a full paper about their work to OJ-SP. A challenge special session will be held during the ICASSP2024 conference in Seoul, Korea, 14-19 April 2024. The session will include an overview presentation from the challenge organizers (including the announcement of the winners), followed by the paper presentations (oral or poster) of the top-5 participants, and a panel or open discussion.
The ICMC-ASR challenge comprises two distinct tracks:
Note that, to ensure the authenticity of the results submitted in the ASDR track, two distinct evaluation sets will be prepared for the two tracks.
For Track I, the accuracy of the ASR system is measured by character error rate (CER). The CER indicates the percentage of characters that are incorrectly predicted. Given a hypothesis output, it is computed from the minimum number of character insertions (Ins), substitutions (Subs), and deletions (Del) required to obtain the reference transcript.
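As an illustrative sketch (not the official scoring script), the CER can be computed with a standard Levenshtein edit distance over characters:

```python
# Illustrative CER computation via dynamic-programming edit distance.
def cer(ref: str, hyp: str) -> float:
    """CER = (Ins + Subs + Del) / number of reference characters."""
    m, n = len(ref), len(hyp)
    # dp[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[m][n] / m

# One deletion plus one substitution against a 6-character reference:
print(cer("今天天气很好", "今天气真好"))  # → 0.3333...
```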
For Track II, we adopt the concatenated minimum-permutation character error rate (cpCER) as the metric for ASDR systems. Computing cpCER involves three steps. First, concatenate the recognition results and reference transcriptions belonging to the same speaker along the timeline within a session. Second, compute the CER for every permutation of speakers. Last, select the lowest CER as the cpCER of that session.
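The three steps above can be sketched as follows. This is a brute-force illustration, not the official scoring tool; for sessions with many speakers, the exhaustive permutation search is typically replaced by an assignment algorithm such as the Hungarian method.

```python
# Hedged sketch of cpCER for one session: try every speaker permutation
# and keep the assignment with the lowest concatenated CER.
from itertools import permutations

def edit_distance(ref: str, hyp: str) -> int:
    # Plain Levenshtein distance with a rolling row.
    n = len(hyp)
    dp = list(range(n + 1))
    for i, rc in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(prev + (rc != hc), dp[j] + 1, dp[j - 1] + 1)
            prev = cur
    return dp[n]

def cp_cer(refs: list[str], hyps: list[str]) -> float:
    """refs[k]: speaker k's reference, concatenated along the timeline.
    hyps: recognized speaker streams with unknown correspondence to refs."""
    total_chars = sum(len(r) for r in refs)
    best_edits = min(
        sum(edit_distance(r, hyps[i]) for r, i in zip(refs, perm))
        for perm in permutations(range(len(hyps))))
    return best_edits / total_chars
```

For example, `cp_cer(["abc", "def"], ["def", "abc"])` is 0 because the swapped speaker assignment matches the references perfectly.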
Figure 1: The Li Auto vehicle and microphones used for data collection
The dataset utilized in this challenge was collected in a hybrid electric vehicle (depicted in Fig. 1) with speakers sitting in different positions, including the driver seat and passenger seats. The total number of speakers is over 160, and all of them are native Chinese speakers who speak Mandarin without strong accents. To comprehensively capture the speech signals of the entire cockpit, two types of recording devices are used: far-field and near-field devices. Fig. 2 (a) illustrates the setup for the far-field devices. Eight distributed microphones are placed at the four seats in the car: the driver seat (DS01C01, DX01C01), the passenger seat (DS02C01, DX02C01), the rear right seat (DS03C01, DX03C01), and the rear left seat (DS04C01, DX04C01). Additionally, two linear microphone arrays, each consisting of two microphones, are placed on the display screen (DL01C01, DL01C02) and at the center of the inner sunroof (DL02C01, DL02C02), respectively. All 12 channels of far-field audio are time-synchronized and included in the released dataset as far-field data. Fig. 2 (b) shows the setup for the near-field devices. For transcription purposes, each speaker wears a high-fidelity headphone to record near-field audio, denoted by the seat where the speaker is situated: DA01, DA02, DA03, and DA04 represent the driver seat, passenger seat, rear right seat, and rear left seat, respectively. The near-field data consist of single-channel recordings only. Additionally, a sizable real noise dataset is provided, following the recording setup of the far-field data but without any speaker talking, to facilitate research on in-car data simulation technology.
(a) The placement of far-field microphones
(b) The placement of near-field microphones
Figure 2: Diagram of the recording device placement
As realistic acoustic environments are complex and involve a variety of noise sources, we carefully designed the recording environments to ensure comprehensive coverage. Specifically, we vary the following factors:
By combining these factors, we obtain 60 different scenarios covering most in-car acoustic environments.
Overall, the challenge dataset contains over 1000 hours (by oracle segmentation) of in-car chatting data, divided into a training (Train) set, a development (Dev) set, and evaluation sets for Track I (Eval1) and Track II (Eval2). Each set includes far-field audio from all channels, but only the Train set contains near-field audio. In particular, oracle timestamps will be available for Eval1 but not for Eval2, requiring participants to use speaker diarization (SD) techniques for audio segmentation. The additional far-field noise data (Noise) amounts to about 400 hours.
The more than 160 speakers also provide balanced gender coverage.
Participants can obtain the datasets by clicking the button below and then signing the data user agreement. Once signed, the data download links will be sent automatically to your registration email.
All participants should adhere to the following rules to be eligible for the challenge.
Participants are encouraged to prioritize technological innovation, particularly the exploration of novel model architectures, rather than relying solely on increased data usage. This challenge is not merely a competition but a "scientific" challenge activity, in line with the rules of the CHiME challenge series.
Participants must sign up for an evaluation account, through which they can register for the evaluation and upload their submission and system description.
Registration emails must be sent from an official institutional or company email address (e.g., edu.cn); public email addresses (e.g., 163.com, qq.com, or gmail.com) are not accepted.
Once the account has been created, registration can be completed online. Registration is free for all individuals and institutions. Normally, registration takes effect immediately, but the organizers may review the registration information and ask participants to provide additional information for validation.
To sign up for an evaluation account, please click Quick Registration.
Participants can download the source code of the baseline systems from [here]
Please contact email@example.com if you have any queries.