As cars become an indispensable part of daily life, a secure and comfortable driving environment is increasingly desirable. Traditional touch-based interaction in the cockpit easily distracts the driver's attention, leading to inefficient operation and potential safety risks. Therefore, speech-based interaction, as a natural form of human-computer interaction, has attracted growing attention.
In-car speech-based interaction aims to provide a seamless driving and cabin experience for drivers and passengers through various speech processing applications, such as speech recognition for command control, entertainment, or navigation. Unlike automatic speech recognition (ASR) systems commonly deployed in home or meeting scenarios, systems in the driving scene face unique challenges. First, the acoustic environment of the cockpit is complex. Since the cockpit is a closed and irregular space, it has a distinctive room impulse response (RIR), resulting in unusual reverberation conditions. In addition, various kinds of noise arise during driving from both inside and outside the vehicle, such as wind, engine, and wheel noise, background music, and interfering speakers. Finally, different driving situations also affect system performance, such as parking, low-speed or high-speed driving, whether the driver-side window and sunroof are open, and daytime versus nighttime driving. Therefore, how to leverage recent advances in speech front-end processing and robust speech recognition to improve the robustness of in-car ASR systems is an essential research question worth investigating.
In the last decade, several robust ASR challenges have been held to explore potential solutions for speech recognition in real noisy environments. The CHiME challenge series targets distant-microphone conversational speech recognition in dinner-party scenarios. The MISP challenge introduced an additional video modality to develop speech applications with better environmental and speaker robustness in the home TV scenario. The recent ICASSP2022 M2MeT challenge focused on the meeting transcription scenario, covering diverse meeting rooms, various numbers of participants, different overlap ratios, and noises. The ICSR challenge provided a 20-hour collection of single-channel audio without speaker overlap, recorded in a hybrid electric vehicle with a Hi-Fi microphone placed on the display screen, offering a novel evaluation set for in-car speech recognition. Nevertheless, there is still a lack of a public testbed and sizable open data, collected in a pure electric new energy vehicle with multiple microphones placed at different positions within the vehicle and speakers wearing headphones, to benchmark in-car speech processing technologies, considering the unique characteristics of in-car scenes mentioned above as well as speaker overlap.
To benefit the research community and accelerate research on speech processing in the driving scenario, we launch the ICASSP2024 In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), dedicated to speech recognition under complex driving conditions. Unlike previous challenges, the ICMC-ASR dataset comprises an extensive collection of 1000 hours of real-world recorded, multi-channel, multi-speaker, in-car conversational Mandarin speech. We carefully designed the data to cover conditions as diverse as possible and provide two competition tracks targeting in-car multi-speaker chatting scenarios. Additionally, over 400 hours of in-car recorded far-field noise audio will be available for participants to explore data simulation technologies. Researchers from both academia and industry are warmly welcomed to participate and jointly investigate solid solutions to such challenging scenarios.
The challenge organizers will invite the best five submissions to submit a 2-page paper and present it at ICASSP2024 (accepted papers will appear in the ICASSP proceedings; the review process is coordinated by the challenge organizers and the SPGC chairs). All 2-page proceedings papers should be covered by an ICASSP registration and presented in person at the conference. Teams that present their work at ICASSP in person are also invited to submit a full paper about their work to OJ-SP. A challenge special session will be held during the ICASSP2024 conference in Seoul, Korea, 14-19 April 2024. The session will include an overview presentation from the challenge organizers (including the announcement of the winners), followed by the paper presentations (oral or poster) of the top-5 participants, and a panel or open discussion.
The ICMC-ASR challenge comprises two distinct tracks:
Note that, to ensure the authenticity of the results submitted in the ASDR track, two distinct evaluation sets will be prepared for the two tracks.
For Track I, the accuracy of the ASR system is measured by character error rate (CER). The CER indicates the percentage of characters that are incorrectly predicted. Given a hypothesis output, it is computed from the minimum number of character insertions (Ins), substitutions (Subs), and deletions (Del) required to obtain the reference transcript.
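As an illustrative sketch (not the official scoring script), the CER can be computed with a standard Levenshtein edit distance over characters:

```python
# Illustrative CER computation via dynamic-programming edit distance.
def cer(ref: str, hyp: str) -> float:
    """CER = (Ins + Subs + Del) / number of reference characters."""
    m, n = len(ref), len(hyp)
    # dp[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[m][n] / m

# One deletion plus one substitution against a 6-character reference:
print(cer("今天天气很好", "今天气真好"))  # → 0.3333...
```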
For Track II, we adopt the concatenated minimum-permutation character error rate (cpCER) as the metric for ASDR systems. Computing cpCER involves three steps. First, concatenate the recognition results and reference transcriptions belonging to the same speaker along the timeline within a session. Second, compute the CER for every permutation of speakers. Last, select the lowest CER as the cpCER of that session.
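The three steps above can be sketched as follows. This is a brute-force illustration, not the official scoring tool; for sessions with many speakers, the exhaustive permutation search is typically replaced by an assignment algorithm such as the Hungarian method.

```python
# Hedged sketch of cpCER for one session: try every speaker permutation
# and keep the assignment with the lowest concatenated CER.
from itertools import permutations

def edit_distance(ref: str, hyp: str) -> int:
    # Plain Levenshtein distance with a rolling row.
    n = len(hyp)
    dp = list(range(n + 1))
    for i, rc in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(prev + (rc != hc), dp[j] + 1, dp[j - 1] + 1)
            prev = cur
    return dp[n]

def cp_cer(refs: list[str], hyps: list[str]) -> float:
    """refs[k]: speaker k's reference, concatenated along the timeline.
    hyps: recognized speaker streams with unknown correspondence to refs."""
    total_chars = sum(len(r) for r in refs)
    best_edits = min(
        sum(edit_distance(r, hyps[i]) for r, i in zip(refs, perm))
        for perm in permutations(range(len(hyps))))
    return best_edits / total_chars
```

For example, `cp_cer(["abc", "def"], ["def", "abc"])` is 0 because the swapped speaker assignment matches the references perfectly.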
Figure 1: The Li Auto vehicle and microphones used for data collection
The dataset utilized in this challenge was collected in a hybrid electric vehicle (depicted in Fig. 1) with speakers sitting in different positions, including the driver seat and passenger seats. The total number of speakers is over 160, and all of them are native Chinese speakers who speak Mandarin without strong accents. To comprehensively capture the speech signals of the entire cockpit, two types of recording devices are used: far-field and near-field devices. Fig. 2 (a) illustrates the setup for the far-field devices. Eight distributed microphones are placed at the four seats in the car: the driver seat (DS01C01, DX01C01), the passenger seat (DS02C01, DX02C01), the rear right seat (DS03C01, DX03C01), and the rear left seat (DS04C01, DX04C01). Additionally, two linear microphone arrays, each consisting of two microphones, are placed on the display screen (DL01C01, DL01C02) and at the center of the inner sunroof (DL02C01, DL02C02), respectively. All 12 channels of far-field audio are time-synchronized and included in the released dataset as far-field data. Fig. 2 (b) shows the setup for the near-field devices. For transcription purposes, each speaker wears a high-fidelity headphone to record near-field audio, denoted by the seat where the speaker is situated: DA01, DA02, DA03, and DA04 represent the driver seat, passenger seat, rear right seat, and rear left seat, respectively. The near-field data consist of single-channel recordings only. Additionally, a sizable real noise dataset is provided, following the recording setup of the far-field data but without any speaker talking, to facilitate research on in-car data simulation technology.
(a) The placement of far-field microphones
(b) The placement of near-field microphones
Figure 2: Diagram of the recording device placement
As realistic acoustic environments are complex and involve a variety of noise sources, we carefully designed the recording environments to ensure comprehensive coverage. Specifically, we vary the following factors:
By combining these factors, we obtain 60 different scenarios covering most in-car acoustic environments.
Overall, the challenge dataset contains over 1000 hours (by oracle segmentation) of in-car chatting data, divided into a training (Train) set, a development (Dev) set, and evaluation sets for Track I (Eval1) and Track II (Eval2). Each set includes far-field audio from all channels, but only the Train set contains near-field audio. In particular, oracle timestamps will be available for Eval1 but not for Eval2, requiring participants to use speaker diarization (SD) techniques for audio segmentation. The additional far-field noise data (Noise) amounts to about 400 hours.
The more than 160 speakers also provide balanced gender coverage.
Participants can obtain the datasets by clicking the button below and then signing the data user agreement. Once signed, the data download links will be sent automatically to your registration email.
All participants should adhere to the following rules to be eligible for the challenge.
Participants are encouraged to prioritize technological innovation, particularly the exploration of novel model architectures, rather than relying solely on increased data usage. This challenge is not merely a competition but a "scientific" challenge activity, in line with the rules of the CHiME challenge series.
Participants must sign up for an evaluation account, through which they can register for the evaluation and upload their submission and system description.
Registration emails must be sent from an official institutional or company email address (e.g., edu.cn); public email addresses (e.g., 163.com, qq.com, or gmail.com) are not accepted.
Once the account has been created, registration can be completed online. Registration is free for all individuals and institutions. Normally, registration takes effect immediately, but the organizers may review the registration information and ask participants to provide additional information for validation.
To sign up for an evaluation account, please click Quick Registration.
Participants can download the source code of the baseline systems from [here]
Please contact email@example.com if you have any queries.