2026-05-18 · data · code

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

Key claim

ElevenLabs Scribe v2 achieves the best performance among the five commercial ASR systems benchmarked.

This study benchmarks five commercial ASR systems on code-switching speech involving Arabic, Persian, and German. The key finding is that ElevenLabs Scribe v2 outperforms the other systems, with the lowest WER and the highest BERTScore, and that transcription quality differs substantially across systems.

Novelty

7.5/10

The paper provides a new benchmark for evaluating ASR in code-switching contexts.

Reliability

8.0/10

The methodology includes a rigorous evaluation of multiple ASR systems with clear metrics.

Deep reliability assessment

The methodology supports the claim that BERTScore is a more reliable metric than WER for evaluating ASR systems on code-switching speech, particularly for Arabic and Persian. However, the assertion that WER systematically overstates performance differences may not hold in all contexts, especially with different language pairs.
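To make the WER side of this comparison concrete, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. It assumes plain whitespace tokenization and no text normalization; the function name `word_error_rate` and the example sentences are illustrative and are not taken from the paper, whose exact scoring pipeline is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumption: whitespace tokenization, no casing or punctuation
    normalization. Real ASR evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical pair: the meaning is preserved, but two word-level
    # edits (one substitution, one deletion) are still counted.
    print(word_error_rate("we will meet tomorrow", "we'll meet tomorrow"))
```

In this hypothetical pair the two sentences mean the same thing, yet half of the reference words count as errors. That is the kind of surface-form penalty an embedding-based metric such as BERTScore is intended to soften.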

Reproducibility

Yes. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.

Discussion questions

What assumptions are made about the generalizability of the benchmark results across different dialects and languages?
How can builders leverage these findings to improve ASR systems for multilingual environments?
What specific conditions or datasets would contradict the findings regarding the superiority of BERTScore over WER?

Key figure

Figure 1 shows the distribution of semantic topics across the 300 benchmark samples for each language pair, classified by GPT-4o using an inductively derived taxonomy.

Benchmark results

Egyptian Arabic–English | WER: 13.2 | -26.2% vs Google Chirp 3 | SOTA

Persian–English | BERTScore: 0.936 | +0.080 vs OpenAI gpt-4o-transcribe | SOTA

Code link

huggingface.co/datasets/Perle-ai/ASR_Code_Switch (Official)
