Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
Irune Zubiaga, Aitor Soroa, Rodrigo Agerri
Key claim
Fine-tuning smaller models on in-domain data can match the performance of proprietary models.
In plain English
This paper explores strategies for building multilingual LLM judges for text evaluation, focusing on English, Spanish, and Basque. A key finding is that smaller models fine-tuned on in-domain data can match proprietary models, while larger models excel in zero-shot evaluation. The results offer practical guidance for building multilingual evaluation pipelines.
The paper presents a meaningful extension of LLMs to multilingual evaluation, particularly for low-resource languages.
The study includes systematic analysis and extends existing datasets, providing solid evidence for its claims.
Deep reliability assessment
The methodology supports comparative guidance about multilingual LLM-as-judge strategies across English, Spanish, and Basque under in-domain versus out-of-domain conditions. Claims about multilingual evaluation more broadly are somewhat overextended because the benchmarks and training/evaluation data are machine-translated and limited to three languages, with Basque resource scarcity constraining generalization.
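To make the comparison concrete, here is a minimal sketch of how a judge's agreement with benchmark labels could be broken down by language and by in-domain versus out-of-domain condition. It is an illustration only: the record fields (`lang`, `condition`, `judge`, `gold`), the function name, and the use of plain agreement rate are assumptions, not the paper's actual metric or the layout of the mJudge data.

```python
from collections import defaultdict


def agreement_by_group(records):
    """Return {(language, condition): fraction of judge verdicts matching the gold label}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["lang"], r["condition"])
        totals[key] += 1
        hits[key] += int(r["judge"] == r["gold"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy records invented for illustration; these are not results from the paper.
records = [
    {"lang": "en", "condition": "in_domain", "judge": "A", "gold": "A"},
    {"lang": "en", "condition": "in_domain", "judge": "B", "gold": "A"},
    {"lang": "es", "condition": "in_domain", "judge": "A", "gold": "A"},
    {"lang": "eu", "condition": "out_of_domain", "judge": "B", "gold": "B"},
]

scores = agreement_by_group(records)
for (lang, condition), rate in sorted(scores.items()):
    print(f"{lang:>2} {condition:<14} agreement={rate:.2f}")
```

Reporting the rate per (language, condition) pair rather than as one pooled number keeps a strong English score from masking a weaker Basque one, which is the generalization concern raised above.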
Reproducibility
Yes. The paper states that data and code are publicly available at hitz-zentroa/mJudge, and it extends two meta-evaluation datasets to Basque and Spanish.
Discussion questions
- 1. Is LLM-as-a-judge alignment with translated benchmark labels actually measuring multilingual evaluation ability, or mostly measuring robustness to translation artifacts and English-centric rubrics?
- 2. For builders in SEA deploying multilingual evaluators, when is it cheaper and safer to fine-tune a small local model versus using a larger proprietary or open-weight model zero-shot?
- 3. Would the main conclusions fail if evaluated on native, human-authored low-resource language data rather than machine-translated versions of English-origin benchmarks?
Key figure
No Figure 1 or architectural diagram is included in the provided excerpt; the key setup compares multilingual LLM-as-judge training/evaluation strategies across English, Spanish, and Basque with and without in-domain fine-tuning data.