2026-05-27 | data | code

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

Key claim

Fine-tuning smaller models on in-domain data can match the performance of proprietary models.

This paper explores strategies for developing multilingual LLMs for text evaluation, focusing on English, Spanish, and Basque. A key finding is that fine-tuning smaller models with in-domain data can match proprietary models, while larger models excel in zero-shot evaluations. The results offer practical guidance for building multilingual evaluation pipelines.

Novelty

7.0/10

The paper presents a meaningful extension of LLM-as-a-judge methods to multilingual evaluation, particularly for low-resource languages such as Basque.

Reliability

8.0/10

The study includes systematic analysis and extends existing datasets, providing solid evidence for its claims.

Deep reliability assessment

The methodology supports comparative guidance about multilingual LLM-as-judge strategies across English, Spanish, and Basque under in-domain versus out-of-domain conditions. Claims about multilingual evaluation more broadly are somewhat overextended because the benchmarks and training/evaluation data are machine-translated and limited to three languages, with Basque resource scarcity constraining generalization.

Reproducibility

Yes. The paper states that data and code are publicly available at hitz-zentroa/mJudge, and it extends two meta-evaluation datasets to Basque and Spanish.

Discussion questions

1. Is LLM-as-a-judge alignment with translated benchmark labels actually measuring multilingual evaluation ability, or mostly measuring robustness to translation artifacts and English-centric rubrics?
2. For builders in SEA deploying multilingual evaluators, when is it cheaper and safer to fine-tune a small local model versus using a larger proprietary or open-weight model zero-shot?
3. Would the main conclusions fail if evaluated on native, human-authored low-resource language data rather than machine-translated versions of English-origin benchmarks?

Key figure

No Figure 1 or architectural diagram is included in the provided excerpt; the key setup compares multilingual LLM-as-judge training/evaluation strategies across English, Spanish, and Basque with and without in-domain fine-tuning data.
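
To make the comparison described above concrete, here is a minimal sketch of how judge verdicts could be scored against gold labels per language and per condition. It is an illustration under stated assumptions, not the paper's evaluation code: the record fields (`lang`, `condition`, `judge`, `gold`), the function name `agreement_by_group`, the choice of simple agreement rate as the metric, and the toy records are all hypothetical.

```python
from collections import defaultdict


def agreement_by_group(records):
    """Return {(lang, condition): share of judge verdicts that match gold}.

    Hypothetical helper: field names and the agreement metric are
    assumptions for illustration, not taken from the paper or mJudge.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["lang"], r["condition"])
        totals[key] += 1
        # Count a hit when the judge picks the same response as the gold label.
        hits[key] += int(r["judge"] == r["gold"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy, made-up records for illustration only (not results from the paper).
records = [
    {"lang": "en", "condition": "zero_shot", "judge": "A", "gold": "A"},
    {"lang": "en", "condition": "zero_shot", "judge": "B", "gold": "A"},
    {"lang": "eu", "condition": "zero_shot", "judge": "A", "gold": "B"},
    {"lang": "eu", "condition": "fine_tuned", "judge": "B", "gold": "B"},
    {"lang": "eu", "condition": "fine_tuned", "judge": "A", "gold": "A"},
]

scores = agreement_by_group(records)
for (lang, condition), score in sorted(scores.items()):
    print(f"{lang:>2} {condition:<10} agreement={score:.2f}")
```

Grouping by both language and condition keeps the in-domain fine-tuned and zero-shot settings separate, so a per-language gap is not averaged away.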

Code link

hitz-zentroa/mJudge (Official)