2026-05-27 data

Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen

Key claim

Current LLMs struggle with Malay discourse particles.

This paper introduces MALAYPRAG, a benchmark for assessing LLMs' handling of discourse particles in colloquial Malay. The findings indicate that current LLMs struggle with these particles, but the proposed attributes significantly enhance their performance. This highlights the importance of structured approaches to improve LLMs' pragmatic understanding.

In plain English

Novelty

7.5/10

The introduction of a benchmark for evaluating LLMs on discourse particles in Malay represents a meaningful extension of existing research.

Reliability

7.0/10

The study provides experimental results and a structured framework, though it may lack extensive baselines.

Deep reliability assessment

The methodology supports a narrow benchmark-style claim: in zero-shot closed-label prompting on 187 annotated colloquial Malay utterances, current LLMs struggle with Malay discourse-particle pragmatics, and explicit linguistic attributes improve pragmatic-function prediction. It overclaims if read as a broad statement about LLM pragmatic competence, because the dataset is small, text-only, limited to mainly kan and ke, and does not test fine-tuning, multimodal/prosodic inputs, or real dialogue use.

Reproducibility

Partial. The paper says MALAYPRAG is accessible via a link and provides prompt templates in an appendix, but no repository URL or exact dataset URL is visible in the supplied text; no open-source code release is mentioned.

Discussion questions

1. Does decomposing pragmatic function into five discrete attributes actually model human pragmatic understanding, or does it mainly make the classification task easier by exposing labels that are close to the answer?
2. For SEA builders, should Malay/Singlish/Indonesian chatbots use explicit pragmatic scaffolds like these at inference time, or is the more practical path to collect conversational fine-tuning data with particles, prosody, and speaker context?
3. What result would falsify the paper's conclusion: for example, would strong performance from a Malay-specialized model on unseen particles, dialects, and audio-rich dialogue without attribute hints show that the observed failure is benchmark-specific rather than a general pragmatic gap?

Key figure

Figure 1 presents a five-dimensional annotation schema for Malay discourse-particle utterances, labeling each utterance by Epistemic Stance, Listener Agreement, Emotion, Question Type, and Particle Position.
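The schema in Figure 1 can be pictured as a simple record type. The following is a minimal Python sketch, not the authors' code: the class name ParticleAnnotation, the helper attribute_hint, and the angle-bracket placeholder values are all hypothetical, since the supplied text names the five attributes but not their label sets or the paper's prompt templates.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ParticleAnnotation:
    """One annotated utterance; fields mirror the five attributes named in Figure 1."""
    utterance: str
    epistemic_stance: str
    listener_agreement: str
    emotion: str
    question_type: str
    particle_position: str


def attribute_hint(ann: ParticleAnnotation) -> str:
    """Render the five attributes as 'name: value' lines (hypothetical hint format)."""
    fields = asdict(ann)      # keeps the declared field order
    fields.pop("utterance")   # only the attributes go into the hint
    return "\n".join(
        f"{name.replace('_', ' ')}: {value}" for name, value in fields.items()
    )


# Placeholder values only: the real label sets are not given in the supplied text.
example = ParticleAnnotation(
    utterance="<utterance>",
    epistemic_stance="<stance>",
    listener_agreement="<agreement>",
    emotion="<emotion>",
    question_type="<question type>",
    particle_position="<position>",
)

print(attribute_hint(example))
```

A record like this keeps the attribute names in one place, so the same object could feed both an attribute-classification task and an attribute-augmented prompt for pragmatic-function prediction.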

Benchmark results

MALAYPRAG average classification accuracy across five attributes: 0.752 (vs GPT-5: +0.010)

MALAYPRAG average classification accuracy across five attributes: 0.724 (vs Gemini 3.1 Flash: +0.023)

MALAYPRAG classification accuracy over seven pragmatic-function labels: 0.529 (vs GPT-5 direct prompting: +0.230)
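For readers unsure what "average classification accuracy across five attributes" may mean, here is a minimal Python sketch. It assumes the figure is an unweighted mean of per-attribute accuracies, which the supplied text does not confirm. The function names and the toy labels are hypothetical and are not drawn from MALAYPRAG.

```python
from typing import Dict, List, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty label list")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mean_attribute_accuracy(
    gold_by_attr: Dict[str, List[str]],
    pred_by_attr: Dict[str, List[str]],
) -> float:
    """Unweighted mean of per-attribute accuracies (an assumed aggregation)."""
    scores = [accuracy(gold_by_attr[a], pred_by_attr[a]) for a in gold_by_attr]
    return sum(scores) / len(scores)


# Toy labels for two attributes, invented purely to exercise the functions.
toy_gold = {"emotion": ["x", "y"], "question type": ["x", "x"]}
toy_pred = {"emotion": ["x", "x"], "question type": ["x", "x"]}

print(mean_attribute_accuracy(toy_gold, toy_pred))  # (0.5 + 1.0) / 2 = 0.75
```

With only 187 utterances in the benchmark, each utterance contributes roughly half a percentage point to a single-attribute accuracy, which is worth keeping in mind when reading differences of 0.010 or 0.023.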