2026-05-25 | agents, reasoning, code

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

Liyun Zhang, Jiayi Guo

Key claim

Meaning-bearing (semantic) perturbations make LLM agents' answers significantly less consistent than presentation (surface) perturbations do.

In plain English

This paper measures how different types of input perturbation affect the reasoning of large language model agents. It finds that meaning-bearing perturbations produce more inconsistent answers than presentation perturbations do. The result could inform future model training and evaluation strategies.

Novelty
7.5/10

The paper presents a significant finding regarding the impact of semantic perturbations on reasoning in large language models, extending understanding in this area.

Reliability
8.0/10

The study is supported by extensive empirical data across multiple models and tasks, with rigorous statistical analysis and released code for validation.

Deep reliability assessment

The methodology supports the existence of a significant inconsistency gap between meaning-bearing and presentation perturbations in LLM agents, but claims of causal relationships and generalizability across architectures may be overstated.

Reproducibility

Yes, the paper mentions that all code, the perturbation corpus, and raw trajectories are released for review at https://anonymous.4open.science/r/agentdiff-emnlp-0BB4/.

Discussion questions

  1. What assumptions about the nature of semantic versus surface noise are being made, and how might they be challenged?
  2. How can builders apply these findings to improve the robustness of LLMs in real-world applications?
  3. What experimental conditions or results would lead to a re-evaluation of the reported inconsistency gap?

Key figure

Figure 1 shows the per-cell, severity-matched gap in inconsistency rate across 68 cells. The mean gap is +19.69 percentage points and is reported as highly statistically significant.
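To make the quantity in Figure 1 concrete, here is a minimal sketch of how a per-cell gap and its mean could be computed. It assumes that the inconsistency rate is the fraction of items whose answer changes under perturbation, and that the gap is the meaning-bearing rate minus the presentation rate. The paper's exact definitions and its severity-matching procedure are not given in this summary. The function names and the toy data are hypothetical and are not results from the study.

```python
from statistics import mean


def inconsistency_rate(baseline_answers, perturbed_answers):
    """Fraction of items whose perturbed answer differs from the baseline answer.

    Assumed definition for illustration; the paper may define it differently.
    """
    if not baseline_answers:
        raise ValueError("need at least one item")
    if len(baseline_answers) != len(perturbed_answers):
        raise ValueError("answer lists must have the same length")
    changed = sum(b != p for b, p in zip(baseline_answers, perturbed_answers))
    return changed / len(baseline_answers)


def cell_gap_pp(baseline, semantic, surface):
    """Gap for one cell, in percentage points: semantic rate minus surface rate."""
    return 100.0 * (
        inconsistency_rate(baseline, semantic) - inconsistency_rate(baseline, surface)
    )


def mean_gap_pp(cells):
    """Unweighted mean of the per-cell gaps."""
    return mean(cell_gap_pp(*cell) for cell in cells)


# Toy data only (hypothetical): each cell is (baseline, semantic, surface) answers.
toy_cells = [
    (["a", "b", "c", "d"], ["a", "x", "y", "d"], ["a", "b", "c", "z"]),
    (["a", "b", "c", "d"], ["q", "b", "r", "s"], ["a", "b", "c", "z"]),
]

if __name__ == "__main__":
    for i, cell in enumerate(toy_cells):
        print(f"cell {i}: gap = {cell_gap_pp(*cell):+.2f} pp")
    print(f"mean gap = {mean_gap_pp(toy_cells):+.2f} pp")
```

A positive gap means the meaning-bearing perturbation changed more answers than the presentation perturbation did in that cell. The released repository is the authoritative source for how the reported +19.69 figure was actually computed.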

Code link
anonymous.4open.science/r/agentdiff-emnlp-0BB4 (Official)