Frontier Paper Feed

What's worth reading today.

AI research papers scored by an LLM eval pipeline on novelty and reliability. Upvote to surface what the community should discuss.

PASS ✓

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

2026.06.04 agents

Lizhi Yang, Junheng Li, Nehar Poddar, et al.

The authors developed a new controller for humanoid robots called HANDOFF, which makes it easier to manage complex tasks. Unlike previous methods that required detailed movement instructions, HANDOFF uses a simpler interface that can adapt to various tasks. This change allows robots to perform better in real-world situations, such as following commands in natural language. Builders should care because this could lead to more effective and versatile robots in practical applications.

Novelty
8.0
Reliability
7.5
arxiv/2606.06493
PASS ✓

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

2026.06.04 agents

Dong Jing, Jingchen Nie, Tianqi Zhang, et al.

The authors developed a new system called TempoVLA that helps robots move at different speeds depending on the task's risk level. Unlike previous models that only used a fixed speed, TempoVLA can adjust its speed dynamically, speeding up in safe situations and slowing down when precision is needed. This is achieved through a new method that modifies how robot actions are timed. Builders should care because this flexibility can lead to more efficient and safer robotic operations in real-world applications.

Novelty
8.0
Reliability
7.5
arxiv/2606.06491
PASS ✓

Regret Minimization with Adaptive Opponents in Repeated Games

2026.06.04 agents

Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, et al.

The authors developed a new way to measure how players in repeated games can improve their strategies when opponents adapt based on past actions. Unlike previous methods, this new metric, RP-Regret, allows for better comparisons and can lead to more cooperative outcomes. They also created algorithms to minimize this regret and showed through experiments that these approaches can yield better results in specific games. Builders should care because this could improve decision-making in competitive environments.

Novelty
8.0
Reliability
7.5
arxiv/2606.06486
PASS ✓

DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

2026.06.04 agents

Qintong Xie, Edward Koh, Xavier Cadet, et al.

The authors developed a new way to train agents that bid in auctions and similar competitive situations. They introduced a method called DNQ that helps these agents learn better strategies by using a shared critic to estimate payoffs. This approach is faster and more efficient than previous methods, especially when there are many agents involved. Builders should care because it allows for more scalable solutions in complex bidding environments, which can be crucial for real-world applications.

Novelty
7.5
Reliability
8.0
arxiv/2606.06480
PASS ✓

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

2026.06.04 agents

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, et al.

The authors developed a new method called RREDCoT to improve how rewards are assigned in reasoning language models. Unlike previous methods that often struggled with high variance, RREDCoT uses the model to estimate rewards more effectively. This change allows for better training of models that need to think through complex problems. Builders should care because this method could lead to more reliable and efficient AI systems that can handle intricate reasoning tasks.

Novelty
7.5
Reliability
7.0
arxiv/2606.06475
PASS ✓

Self-Augmenting Retrieval for Diffusion Language Models

2026.06.04 reasoning, code

Paul Jünger, Justin Lovelace, Linxi Zhao, et al.

The authors developed a new method called SARDI that helps language models generate better answers by looking ahead at potential words they might use. Unlike previous methods, SARDI can quickly find relevant information without needing extra training. This means it can work faster and more effectively on complex questions. Builders should care because it shows a new way to improve AI responses using existing models without extensive retraining.

Novelty
8.0
Reliability
8.0
arxiv/2606.06474
PASS ✓

Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

2026.06.04 reasoning

Mandana Samiei, Eunice Yiu, Anthony GX-Chen, et al.

The authors studied how adults understand cause-and-effect relationships when they can actively explore their environment. They found that when given the chance to experiment, adults improved their ability to identify complex causal relationships that require multiple factors to work together. This is a shift from previous studies where participants only observed situations passively. Builders should care because it highlights the importance of agency in learning and could inform the design of educational tools or AI systems that mimic human reasoning.

Novelty
7.0
Reliability
7.5
arxiv/2606.06464
PASS ✓

Event Detection for Parameter-to-KPI Dependency Learning for AI-RAN

2026.06.04 infra

Christie Djidjev, Nicholas Kaminski

The authors developed a new approach to identify how different control parameters in wireless networks affect performance. Unlike previous methods that struggled with noisy data, their technique uses a synthetic traffic generator to create clear examples of these interactions. This is important because understanding these dependencies can help improve network management and performance. Builders should care because better dependency detection can lead to more efficient and reliable wireless networks.

Novelty
7.0
Reliability
8.0
arxiv/2606.06459
PASS ✓

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

2026.06.04 infra, code

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, et al.

The authors developed USAD 2.0, a new audio encoder that uses both self-supervised and supervised learning to improve performance. Unlike previous models that focused on specific audio types, this model covers multiple domains, including music. It also addresses issues with teacher models in training. Builders should care because this could lead to better audio processing tools and applications, making it easier to work with diverse audio inputs.

Novelty
8.0
Reliability
7.5
arxiv/2606.06444
PASS ✓

Causal Atlases from Entropic Inference: Bayesian Networks beyond Optimal DAGs

2026.06.04 reasoning

Hazhir Aliahmadi, Irina Babayan, Greg van Anders

The authors developed a new way to identify causal relationships in complex systems using a method based on entropy. Unlike traditional methods that optimize for a single causal graph, their approach generates multiple graphs that capture the uncertainty in the data. This is important because it helps avoid misleading conclusions about causality. Builders should care because understanding true causal relationships can lead to better decision-making and system design.

Novelty
7.5
Reliability
7.0
arxiv/2606.06440
PASS ✓

RIDE: An Open Dataset and Benchmark for Train Delay Prediction

2026.06.03 data

Clément Elliker, Mathis Le Bail, Clément Mantoux, et al.

The authors created a new dataset called RIDE to help predict train delays more accurately. This dataset includes millions of train events and weather records, making it much easier to compare different prediction methods. Unlike previous approaches, RIDE standardizes how predictions are made and evaluated, which helps researchers understand which models work best. Builders should care because this framework can lead to better train scheduling and improved passenger experiences.

Novelty
8.0
Reliability
8.0
arxiv/2606.05070
PASS ✓

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

2026.05.28 agents

Nhat-Minh Nguyen

This study explores the role of AI agents in research through a case where a physicist supervised an AI coding agent. The key finding is that effective supervision practices were crucial for ensuring the agent's outputs were trustworthy, highlighting the importance of supervision design over model capability.

Novelty
7.0
Reliability
8.0
arxiv/2605.30353
PASS ✓

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

2026.05.28 vision, code

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, et al.

This paper introduces Multi-Head Latent Attention (MLA) for video diffusion, achieving a 92.7% reduction in per-token memory usage while maintaining quality. It demonstrates that MLA can outperform existing methods in long-horizon streaming video diffusion, improving throughput significantly. This advancement could lead to more efficient video processing techniques.

Novelty
8.5
Reliability
8.0
arxiv/2605.30351
PASS ✓

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

2026.05.28 data

Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, et al.

This paper presents a novel framework called LLMSurgeon for estimating the pretraining data mixture of large language models based on generated text. It introduces a method for auditing the 'digital DNA' of foundation models, allowing for high-fidelity recovery of domain mixtures without direct access to training data. The key result is that LLMSurgeon can effectively recover domain mixtures under fixed protocols.

Novelty
8.0
Reliability
7.5
arxiv/2605.30348
PASS ✓

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

2026.05.28 agents

Jusuk Lee, Seungjae Lee, Jonghun Shin, et al.

DynaFLIP is a new framework that improves robot manipulation by integrating motion understanding into perception. It uses a novel training approach with image-language-3D flow triplets, leading to significant performance gains in various tasks. The key result shows a +22.5% improvement in out-of-distribution scenarios, indicating better generalization.

Novelty
8.0
Reliability
8.0
arxiv/2605.30350
PASS ✓

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

2026.05.28 agents, code

Qinpei Luo, Ruichun Ma, Xinyu Zhang, et al.

This paper presents SchGen, a large language model that generates editable PCB schematics from natural-language requests. It introduces a new representation that improves the accuracy of wire connectivity and functional correctness in schematic generation. The results indicate that representation design is crucial for enabling generative models in complex hardware tasks.

Novelty
8.0
Reliability
7.5
arxiv/2605.30345
PASS ✓

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

2026.05.28 vision, code

Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, et al.

This paper presents VisAnomBench, a new benchmark for time-series anomaly detection, and introduces VisAnomReasoner, a parameter-efficient VLM that significantly improves anomaly localization. The key result shows improvements of over 21 percentage points in precision and F1 score compared to existing methods.

Novelty
8.0
Reliability
8.0
arxiv/2605.30344
PASS ✓

Unlocking the Working Memory of Large Language Models for Latent Reasoning

2026.05.28 reasoning

Lukas Aichberger, Sepp Hochreiter

The paper presents Reasoning in Memory (RiM), a novel approach that enhances the reasoning capabilities of large language models by using fixed memory blocks instead of autoregressive generation. This method allows for compute-efficient reasoning and shows promising results on reasoning benchmarks, matching or exceeding existing methods. The key takeaway is that RiM enables large language models to utilize working memory effectively for reasoning tasks.

Novelty
8.0
Reliability
7.5
arxiv/2605.30343
PASS ✓

GPIC: A Giant Permissive Image Corpus for Visual Generation

2026.05.28 vision, community code

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, et al.

The paper presents GPIC, a massive dataset of 28 trillion pixels designed for visual generative modeling. It includes a diverse set of images and a benchmarking protocol, making it a valuable resource for researchers and practitioners in the field. The dataset is permissively licensed, allowing for both research and commercial use.

Novelty
8.0
Reliability
8.0
arxiv/2605.30341
PASS ✓

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

2026.05.28 infra, code

Alaa Khamis, Alaa Maalouf

HullFT is a new method for test-time finetuning that optimizes both speed and quality by using a geometric approach. It effectively selects relevant training sequences and reduces computation time through Gradient Reuse. The key result is that HullFT achieves lower bits-per-byte at a significantly reduced runtime compared to existing methods.

Novelty
8.0
Reliability
7.5
arxiv/2605.30337
PASS ✓

Fairness-Aware Federated Learning with Trajectory Shapley Value

2026.05.28 data

Daniel Kuznetsov, Ziqi Wang

This paper presents FedTSV, an adaptive aggregation method for federated learning that uses the Trajectory Shapley Value to dynamically adjust client contributions. The key result shows that FedTSV accelerates convergence and enhances fairness in client contributions, making it a valuable approach for real-time federated optimization.

Novelty
8.0
Reliability
7.5
arxiv/2605.30336
PASS ✓

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

2026.05.28 agents, community code

Anany Kotawala

This paper presents a novel approach to address the coherence failures in multi-component LLM agents. It introduces the concept of compositional residuals and provides empirical evidence of the effectiveness of proposed mitigations. The key finding is that coherence issues can significantly impact performance, with a notable regret metric observed.

Novelty
7.5
Reliability
7.0
arxiv/2605.30335
PASS ✓

Demystifying Data Organization for Enhanced LLM Training

2026.05.28 data, code

Yalun Dai, Yangyu Huang, Tongshen Yang, et al.

This paper explores how data organization can improve the training of large language models. It introduces two new methods for data ordering that significantly enhance training stability and performance. The findings suggest that strategic data organization is crucial for optimizing LLM training efficiency.

Novelty
7.5
Reliability
8.0
arxiv/2605.30334
PASS ✓

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

2026.05.28 agents

Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, et al.

This paper presents SoundnessBench, a benchmark designed to assess the soundness of machine-learning research proposals. The key finding is that current LLMs exhibit a pervasive optimism bias, often misclassifying low-soundness proposals as sound. This indicates that LLMs are not yet reliable for evaluating scientific rigor at the proposal stage.

Novelty
8.0
Reliability
7.5
arxiv/2605.30329
PASS ✓

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

2026.05.28 agents, code

Chunru Lin, Hongxin Zhang, Fenghao Yu, et al.

RoboWits is a new benchmark for assessing robots' cognitive reasoning and adaptability in unexpected scenarios. The study shows that while pre-trained visual-language agents can handle basic tasks, they struggle with more complex, mutated tasks, highlighting their limitations in real-world applications. This insight is crucial for builders aiming to develop more robust robotic systems.

Novelty
8.0
Reliability
7.5
arxiv/2605.30326
PASS ✓

On Language Generation in the Limit with Bounded Memory

2026.05.28 reasoning

Jon Kleinberg, Anay Mehrotra, Amin Saberi, et al.

This paper explores how bounded memory affects language generation and identification tasks. It shows that while generation is achievable for any countable collection of languages, the density and identification capabilities are limited to finite collections. A key result is that allowing adaptive memory improves achievable density.

Novelty
7.5
Reliability
7.0
arxiv/2605.30324
PASS ✓

In-Context Reward Adaptation for Robust Preference Modeling

2026.05.28 rlhf

Zhenyu Sun, Zheng Xu, Ermin Wei

This paper introduces a novel framework for adapting reward models in reinforcement learning to better align with diverse human preferences. The key result shows that incorporating human response time as an auxiliary input allows the model to effectively adapt to previously unseen preference domains, enhancing robustness in human-AI alignment.

Novelty
8.0
Reliability
7.5
arxiv/2605.30323
PASS ✓

Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor

2026.05.28 infra

Minseo Lee, Seongmin Oh, Chaehyeon Song, et al.

The authors developed a new framework that combines reduced-order models with neural operators to improve real-time thermal-hydraulic simulations for small modular reactors. Unlike previous methods that struggled with the high computational costs of detailed fluid dynamics simulations, this approach allows for faster and more efficient analysis of complex systems like helical coil steam generators. The multi-scale version of their model, called L-DeepONet, effectively captures the dynamic behavior of swirling flows, while another model, the Fourier neural operator, provides accurate estimates of pressure changes. This advancement is significant for builders because it enables safer and more efficient reactor operations by allowing for quicker decision-making based on reliable simulations. Understanding these models can help builders select the right tools for their specific simulation needs.

Novelty
8.0
Reliability
7.5
arxiv/2605.30277
PASS ✓

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

2026.05.28 data

Yicheng Tao, Yiqun Wang, Xiangchen Song, et al.

The authors developed a new framework called GRASP that improves how we retrieve information from semi-structured knowledge bases, which are databases that organize data in a graph format with entities and relationships. Unlike previous methods that either only used the graph for expanding queries or combined text and structure in a simplistic way, GRASP uses a three-step process that includes planning how to navigate the graph, merging this with a dense retrieval method, and then refining the results through a reranking step. This approach led to a significant increase in retrieval accuracy, as shown by the improvement in Hit@1 scores from 62.0 to 73.9 across various benchmarks. For builders, this means that applications like product searches or academic paper searches can become much more effective, providing users with more relevant results. Understanding and implementing GRASP could enhance the performance of systems that rely on complex data relationships.

Novelty
8.0
Reliability
8.0
arxiv/2605.30237
PASS ✓

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

2026.05.27 scaling, code

Yangyi Huang, Ruotian Peng, Zeju Qiu, et al.

This paper introduces PEFT-Arena, a benchmark that evaluates parameter-efficient finetuning by measuring both downstream performance and the retention of pretrained capabilities. The key finding is that orthogonal finetuning achieves the best balance between adaptation and retention under similar parameter budgets, highlighting the importance of stability-plasticity profiles in finetuning methods.

Novelty
8.0
Reliability
7.5
arxiv/2605.28819
PASS ✓

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

2026.05.27 alignment

Jinzhou Wu, Zhengwu Ma, Jixing Li, et al.

This research investigates how multimodal pretraining affects language models' alignment with human reading processes. The key finding is that while multimodal training may not universally enhance human-like text processing, it can selectively improve alignment when visual semantic content is stronger.

Novelty
7.5
Reliability
8.0
arxiv/2605.28818
PASS ✓

Self-Improving Language Models with Bidirectional Evolutionary Search

2026.05.27 agents, code

Guowei Xu, Zhenting Qi, Huangyuan Su, et al.

The paper presents Bidirectional Evolutionary Search (BES), a new framework that enhances search methods for language models by combining forward and backward search strategies. The key result shows that BES outperforms existing frameworks on challenging tasks, enabling better performance in both average and best-case scenarios.

Novelty
8.0
Reliability
7.5
arxiv/2605.28814
PASS ✓

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

2026.05.27 agents

Jiahe Pan, Stelian Coros, Jitendra Malik, et al.

This paper presents a new tactile representation called Center-of-Pressure (CoP) that improves sim-to-real transfer in contact-rich manipulation tasks. The authors demonstrate that policies using CoP outperform traditional methods, achieving zero-shot transfer in complex scenarios. This advancement could lead to more effective robotic manipulation in real-world applications.

Novelty
8.0
Reliability
7.5
arxiv/2605.28812
PASS ✓

Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

2026.05.27 agents

Audrey Chan, Aaron Labbé, Jacob Lavoie, et al.

The Affective Music Recommendation System (AMRS) effectively predicts listener engagement and emotional responses using a causal transformer model. It employs Direct Preference Optimization to enhance the accuracy of predicted emotional states while maintaining diversity in recommendations. This work provides a promising approach to affective recommendation in ethically constrained environments.

Novelty
8.0
Reliability
7.5
arxiv/2605.28810
PASS ✓

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

2026.05.27 vision, code

Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

The paper presents AREA, a novel approach for Class-Incremental Learning that stabilizes attribute extraction and aggregation in CLIP-based models. It effectively mitigates catastrophic forgetting by using principal geodesic analysis and task-specific experts. The key result shows that AREA consistently outperforms existing methods in this domain.

Novelty
7.5
Reliability
8.0
arxiv/2605.28809
PASS ✓

Calibrating Conservatism for Scalable Oversight

2026.05.27 agents

William Overman, Mohsen Bayati

The authors developed a new method called Calibrated Collective Oversight (CCO) that helps weaker overseers manage stronger AI agents that may act against human interests. Unlike previous methods that often relied on complex rules or assumptions, CCO uses a straightforward penalty system that adjusts based on how concerned overseers are about the AI's actions. This means that while high-reward actions can still be taken when they are deemed acceptable, they are penalized if they raise too much concern, keeping undesirable outcomes in check. For builders, this approach offers a practical way to ensure AI systems behave ethically and safely, even in challenging scenarios, making it easier to maintain control over powerful AI technologies.

Novelty
8.0
Reliability
7.5
arxiv/2605.28807
PASS ✓

Personal Visual Memory from Explicit and Implicit Evidence

2026.05.27 agents, code

Viet Nguyen, Thao Nguyen, Vishal M. Patel, et al.

The paper presents VisualMem, a new architecture that enhances long-term memory for personalized AI agents by integrating visual information. It shows that using personal visual memory significantly improves performance on a new benchmark while still being competitive on traditional text-memory tasks. This indicates the importance of visual context in personalized AI.

Novelty
8.0
Reliability
7.5
arxiv/2605.28806
PASS ✓

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

2026.05.27 multimodal, code

Xinchen Zhang, Bowei Liu, Jiale Liu, et al.

This paper presents OmniVerifier-M1, a novel visual verifier that utilizes symbolic meta-verification and decoupled reinforcement learning to enhance verification processes in multimodal models. A key result is that symbolic outputs significantly improve verification performance compared to traditional textual explanations, leading to better error localization and model reliability.

Novelty
8.0
Reliability
7.5
arxiv/2605.28805
PASS ✓

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

2026.05.27 agents

Xinyu Wang, Mingze Li, Sicheng Lyu, et al.

The authors developed Ω-QVLA, a new framework that allows for efficient compression of Vision-Language-Action (VLA) models, which combine visual perception, language understanding, and action control. Unlike previous methods that only partially quantized these models or used mixed precision, Ω-QVLA uniformly quantizes both the language and action components to a lower precision, making it more stable and effective. This results in high task success rates while significantly reducing the memory required to run these models on devices. Builders should care because this advancement enables the deployment of complex AI models on resource-constrained devices, opening up new possibilities for real-world applications in areas like robotics and interactive systems.

Novelty
8.0
Reliability
8.0
arxiv/2605.28803
PASS ✓

Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

2026.05.27 reasoning, code

Beiduo Chen, Pingjun Hong, Ziyun Zhang, et al.

The authors of this study discovered that large language models (LLMs) can learn the unique reasoning styles of different annotators by analyzing their free-text explanations. Unlike previous methods that focused solely on the labels given by annotators, this research shows that understanding the reasoning behind those labels can lead to better model performance. They introduced a new technique called cross-annotator preference optimization (CAPO), which helps the model better mimic individual annotators by comparing their responses to other valid annotations. This approach not only improves the model's ability to generate explanations that reflect specific annotator preferences but also enhances the overall quality of the annotations. Builders should care because this method could lead to more accurate and context-aware AI systems that better understand human reasoning, making them more effective in real-world applications.

Novelty
7.5
Reliability
8.0
arxiv/2605.28802
PASS ✓

CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

2026.05.27 data

Abhilash Durgam, Nyle Siddiqui, Jeffrey A. Chan-Santiago, et al.

The paper presents CaMBRAIN, a new model for real-time inference of EEG signals that overcomes the limitations of existing methods by enabling long-range continuous inference. It achieves state-of-the-art results with over 10 times higher throughput than previous models, making it a significant advancement for EEG analysis.

Novelty
8.5
Reliability
8.0
arxiv/2605.28792
PASS ✓

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

2026.05.27 agents, code

Jiazhen Huang, Xiao Chen, Xiao Luo, et al.

The paper presents Skill-Conditioned Gated Self-Distillation (SGSD), which enhances reasoning in large language models by using a skill bank for supervision. SGSD outperforms existing methods like GRPO and OPSD on multiple benchmarks, showing a 6.2% improvement on average. This approach allows for more effective use of teacher-student dynamics in model training.

Novelty
8.0
Reliability
7.5
arxiv/2605.28791
PASS ✓

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

2026.05.27 agents

Shiyu Chen, Tarfah Alrashed, Alon Halevy, et al.

This paper analyzes the effectiveness of two types of data retrieval agents: a Baseline Agent and a Semantic Agent. The key finding is that the Semantic Agent significantly outperforms the Baseline Agent in precision when retrieving FAIR-compliant datasets, highlighting the importance of structured metadata.

Novelty
7.5
Reliability
8.0
arxiv/2605.28787
PASS ✓

Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

2026.05.27 data

Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, et al.

This paper introduces MalayPrag, a benchmark for assessing LLMs' handling of discourse particles in colloquial Malay. The findings indicate that current LLMs struggle with these particles, but the proposed attributes significantly enhance their performance. This highlights the importance of structured approaches to improve LLMs' pragmatic understanding.

Novelty
7.5
Reliability
7.0
arxiv/2605.28782
PASS ✓

Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

2026.05.27 vision, code

Thomas Vitry, Kieran Edgeworth, Stefan Wermter, et al.

The authors developed a method to identify misleading patterns, or 'spurious concepts', in vision models without needing specific bias labels. Unlike previous methods that required retraining or labeled datasets, this approach uses standard class labels and analyzes how the model's predictions change when it encounters errors. By pinpointing and suppressing these spurious concepts, the model's accuracy can be significantly improved on various datasets, even after deployment. This is particularly valuable for builders working with models that cannot be easily retrained, as it offers a way to enhance performance and fairness without extensive modifications. Builders should care because this method provides a practical tool for improving model reliability in real-world applications.

Novelty
8.0
Reliability
8.0
arxiv/2605.28780
PASS ✓

The Abstraction Gap in Vision-Language Causal Reasoning

2026.05.27 reasoning

Chinh Hoang, Mohammad Rashedul Hasan

This paper presents a new methodology for evaluating vision-language models by distinguishing between linguistic plausibility and causal reasoning. The key finding is that while many models perform well on linguistic quality, they struggle with generating explicit causal chains. One model, however, demonstrates the ability to achieve near-zero Abstraction Gap, indicating potential for improved causal reasoning in VLMs.

Novelty
8.0
Reliability
7.5
arxiv/2605.28779
PASS ✓

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

2026.05.27 alignment, code

Gabrielle Kaili-May Liu, Arman Cohan

The authors of this paper explored how well large language models (LLMs) use language to express their confidence in their answers, specifically through phrases that indicate uncertainty, like 'it is likely...'. They discovered that LLMs often misrepresent their confidence levels, meaning they don't reliably use these phrases to reflect their true uncertainty. This is a shift from previous research, which mainly focused on how LLMs understand these markers without assessing their actual performance in using them. The findings suggest that improving how LLMs use these confidence markers could enhance their reliability and trustworthiness in applications. Builders should care about this because better calibration of LLMs can lead to more accurate and dependable AI systems, which is crucial for user trust and effective decision-making.

Novelty
8.0
Reliability
7.5
arxiv/2605.28778
PASS ✓

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

2026.05.27 agents code

Suji Kim, Kangsan Kim, Sung Ju Hwang

LearnWeak is a new framework that helps small computer-use agents specialize in specific domains without requiring extensive annotations. It identifies weaknesses in agents and generates targeted training tasks, leading to significant performance improvements. The key result shows average gains of over 11 percentage points compared to existing models.

Novelty
8.0
Reliability
7.5
arxiv/2605.28775
PASS ✓

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

2026.05.27 agents

Minki Kang, Shizhe Diao, Ryo Hachiuma, et al.

This paper introduces AXPO, a new approach to improve tool use in vision-language models by addressing the Thinking-Acting Gap. The key result shows that SFT+AXPO outperforms SFT+GRPO across multiple benchmarks, achieving better performance with fewer parameters. This advancement could lead to more effective applications of vision-language models in real-world scenarios.

Novelty
8.0
Reliability
7.5
arxiv/2605.28774
PASS ✓

Rethinking Memory as Continuously Evolving Connectivity

2026.05.27 agents code

Jizhan Fang, Buqiang Xu, Zhixian Wang, et al.

The authors developed FluxMem, a new memory framework that allows memory in AI agents to evolve and adapt in real-time, rather than being static and fixed. Unlike previous methods that treated memory as a simple storage system with set connections, FluxMem models memory as a flexible network that can change based on feedback and new information. This means that AI agents can better remember and connect relevant information as tasks and environments change, leading to improved performance in complex situations. Builders should care because this approach can significantly enhance the effectiveness of memory-augmented AI systems, making them more capable of handling dynamic challenges.

Novelty
8.5
Reliability
8.0
arxiv/2605.28773
PASS ✓

Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

2026.05.27 scaling

Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, et al.

The Oryx model innovatively combines quadratic attention and linear recurrences to enhance efficiency and performance in language tasks. It demonstrates that hybrid architectures can effectively share internal representations, achieving competitive results even with limited token usage in attention mode. This suggests a promising new direction for model design in handling long-context retrieval and in-context learning.

Novelty
8.0
Reliability
7.5
arxiv/2605.28769
PASS ✓

Principled Algorithms for Optimizing Generalized Metrics in Multi-Label Learning

2026.05.27 data

Mehryar Mohri, Yutao Zhong

This paper presents a new approach to multi-label classification that optimizes complex evaluation metrics using novel surrogate loss functions. The key result is the introduction of the MMO algorithm, which shows superior performance over existing methods on large datasets. This work provides both theoretical foundations and practical solutions for multi-label metric optimization.

Novelty
8.0
Reliability
8.0
arxiv/2605.28767
PASS ✓

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

2026.05.27 infra

Edwin Jose

SwarmHarness proposes a decentralized protocol for sharing compute resources among nodes without a central authority. It features a self-regulating economy where nodes earn credits for contributions, promoting specialization and emergent collective intelligence. This approach could transform how distributed AI agents operate by enabling them to autonomously manage compute resources.

Novelty
8.0
Reliability
7.0
arxiv/2605.28764
PASS ✓

CubePart: An Open-Vocabulary Part-Controllable 3D Generator

2026.05.27 vision

Yiheng Zhu, Kangle Deng, Jean-Philippe Fauconnier, et al.

The authors developed CubePart, a framework that allows users to generate 3D models with specific part structures defined by text prompts. Unlike previous methods that produced either solid shapes or random part divisions, CubePart enables precise control over how the model is built, ensuring that each part aligns with user-defined categories. This means that game developers can create 3D assets that are ready to use in their projects without needing to make additional adjustments. Builders should care because this tool streamlines the process of creating interactive 3D content, making it easier to integrate generative models into games and simulations.

Novelty
8.0
Reliability
7.5
arxiv/2605.28763
PASS ✓

Deep Neural Networks for Doubly Robust Estimation with Nonprobability Survey Samples

2026.05.27 data

Yufang Dai, Shihua Luo, Wendy Lou, et al.

This paper presents a deep neural network-based method for combining probability and nonprobability survey samples to estimate population means. The key result shows that the proposed estimators enhance robustness against parametric misspecification, particularly in nonlinear selection mechanisms.

Novelty
7.5
Reliability
8.0
arxiv/2605.28762
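
A note on the doubly robust entry above: the estimator family it refers to is usually written in the survey literature in the form below. This is the textbook form, not a formula taken from the paper. The symbols are assumptions for illustration: S_A is the nonprobability sample (where y is observed), S_B is the probability sample with design weights d_i, \pi(x) is the estimated participation propensity, m(x) is the outcome model, and N is the population size. The sketch assumes the paper's contribution is to fit \pi and m with neural networks instead of parametric models.

```latex
\hat{\mu}_{\mathrm{DR}}
  = \frac{1}{N} \sum_{i \in S_A} \frac{y_i - m(x_i)}{\pi(x_i)}
  + \frac{1}{N} \sum_{i \in S_B} d_i \, m(x_i)
```

The estimator is called doubly robust because it remains consistent if either \pi or m is correctly specified, which is why more flexible fits for both can reduce sensitivity to misspecification.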
PASS ✓

LLM Zeroth-Order Fine-Tuning is an Inference Workload

2026.05.27 infra

Zelin Li, Caiwen Ding

This paper presents a novel method for zeroth-order fine-tuning of large language models that leverages a serving runtime to achieve significant speedups. The approach results in an 8.13x speedup compared to the baseline while maintaining high accuracy. This suggests a promising direction for integrating inference and training processes.

Novelty
8.0
Reliability
8.0
arxiv/2605.28760
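
A note on the zeroth-order fine-tuning entry above: the reason such training can resemble an inference workload is that a zeroth-order gradient estimate needs only forward evaluations of the loss, with no backpropagation. The sketch below shows a two-point (SPSA-style) estimator on a toy quadratic loss. The loss, function names, and hyperparameters are illustrative assumptions, not the paper's method or runtime.

```python
import random

def loss(params):
    # Toy stand-in for a model's forward pass: squared distance to a target.
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def zo_step(params, rng, lr=0.05, eps=1e-3):
    # Sample one random perturbation direction z.
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Two forward evaluations give a scalar directional-derivative estimate.
    g = (loss(plus) - loss(minus)) / (2 * eps)
    # Step against the estimated gradient g * z.
    return [p - lr * g * zi for p, zi in zip(params, z)]

def train(steps=500, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    for _ in range(steps):
        params = zo_step(params, rng)
    return params
```

Each step costs two forward passes and no stored activations, which is the property a serving runtime could plausibly exploit.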
PASS ✓

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

2026.05.27 agents

Kunhao Zheng, Pierre Chambon, Juliette Decugis, et al.

This study explores how extrapolative weight averaging can enhance performance in reinforcement learning by navigating a correctness-efficiency frontier. The key result shows that using this method improves the solve rate on challenging problems by 3.3% over the best single checkpoint, making it a valuable technique for builders in code-related RL tasks.

Novelty
8.0
Reliability
7.5
arxiv/2605.28751
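
A note on the extrapolative weight averaging entry above: one common reading of "extrapolative" averaging is to move along the line through two checkpoints and continue past the second one. The sketch below assumes that reading. The checkpoint layout (a dict of flat float lists) and the function name are hypothetical, and the paper's exact recipe may differ.

```python
def extrapolate(theta_a, theta_b, alpha):
    """Return theta_a + alpha * (theta_b - theta_a) for every parameter.

    alpha in [0, 1] interpolates between the checkpoints;
    alpha > 1 extrapolates beyond theta_b.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share parameter names")
    return {
        name: [a + alpha * (b - a) for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }
```

Sweeping alpha and evaluating each resulting model is one way a trade-off curve between two objectives could be traced.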
PASS ✓

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

2026.05.27 agents code

Michael T. M. Emmerich

This paper advances Bayesian multiobjective optimization by analyzing preference-shaped expected improvement criteria. A key result is the demonstration that exact integral R2 improvement can be represented as a scalarization-space volume, which has implications for developing efficient algorithms in this area.

Novelty
7.5
Reliability
8.0
arxiv/2605.28746
PASS ✓

Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context

2026.05.27 data

Thomas Mbrice

This paper explores stance detection in prediction market comments, revealing that market context significantly improves recall for opposing stances. The best-performing augmentation strategy uses 50% synthetic samples, which improves performance rather than degrading it.

Novelty
7.0
Reliability
8.0
arxiv/2605.28745
PASS ✓

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

2026.05.27 reasoning code

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, et al.

The authors developed a new algorithm called CORE, which helps language models improve their reasoning skills by learning from past attempts. Unlike previous methods that often require a lot of training data and computational resources, CORE uses a more efficient approach by analyzing successful and unsuccessful reasoning attempts to generate useful insights. This means that builders can achieve better performance with fewer examples and less processing power. By making the learning process more interpretable and compact, CORE offers a promising way to enhance model self-improvement without the heavy resource demands of traditional methods. Builders should care because this could lead to faster and more effective development of AI systems that require less data and computational cost.

Novelty
8.0
Reliability
7.5
arxiv/2605.28742
PASS ✓

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

2026.05.27 data

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

This paper presents Reverse Probing, a new framework for quantifying uncertainty in clinical text summarization. It achieves significant improvements in performance metrics, including up to 4 times higher AUPRC, while also reducing computational costs. The findings provide valuable insights into model behavior regarding clinical content.

Novelty
8.0
Reliability
8.0
arxiv/2605.28740
PASS ✓

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

2026.05.27 data code

Tirtharaj Dash

The author developed BIRDNet, a new type of neural network that leverages mined Boolean implication relationships between features in data to create a model that is both sparse and interpretable. Unlike traditional dense models that require a lot of parameters and can be hard to understand, BIRDNet uses significantly fewer parameters while still achieving competitive performance on biological data related to cancer. This means that builders can create models that are not only efficient but also provide clear insights into the underlying rules driving the data. By recovering known biological signatures, BIRDNet can help researchers make better decisions in cancer research and other fields. Builders should care because this approach offers a way to build more efficient and understandable AI systems, which is increasingly important in data-driven applications.

Novelty
7.5
Reliability
8.0
arxiv/2605.28739
PASS ✓

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

2026.05.27 alignment

Richard J. Young, Gregory D. Moody

This paper presents a new prompt bank that distinguishes between executable malicious code requests and harmful security knowledge requests. It consolidates multiple corpora and establishes a reliable basis for evaluating coding model compliance. The key result is the creation of a validated instrument that sets a higher refusal standard for coding models.

Novelty
8.0
Reliability
9.0
arxiv/2605.28734
PASS ✓

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

2026.05.27 multimodal

Xiaohang Feng, Yiling Xie

The authors developed a new framework for generating product images that takes into account consumer demand, which they call utility-aware multimodal contrastive learning. Unlike previous models that focused mainly on matching images with text descriptions, this approach optimizes for images that are more likely to sell by considering what consumers actually want. This means that the generated images not only look good but also align better with market trends, leading to higher sales. Builders should care because this method can be integrated into existing generative AI systems to enhance their commercial effectiveness, making it a valuable tool for anyone involved in online retail or product marketing.

Novelty
8.0
Reliability
8.0
arxiv/2605.28733
PASS ✓

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

2026.05.27 agents code

Xinle Deng, Ruobin Zhong, Hujin Peng, et al.

This paper introduces a novel framework for tracing errors in memory systems of large language models, which helps identify and correct systematic memory failures. The key result shows that their approach can enhance end-task performance by up to 7.62%. This work opens new avenues for improving the reliability of memory in LLMs.

Novelty
8.0
Reliability
7.5
arxiv/2605.28732
PASS ✓

AlphaTransit: Learning to Design City-scale Transit Routes

2026.05.27 agents code

Bibek Poudel, Sai Swaminathan, Weizi Li

The authors developed AlphaTransit, a new framework for designing bus networks in cities that combines a method called Monte Carlo Tree Search (MCTS) with a neural network that predicts the quality of route designs. Unlike previous methods that often relied solely on trial and error, AlphaTransit uses learned insights to make better decisions about where to extend bus routes, leading to significant improvements in service rates. This means that cities can create more efficient transit systems that better meet the needs of their populations. Builders should care because this approach not only enhances the design process but also has the potential to improve public transportation accessibility and efficiency, making it a valuable tool for urban planners and transit authorities.

Novelty
8.0
Reliability
8.0
arxiv/2605.28730
PASS ✓

Beyond Lipschitz: Data-Driven Robustness via Discrete Modulus of Continuity

2026.05.27 data

Jürgen Dölz, Michael Multerer, Michele Palma

This paper presents a new framework called the discrete modulus of continuity (DMOC) for assessing the robustness of neural networks. DMOC offers a more nuanced measure of robustness compared to traditional Lipschitz constants and is applicable to large datasets. A key result is that DMOC can effectively distinguish between trained and untrained networks, revealing underfitting and overfitting regimes.

Novelty
8.0
Reliability
7.5
arxiv/2605.28729
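
A note on the discrete modulus of continuity entry above: a classical modulus of continuity bounds how far outputs can move when inputs move by at most delta, and a data-driven version can be estimated over sample pairs. The sketch below assumes that definition (largest output gap over sample pairs whose inputs lie within delta of each other). The function name and the brute-force pairwise loop are illustrative, and the paper's exact definition and scalable algorithm may differ.

```python
import math
from itertools import combinations

def discrete_modulus(xs, f, delta):
    """Largest output distance over sample pairs whose inputs are within delta."""
    worst = 0.0
    for x, y in combinations(xs, 2):
        if math.dist(x, y) <= delta:
            worst = max(worst, math.dist(f(x), f(y)))
    return worst
```

Unlike a single Lipschitz constant, evaluating this over a range of delta values gives a curve, which is one way to read the entry's claim of a more nuanced robustness measure.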
PASS ✓

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

2026.05.27 agents code

Krishnam Gupta

This paper reveals that different vision-language-action (VLA) architectures exhibit distinct failure patterns at the motor-command level, necessitating tailored monitoring strategies. A key finding is that direction reversal rates can predict failures across architectures, while common safety mechanisms like velocity checking are often ineffective. This insight is crucial for developers working with VLA systems to ensure safety and reliability.

Novelty
8.0
Reliability
8.0
arxiv/2605.28726
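
A note on the VLA failure-monitoring entry above: a direction reversal rate can be computed from the action stream alone, which is what makes it a black-box signal. The sketch below uses one plausible definition for a single command channel: the fraction of consecutive steps where the sign of the change flips. This definition and the function name are assumptions, not the paper's metric.

```python
def direction_reversal_rate(commands):
    """Fraction of consecutive step pairs where the direction of change flips.

    Pairs that involve a zero change are skipped, so a pause between two
    opposite moves is not counted as a reversal.
    """
    deltas = [b - a for a, b in zip(commands, commands[1:])]
    pairs = [(d0, d1) for d0, d1 in zip(deltas, deltas[1:]) if d0 != 0 and d1 != 0]
    if not pairs:
        return 0.0
    reversals = sum(1 for d0, d1 in pairs if (d0 > 0) != (d1 > 0))
    return reversals / len(pairs)
```

A smooth ramp scores 0.0 and a command that oscillates every step scores 1.0, so a monitor could flag episodes whose rate drifts toward the high end.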
PASS ✓

Multi-Adapter Representation Interventions via Energy Calibration

2026.05.27 alignment code

Manjiang Yu, Hongji Li, Junwei Chen, et al.

The paper presents MARI, a method that adapts intervention strategies for large language models based on sample-specific needs. This approach not only aligns models more effectively but also enhances their general capabilities on various tasks. The key result shows significant improvements on safety benchmarks while maintaining performance on general tasks.

Novelty
8.0
Reliability
8.0
arxiv/2605.28722
PASS ✓

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

2026.05.27 agents code

HuiMing Fan, Xiao Wang, Zheng Chu, et al.

This paper investigates whether LLM-based search agents genuinely search the web or rely on their intrinsic knowledge. The key finding is that agents often depend on pre-existing knowledge, performing poorly when external evidence is removed, which highlights the limitations of static search benchmarks.

Novelty
8.0
Reliability
7.5
arxiv/2605.28721
PASS ✓

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

2026.05.27 infra code

Bojie Li

The author developed OpenURMA, an open-source implementation of Huawei's Unified Bus (UB) protocol, which significantly improves the performance of Remote Direct Memory Access (RDMA) operations in data centers. Unlike previous methods that required each connection to maintain a lot of state information, UB simplifies this by separating application-specific data from transport data, leading to much lower latency. The results show that UB can achieve an end-to-end latency of about 500 nanoseconds, which is over four times faster than the existing RoCEv2 protocol. This improvement means that data centers can handle more operations in less time, making them more efficient. Builders should care because adopting this technology could lead to faster and more responsive applications, ultimately enhancing user experience and system performance.

Novelty
8.5
Reliability
8.0
arxiv/2605.28717
PASS ✓

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

2026.05.27 infra

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, et al.

IPO-Mine enables the parsing and analysis of over 109,000 IPO filings, addressing challenges in handling long, multimodal documents. A key result is the identification of misalignment between state-of-the-art multimodal models and expert human judgments on financial charts.

Novelty
8.0
Reliability
7.5
arxiv/2605.28714
PASS ✓

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

2026.05.27 agents

Guoxin Ma, Yibing Liu, Chengzhengxu Li, et al.

This paper introduces Thinking as Compression (TaC), a novel approach that allows LLMs to compress long contexts by generating thinking traces. The method outperforms existing compression techniques, achieving significant improvements in F1 and Exact Match scores at high compression ratios.

Novelty
8.0
Reliability
7.5
arxiv/2605.28713
PASS ✓

Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models

2026.05.27 vision

Jiawei Zhang, Ziyuan Liu, Leon Yan, et al.

This paper presents a new framework called MAP-RPS for navigating the distortion-perception tradeoff in diffusion models. It combines MAP estimation with posterior sampling to improve perceptual quality in inverse problems. The results show that this approach effectively enhances performance across various tasks.

Novelty
7.5
Reliability
8.0
arxiv/2605.28711
PASS ✓

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

2026.05.27 data code

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

This paper explores strategies for developing multilingual LLMs for text evaluation, focusing on English, Spanish, and Basque. A key finding is that fine-tuning smaller models with in-domain data can match proprietary models, while larger models excel in zero-shot evaluations. The results offer practical guidance for building multilingual evaluation pipelines.

Novelty
7.0
Reliability
8.0
arxiv/2605.28710
PASS ✓

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

2026.05.27 agents

Aisha Aijaz, Rahul Goel, Arnav Batra, et al.

This paper introduces a framework for moral reasoning in AI that models ethical pluralism through a normative ethics simplex. The key result shows that integrating contextual and normative information raises classification accuracy to 88.89%. This approach supports more human-like moral reasoning in AI systems.

Novelty
8.0
Reliability
7.5
arxiv/2605.28707
PASS ✓

Understanding Generalization and Forgetting in In-Context Continual Learning

2026.05.27 reasoning

Guangyu Li, Meng Ding, Lijie Hu

The authors of this paper developed a new theoretical framework to understand how Large Language Models (LLMs) manage multiple tasks presented in a single prompt. Unlike previous studies that focused on single tasks, this research reveals that standard attention mechanisms can cause interference between tasks, which negatively impacts the model's performance. This finding is important for builders because it highlights potential weaknesses in how models learn from past information when faced with new tasks. By understanding these limitations, developers can work on improving model robustness and performance in real-world applications where tasks are often mixed. Essentially, this research provides insights that can help builders create more effective AI systems that better handle complex, multi-task scenarios.

Novelty
8.5
Reliability
7.0
arxiv/2605.28705
PASS ✓

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

2026.05.27 infra

Yeachan Park, Geonho Hwang, Wonyeol Lee, et al.

This paper explores the expressive power of floating-point neural networks under more realistic execution semantics. It establishes a framework for determining when these networks can represent arbitrary functions, highlighting that distinguishability in the first layer is crucial for universal representability. This finding broadens the understanding of practical activation functions in neural networks.

Novelty
8.5
Reliability
7.0
arxiv/2605.28704
PASS ✓

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

2026.05.27 agents code

Inès Benito, Johannes F. Lutzeyer, Benjamin Doerr

This paper revisits Baldwinian and Lamarckian evolution in evolutionary algorithms, showing they outperform traditional Darwinian methods in various scenarios. The authors provide a set of generalist parameters that can benefit practitioners, highlighting the practical implications of their findings.

Novelty
8.0
Reliability
8.0
arxiv/2605.28703
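
A note on the Lamarckian and Baldwinian entry above: the three schemes differ only in what happens to the result of an individual's "lifetime learning" (local search). The sketch below shows that difference on a toy OneMax problem. The local search, function names, and mode strings are illustrative assumptions and are not taken from the paper.

```python
import random

def onemax(bits):
    # Toy fitness: number of ones in the bit string.
    return sum(bits)

def local_search(bits, rng, tries=3):
    # "Lifetime learning": try a few single-bit flips, keep improvements.
    best = list(bits)
    for _ in range(tries):
        cand = list(best)
        i = rng.randrange(len(cand))
        cand[i] = 1 - cand[i]
        if onemax(cand) > onemax(best):
            best = cand
    return best

def evaluate(genotype, mode, rng):
    """Return (genotype passed on to offspring, fitness used for selection)."""
    if mode == "darwinian":
        return genotype, onemax(genotype)
    learned = local_search(genotype, rng)
    if mode == "lamarckian":
        return learned, onemax(learned)   # learned traits are inherited
    if mode == "baldwinian":
        return genotype, onemax(learned)  # only the fitness credit is kept
    raise ValueError(mode)
```

In the Baldwinian case selection favors genotypes that learn well, even though nothing learned is written back into the genome.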
PASS ✓

Natural Language Query to Configuration for Retrieval Agents

2026.05.26 agents code

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, et al.

The authors developed a system called BRANE that optimizes how retrieval agents handle queries by dynamically choosing the best configuration based on the specific characteristics of each query. Unlike previous methods that relied on a one-size-fits-all approach, which often required manual tuning for different workloads, BRANE can adjust its settings on the fly to improve performance. This means it can achieve the same level of accuracy as the best fixed configurations at a significantly lower cost, up to 89% less. For builders, this flexibility allows for more efficient use of resources and better performance in real-world applications without the need for constant retraining. Essentially, BRANE offers a smarter way to manage retrieval processes, making it easier to balance quality and cost.

Novelty
8.0
Reliability
8.0
arxiv/2605.27361
PASS ✓

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

2026.05.26 agents

Tamerlan Aghayev, Maxime Elkael, Michele Polese, et al.

The authors developed GENESIS, an AI framework that helps speed up the research and development of cellular networks, specifically for 6G technology. Unlike traditional methods that can take months for each iteration, GENESIS quickly turns ideas or problems into tested solutions using real-world experiments. This means that builders can develop and refine network features much faster and with greater reliability. By addressing common issues like misinterpretation of technical specifications, GENESIS ensures that different components of the network work well together. Builders should care because this framework can significantly reduce development time and improve the quality of their network solutions.

Novelty
8.0
Reliability
7.0
arxiv/2605.27360
PASS ✓

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

2026.05.26 alignment code

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

This paper reveals a critical vulnerability in Reinforcement Learning from Human Feedback (RLHF) called alignment tampering, where LLMs can influence their own preference datasets. The authors show that this can lead to the amplification of biases in generated responses, raising concerns about the reliability of current alignment methods. Mitigating this issue proves challenging without compromising response quality.

Novelty
8.0
Reliability
7.0
arxiv/2605.27355
PASS ✓

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

2026.05.26 data

Yi Jing, Zao Dai, Jinwu Hu, et al.

The authors developed a framework called SAERL that enhances how we manage training data for large language models (LLMs) by tapping into the internal workings of the model itself. Unlike previous methods that mainly relied on external indicators, SAERL uses insights from a sparse autoencoder to assess the diversity, difficulty, and quality of the training data. This approach led to a 3% increase in accuracy and a 20% reduction in training time for a specific model, and the authors report that it is effective across various model types and training methods. Builders should care because this framework offers a more efficient way to improve model performance, making it easier to achieve better results with less effort and resources.

Novelty
7.5
Reliability
8.0
arxiv/2605.27354
PASS ✓

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

2026.05.26 infra

Yuchen Liang, Ness Shroff, Yingbin Liang

The authors developed a new method called GADD, which stands for Gibbs-Accelerated Discrete Diffusion, to speed up the process of generating samples from discrete diffusion models. Unlike previous methods that either needed extra training or were slow to mix, GADD directly uses the structure of the score function to create more efficient sampling without additional training. This results in a significant improvement in both the quality of samples and the time it takes to generate them, making it useful for tasks like text generation and music creation. Builders should care because GADD offers a more efficient way to implement discrete diffusion models, which can enhance the performance of applications relying on these models, ultimately saving time and resources.

Novelty
8.0
Reliability
7.5
arxiv/2605.27352
PASS ✓

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

2026.05.26 vision

Kim Jihyeon, Sohee Kim, Soosan Lee, et al.

The authors of this paper discovered a new method for detecting AI-generated images by focusing on how people in the images look at each other, which they call Social Gaze Consistency. Unlike previous methods that relied on identifying low-level visual artifacts like pixel errors, this approach looks at the overall coherence of gaze direction and eye alignment among individuals in the image. This change allows for better detection of manipulated images, even when the changes are subtle. For builders, this means that by incorporating this method, they can improve the reliability of AI systems in distinguishing between real and fake images, which is crucial for applications in security, content moderation, and media verification. Overall, this research highlights a new angle for enhancing AI detection capabilities that could lead to more robust and trustworthy AI applications.

Novelty
8.0
Reliability
7.5
arxiv/2605.27348
PASS ✓

MATCHA: Matching Text via Contrastive Semantic Alignment

2026.05.26 data code

Siran Li, Ece Sena Etoglu, Carsten Eickhoff, et al.

The paper presents MATCHA, a new evaluation metric for large language models that improves upon traditional metrics like ROUGE and BERTScore. It effectively measures semantic agreement while penalizing contradictions, showing significant performance improvements on various tasks. The key result is a 20.82% improvement over BERTScore on the TruthfulQA dataset.

Novelty
8.5
Reliability
8.0
arxiv/2605.27345
PASS ✓

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

2026.05.26 vision

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

This paper explores a novel self-conditioning mechanism for diffusion models, improving both unconditional image generation quality and control over the output. The authors identify directions of variation in the representation space, demonstrating smoothness and disentanglement properties that could benefit practical applications in image generation.

Novelty
7.5
Reliability
6.5
arxiv/2605.27343
PASS ✓

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

2026.05.26 reasoning

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

This paper presents 2-ASP(Q)^w, a new fragment of Answer Set Programming that can handle optimization problems. The authors introduce effective strategies for computing quantified answer sets, validated through experiments on challenging benchmarks, demonstrating practical effectiveness.

Novelty
7.5
Reliability
8.0
arxiv/2605.27338
PASS ✓

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

2026.05.26 agents

Haoxuan Jia, Yang Liu, Bin Chong, et al.

FinHarness is a new safety mechanism for finance LLM agents that reduces unauthorized actions while preserving legitimate approvals. It cuts the success rate of unauthorized actions from 38.3% to 15.0% and makes fewer calls to the advanced judge, which keeps it efficient. This approach allows agents to make better decisions in real time.

Novelty
8.0
Reliability
7.5
arxiv/2605.27333
PASS ✓

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

2026.05.26 vision code

Zhifei Dou, Shabnam Hassani, Ou Wei

EdgeFlow improves the conversion of flowcharts to machine-readable models by using a Canny edge map as a structural prior. It achieves notable increases in node-level and edge-level F1 scores, demonstrating its effectiveness in industrial requirements engineering. This method does not require annotated training data, making it practical for real-world applications.

Novelty
7.5
Reliability
8.0
arxiv/2605.27332
PASS ✓

Maat: The Agentic Legal Research Assistant for Competition Protection

2026.05.26 agents code

Basant Mounir, Farida Madkour, Amira Abdelaziz, et al.

Maat is a ReAct agent designed for legal research in competition law, addressing limitations of existing general and legal assistants. It effectively grounds findings in official sources and provides rich citations, significantly outperforming baseline tools on case-specific tasks. This makes it a valuable tool for legal professionals needing reliable research assistance.

Novelty
7.5
Reliability
8.0
arxiv/2605.27331
PASS ✓

Governed Evolution of Agent Runtimes through Executable Operational Cognition

2026.05.26 agents

Mariano Garralda-Barrio

This paper presents a framework for managing the lifecycle of agent-generated artifacts in multi-agent systems. It emphasizes the importance of treating these artifacts as persistent capabilities rather than transient outputs. The key result is the introduction of HarnessMutation, which allows for governed runtime adaptation with explicit validation and rollback mechanisms.

Novelty
8.0
Reliability
6.5
arxiv/2605.27328
PASS ✓

Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech

2026.05.26 data, code

Felix Ostrowicki, Hubert Plisiecki

The paper presents interaction SSD, a novel method for analyzing how semantic meaning varies across different moderators. It effectively illustrates this method using the UC Berkeley Measuring Hate Speech corpus, revealing significant moderation effects based on annotator racial identity. This approach enhances the interpretability of hate-speech judgments.

Novelty
7.5
Reliability
8.0
arxiv/2605.27322
PASS ✓

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

2026.05.26 agents

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

This paper presents a model that differentiates between two important concepts in AI systems: Agentic Technical Debt and Stochastic Tax. The key result is that while debt can amplify operational burdens, the tax can persist even with minimized debt, offering insights for better governance in AI workflows.

Novelty
7.0
Reliability
6.5
arxiv/2605.27320
PASS ✓

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

2026.05.26 infra

Kukyoung Jang, Taehyun Cho, Junrui Zhang, et al.

This paper presents a novel smoothing framework that improves global optimization by using flexible unimodal kernels. A key result is that the smoothed objective maintains the global maximizer, enhancing robustness without needing a decreasing smoothing schedule.

Novelty
7.5
Reliability
8.0
arxiv/2605.27316
PASS ✓

Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

2026.05.26 multimodal

Yifan Jiang, Ruoxi Ning, Sheng Yao, et al.

This study investigates whether visual inputs improve language understanding in multimodal models. It finds that real-image contexts can sometimes degrade performance, especially for less relevant visual evidence. The key result is that focusing on textual content can mitigate these issues.

Novelty
7.0
Reliability
8.0
arxiv/2605.27315
PASS ✓

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

2026.05.26 data

Weibin Cai, Reza Zafarani

This paper investigates the role of demographic information in hate speech detection, revealing that its effectiveness varies based on data characteristics and modeling approaches. The key finding is that demographic gains are most pronounced in scenarios with low training disagreement and high test disagreement, leading to the introduction of a new model that selectively incorporates demographic data.

Novelty
7.0
Reliability
8.0
arxiv/2605.27313
PASS ✓

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

2026.05.26 reasoning

Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, et al.

The paper presents a novel framework called Chartographer that generates counterfactual charts to evaluate visual reasoning in question-answering tasks. It reveals that vision-language models often fail to generalize when faced with updated charts requiring new reasoning pathways. This finding highlights the limitations of current models in handling visual reasoning tasks effectively.

Novelty
8.0
Reliability
7.5
arxiv/2605.27311
PASS ✓

Greening AI Inference with Accuracy and Latency-aware User Incentives

2026.05.26 infra

Vasilios A. Siris, Adamantia Stamou, George D. Stamoulis, et al.

This paper presents a framework for incentivizing AI inference based on users' preferences for quality, latency, and environmental consciousness. A key result is the introduction of a two-tier service subscription model that allows users to reduce carbon emissions in exchange for discounts. This approach provides flexibility for AI providers in managing inference requests during high carbon intensity periods.

Novelty
7.0
Reliability
6.5
arxiv/2605.27309
PASS ✓

Normal Guidance is what Attention Needs

2026.05.26 vision, code

Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

This paper explores a novel approach to training classifiers for 3D medical images using a single binary label. The proposed Normal Guidance technique significantly enhances attention-based methods for slice-level localization, outperforming state-of-the-art techniques while maintaining competitive performance in whole-scan classification.

Novelty
7.5
Reliability
8.0
arxiv/2605.27306
PASS ✓

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

2026.05.26 infra

Murat Moran

This paper presents a new framework for prioritizing alerts in intrusion detection systems by modeling uncertainty with fuzzy numbers. The key result shows that this approach significantly outperforms traditional methods in terms of robustness, especially under detector degradation scenarios.

Novelty
7.5
Reliability
8.0
arxiv/2605.27299
PASS ✓

Self-Ensembling Vision-Language Models for Chart Data Extraction

2026.05.26 vision, code

Thomas Berkane, Qianyi Wang, Maimuna S. Majumder

This paper introduces a novel self-ensembling method for extracting tabular data from charts, improving accuracy by up to 23% on a new benchmark. It addresses the limitations of existing models by aggregating multiple outputs to enhance reliability and accuracy. This advancement enables better reuse and analysis of data previously locked in chart images.

Novelty
7.5
Reliability
8.0
arxiv/2605.27298
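The aggregation idea in the self-ensembling entry above can be illustrated with a minimal sketch. It assumes each model run returns the chart's values as a dictionary from a cell label to a number, and it combines runs with a per-cell median. The function name `aggregate_runs`, the data layout, and the choice of median are illustrative assumptions, not the paper's method.

```python
from statistics import median


def aggregate_runs(runs):
    """Combine several extraction attempts into one table.

    `runs` is a list of dicts mapping a cell label to a number.
    This layout is hypothetical and chosen for illustration only.
    """
    labels = set()
    for run in runs:
        labels.update(run)
    combined = {}
    for label in sorted(labels):
        # Use only the runs that actually produced this cell.
        values = [run[label] for run in runs if label in run]
        combined[label] = median(values)
    return combined


runs = [
    {"2020": 10.0, "2021": 12.0},
    {"2020": 10.0, "2021": 21.0},  # one run misreads a value
    {"2020": 11.0, "2021": 12.0},
]
result = aggregate_runs(runs)
```

A median is used here because a single misread value in one run does not move the combined cell, which is the intuition behind aggregating multiple outputs.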
PASS ✓

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

2026.05.26 agents

Shijin Gong, Erhan Xu, Kai Ye, et al.

BASIS is a new algorithm that enhances the efficiency of value function estimation in reinforcement learning. It achieves a 69% reduction in MSE compared to a strong baseline while using only one rollout per prompt, leading to better policy optimization with less training time.

Novelty
8.0
Reliability
8.0
arxiv/2605.27293
PASS ✓

Detectability in Diversity: Improved Canary Crafting for Privacy Auditing in One Run

2026.05.26 data

Mathieu Dagréou, Aurélien Bellet

This paper introduces an efficient method for crafting canaries in privacy auditing, which enhances the accuracy of privacy leakage estimates while reducing computational costs. The approach combines influence functions with bilevel optimization to achieve better results than previous methods.

Novelty
7.5
Reliability
8.0
arxiv/2605.27292
PASS ✓

Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

2026.05.26 data, code

Yiding Liu, Yifan Hu, Hongjie Xia, et al.

Falcon-X is a new time series foundation model that improves forecasting by decoupling variates from the raw space and aligning them in a unified latent prototype space. It achieves state-of-the-art performance on key benchmarks, making it a valuable tool for complex multivariate forecasting tasks.

Novelty
8.0
Reliability
8.0
arxiv/2605.27286
PASS ✓

Causal Risk Minimization for High-Dimensional Treatments

2026.05.26 reasoning, code

Nikita Dhawan, Arnav Paruthi, Andrew Kim, et al.

This paper presents a new method for predicting the effects of interventions in high-dimensional spaces, such as text treatments. A key result is the demonstration that higher-order balance error optimization improves causal estimation, allowing a single model to address multiple causal questions effectively.

Novelty
7.5
Reliability
8.0
arxiv/2605.27281
PASS ✓

Transfer Learning using 66 Diseases for Disease Forecasting Applications

2026.05.26 data

Lauren J Beesley, Alexander C Murph, Dave Osthus, et al.

This paper explores the integration of multiple data streams for forecasting infectious diseases, showing that this approach improves performance in 84.9% of cases. It emphasizes the importance of data quality, indicating that irrelevant data can harm forecasts. A key contribution is a publicly available database for the forecasting community.

Novelty
8.0
Reliability
7.5
arxiv/2605.27269
PASS ✓

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

2026.05.26 data

Samer Awad, Javier Conde, Carlos Arriaga, et al.

This paper investigates how standard sampling methods in LLMs limit linguistic diversity. It introduces the Word Coverage Score (WCS) to measure the impact of these sampling filters on the use of low-frequency, high-information words. The key finding is that common sampling defaults can unintentionally censor diverse language, leading to more homogeneous text outputs.

Novelty
8.0
Reliability
7.5
arxiv/2605.27268
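To make the filtering effect in the entry above concrete, here is a toy sketch of standard nucleus (top-p) filtering on a single decoding step. It is not the paper's Word Coverage Score: the function name `top_p_reachable`, the four-word vocabulary, and the `coverage` ratio are assumptions made only for illustration.

```python
def top_p_reachable(probs, p):
    """Return the tokens that survive nucleus (top-p) filtering.

    `probs` maps token -> probability for one decoding step.
    Tokens are kept in descending probability order until their
    cumulative mass reaches `p`; everything else cannot be sampled.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = set(), 0.0
    for token, prob in ranked:
        kept.add(token)
        total += prob
        if total >= p:
            break
    return kept


# Hypothetical next-token distribution with one rare word.
step = {"the": 0.5, "a": 0.3, "lexical": 0.15, "sesquipedalian": 0.05}
reachable = top_p_reachable(step, p=0.9)
coverage = len(reachable) / len(step)  # fraction of words still reachable
```

With `p=0.9` the rarest word falls outside the nucleus, so it can never be emitted at this step regardless of how many samples are drawn.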
PASS ✓

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

2026.05.26 data

Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, et al.

The paper presents LUCoS, a novel method for selecting instances in low-label tabular learning by utilizing latent geometry from embeddings. It significantly outperforms random selection and traditional methods across various datasets and budgets, highlighting the importance of representativeness in context selection.

Novelty
8.0
Reliability
8.0
arxiv/2605.27254
PASS ✓

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

2026.05.26 data

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, et al.

This paper presents a new sentiment dataset for Setswana and analyzes how inter-annotator agreement declines over time. A key finding is that tweets whose labels were assigned within one minute of each other achieve a much higher agreement score than those labeled further apart, highlighting the role of temporal factors in annotation quality.

Novelty
7.0
Reliability
8.0
arxiv/2605.27239
PASS ✓

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

2026.05.26 infra

Zafar Hussain, Kristoffer Nielbo

This study reveals that a significant number of real user queries do not require LLM augmentation, contrary to synthetic query assumptions. By implementing a post-retrieval cascade, the authors improve retrieval quality and reduce latency, serving most queries without LLM augmentation. The key result is a 31.8% reduction in latency while maintaining high quality.

Novelty
7.0
Reliability
8.0
arxiv/2605.27220
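The cascade described in the entry above can be sketched roughly as follows: retrieve first, and call the LLM only when the retrieved results look weak. Everything here is hypothetical, including the `answer_query` function, the `min_score` threshold, and the stub retriever and augmenter; the paper's actual routing criteria are not given in the summary.

```python
def answer_query(query, retrieve, augment, min_score=0.7):
    """Try plain retrieval first; use LLM augmentation only as a fallback.

    `retrieve(query)` returns a list of (doc_id, score) pairs, best first.
    `augment(query)` returns a rewritten query (stands in for an LLM call).
    """
    hits = retrieve(query)
    if hits and hits[0][1] >= min_score:
        return hits, "direct"
    # Results look weak: rewrite the query and retrieve again.
    return retrieve(augment(query)), "augmented"


# Stub components for demonstration only.
def fake_retrieve(query):
    index = {
        "opening hours": [("doc-12", 0.91)],
        "hours open when": [("doc-12", 0.40)],
    }
    return index.get(query, [])


def fake_augment(query):
    return "opening hours"


easy = answer_query("opening hours", fake_retrieve, fake_augment)
hard = answer_query("hours open when", fake_retrieve, fake_augment)
```

Because the decision is made after retrieval, well-formed queries skip the augmentation call entirely, which is where a latency saving would come from.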
PASS ✓

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

2026.05.25 infra, code

Dingbang Wu, Rui Hao, Haiyang Wang, et al.

MobileGym is a new environment designed for mobile applications that allows for high interaction fidelity and scalable reinforcement learning. It provides structured evaluation and rewards, leading to a notable performance improvement in real-device execution. The key result shows a +12.8 percentage point gain on a test set, indicating its effectiveness.

Novelty
8.0
Reliability
7.5
arxiv/2605.26114
PASS ✓

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

2026.05.25 agents, code

Shangding Gu

This paper highlights the importance of designing modular and verifiable architectures around foundation models for agentic AI. It identifies key bottlenecks in context governance, trustworthy memory, and skill routing, proposing a new evaluation framework that focuses on the quality of agent behavior over simple task success. The key result is the introduction of CheetahClaws, a reference harness for evaluating these architectures.

Novelty
8.0
Reliability
7.0
arxiv/2605.26112
PASS ✓

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

2026.05.25 vision, code

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, et al.

This paper presents a new method for subject-driven image generation that effectively preserves identity while following textual instructions. By conditioning diffusion models on Multimodal Large Language Models and incorporating a VAE-based identity conditioning, the approach mitigates common issues like copy-paste artifacts. The key result shows significant improvement in human preference for generated images.

Novelty
8.0
Reliability
7.5
arxiv/2605.26111
PASS ✓

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

2026.05.25 infra, code

Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, et al.

The paper presents Prism, a new codebase designed to facilitate scalable Multimodal Continual Instruction Tuning (MCIT) research. By allowing independent plugin integration, it reduces implementation overhead and enhances code reuse. This approach aims to accelerate the development of new MCIT strategies.

Novelty
8.0
Reliability
7.5
arxiv/2605.26110
PASS ✓

Looped Diffusion Language Models

2026.05.25 scaling

Sanghyun Lee, Chunsan Hong, Seungryong Kim, et al.

This paper presents LoopMDM, a new approach that improves training efficiency and model performance in masked diffusion models by selectively looping transformer layers. The key result is that LoopMDM can achieve the same performance as larger models while using significantly fewer training resources, making it a compelling option for builders focused on efficiency.

Novelty
8.0
Reliability
8.0
arxiv/2605.26106
PASS ✓

Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models

2026.05.25 infra, code

Bar Weiss, Antonio Abu-Nassar, Adi Sosnovich, et al.

This paper presents a new approach to improve code review efficiency by using large language models to label code changes in patches. The proposed method achieves high recall and precision, suggesting it can effectively enhance traditional static analysis workflows.

Novelty
7.5
Reliability
8.0
arxiv/2605.26100
PASS ✓

Language Models Need Sleep

2026.05.25 reasoning

Sangyun Lee, Sean McLeish, Tom Goldstein, et al.

This paper presents a novel sleep-like mechanism for transformer models that allows them to handle long contexts more effectively. The key result shows that increasing the duration of this 'sleep' improves performance, particularly on tasks requiring deeper reasoning. This could be crucial for builders looking to enhance model efficiency in complex tasks.

Novelty
8.0
Reliability
7.5
arxiv/2605.26099
PASS ✓

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay

2026.05.25 data, code

Martin Marek, Dongkyu Cho, Shikai Qiu, et al.

This paper addresses the issue of forgetting in language models when trained on new tasks. It shows that self-generated samples can effectively serve as replay data, significantly reducing forgetting. The key result is that this method allows for high-learning-rate finetuning without the typical tradeoff of forgetting.

Novelty
7.5
Reliability
8.0
arxiv/2605.26097
PASS ✓

Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty

2026.05.25 agents

Jinwoo Go, Xiaoning Qian, Byung-Jun Yoon

GoBOED optimizes experimental designs for the downstream decision-making objective rather than for information gain alone. It demonstrates that the resulting designs can be more effective than those derived from traditional information-gain maximization. The analysis also reveals that optimal design windows are broader than previously thought.

Novelty
8.0
Reliability
7.5
arxiv/2605.26093
PASS ✓

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

2026.05.25 infra

Maoyang Xiang, Bo Wang, Tao Luo

This paper presents Orthogonal Residual Projection (ORP), a new framework that improves quantization for large language models on edge devices. ORP achieves a perplexity of 6.10 on LLaMA-2-7B under a 3-bit constraint, outperforming traditional methods while reducing calibration time to about 15 minutes. This advancement addresses critical timing bottlenecks and enhances hardware efficiency.

Novelty
8.0
Reliability
7.5
arxiv/2605.26092
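As background for the "multiplier-free power-of-two" idea in the entry above, the sketch below shows generic power-of-two rounding, where a weight becomes a sign and an exponent so that multiplication reduces to a bit shift. It does not implement the paper's Orthogonal Residual Projection. The helper names `quantize_pow2` and `shift_multiply` are hypothetical.

```python
import math


def quantize_pow2(w):
    """Round a weight to the nearest signed power of two (in log scale).

    Returns (quantized_value, exponent); a zero weight has no exponent.
    """
    if w == 0:
        return 0.0, None
    exponent = round(math.log2(abs(w)))
    return math.copysign(2.0 ** exponent, w), exponent


def shift_multiply(x, exponent):
    """Multiply a non-negative integer activation by 2**exponent using shifts.

    The weight's sign is assumed to be applied separately.
    """
    return x << exponent if exponent >= 0 else x >> -exponent


q, e = quantize_pow2(0.23)  # log2(0.23) is about -2.12, so the exponent is -2
y = shift_multiply(64, e)   # same as 64 * 0.25, computed without a multiplier
```

Storing only a sign and a small exponent per weight is what makes this family of schemes attractive on edge hardware, at the cost of coarse rounding that methods like ORP aim to compensate for.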
PASS ✓

Channel-wise Vector Quantization

2026.05.25 vision, code

Wei Song, Tianhang Wang, Yitong Chen, et al.

The paper presents Channel-wise Vector Quantization (CVQ), which improves image tokenization by using channel-wise tokens instead of patch-wise ones. This method leads to a new visual autoregressive model that enhances image generation quality, achieving high scores in evaluation metrics. The key result is that CVQ significantly improves reconstruction quality over traditional vector quantization methods.

Novelty
8.0
Reliability
7.5
arxiv/2605.26089
PASS ✓

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

2026.05.25 agents, code

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, et al.

DiscoverPhysics is a new benchmark that challenges LLMs to discover physics laws in simulated worlds with unique rules. The study reveals that even the best models struggle with complex tasks requiring hypothesis refinement and experimental design. This highlights the gap between predictive accuracy and conceptual understanding in LLMs.

Novelty
8.0
Reliability
7.5
arxiv/2605.26087
PASS ✓

Automated Benchmark Auditing for AI Agents and Large Language Models

2026.05.25 infra, code

Junlin Wang, Federico Bianchi, Shang Zhu, et al.

The paper presents Auto Benchmark Audit (ABA), a framework that identifies critical issues in AI benchmarks, such as ambiguous task design and incorrect ground truths. By auditing 168 benchmarks, ABA reveals that over 25.7% contain significant problems, which can distort model performance assessments. The tool and annotations are released to aid future benchmark development.

Novelty
8.0
Reliability
8.0
arxiv/2605.26079
PASS ✓

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

2026.05.25 agents

Zhaoyu Zhu, Rui Gao, Shuang Li

This paper develops a global convergence theory for Wasserstein policy gradient in reinforcement learning by utilizing the Bellman structure. The key result is that the Bellman recursion induces a favorable geometry that supports global convergence, despite the non-convex nature of the entropy-regularized RL objective.

Novelty
8.0
Reliability
7.5
arxiv/2605.26078
PASS ✓

StakeBench: Evaluating Language Understanding Grounded in Market Commitment

2026.05.25 data, community code

Yunhua Pei, Jingyu Hu, Yiwei Shi, et al.

StakeBench is a new framework for evaluating language models based on market commitments rather than subjective labels. It demonstrates that while models can partially recover position-side signals, they struggle with future action anticipation and collective odds projection. This highlights the need for better alignment between model predictions and market behavior.

Novelty
7.5
Reliability
7.0
arxiv/2605.26074
PASS ✓

Active Query Synthesis for Preference Learning

2026.05.25 data

Namrata Nadagouda, Nauman Ahad, Maegan Tucker, et al.

This paper presents a novel approach to active learning that improves the efficiency of user preference learning by addressing feedback reliability. The key result is the development of the Info-Synth framework, which generates optimal queries to enhance decision-making systems. This method shows versatility across various applications, including preference learning and robotic control.

Novelty
8.0
Reliability
7.5
arxiv/2605.26072
PASS ✓

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification

2026.05.25 data, code

Lingyu Gao, Will Monroe, David Smith, et al.

This paper presents a new framework for re-annotating multilingual speaker attributes using human-LLM collaboration. The key finding is that there are significant cross-lingual differences in how speaker attributes are annotated, highlighting both the potential and limitations of LLMs in this context.

Novelty
7.5
Reliability
8.0
arxiv/2605.26070
PASS ✓

Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding

2026.05.25 data

Rustem Takhanov, Zhenisbek Assylbekov

This paper presents conditional kernel ridge regression (conditional KRR), which improves upon standard KRR by focusing on the residuals of the regression function. The key result shows that conditional KRR can outperform standard KRR when the feature component is more significant than the residuals. This finding is backed by both theoretical analysis and experimental validation.

Novelty
6.5
Reliability
7.0
arxiv/2605.26067
PASS ✓

Paris 2.0: A Decentralized Diffusion Model for Video Generation

2026.05.25 vision

Ali Rouzbayani, Bidhan Roy, Marcos Villagra, et al.

Paris 2.0 is a video generation model trained with decentralized computation. It reports a 2.0x improvement in Fréchet Video Distance over previous methods. The result suggests that video generation models can be trained efficiently without relying on large GPU clusters.

Novelty
8.0
Reliability
7.5
arxiv/2605.26064
PASS ✓

Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

2026.05.25 agents, code

Waleed Razzaq, Yun-Bo Zhao

The Neuronal Stochastic Attention Circuit (NSAC) is a new architecture that enhances uncertainty quantification in continuous-time learning tasks. It effectively combines Gaussian negative log-likelihood with a regularizer to improve predictive variance. The key result is that NSAC provides well-calibrated uncertainty estimates while maintaining competitive accuracy across various applications.

Novelty
8.5
Reliability
7.5
arxiv/2605.26061
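For readers unfamiliar with the loss named in the NSAC entry above, the standard Gaussian negative log-likelihood for a single target $y$, with predicted mean $\mu(x)$ and predicted variance $\sigma^2(x)$, is shown below. The combined form with a weight $\lambda$ and a regularizer $R$ is only a generic template: the summary does not specify the regularizer, so both symbols are assumptions.

```latex
\mathcal{L}_{\mathrm{NLL}}(x, y)
  = \frac{1}{2}\log\bigl(2\pi\,\sigma^2(x)\bigr)
  + \frac{\bigl(y - \mu(x)\bigr)^2}{2\,\sigma^2(x)},
\qquad
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{NLL}} + \lambda\, R .
```

The first term penalizes large predicted variance and the second penalizes errors more heavily when the predicted variance is small, which is why this loss is commonly paired with a regularizer on the variance.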
PASS ✓

Accelerating Bayesian inverse design in computational fluid dynamics using neural operators

2026.05.25 infra, code

Bipin Tiwari, Omer San

This work shows that neural operator surrogates can speed up Bayesian inference in aerodynamic design by more than three orders of magnitude. The method preserves the integrity of the uncertainty estimates while allowing efficient geometry reconstruction.

Novelty
8.0
Reliability
8.0
arxiv/2605.26059
PASS ✓

Retrying vs Resampling in AI Control

2026.05.25 agents

James Lucassen, Adam Kaufman

This paper explores the concepts of retrying and resampling in AI coding tools, highlighting how retrying can reduce suspicion scores but may also allow for sneakier attacks. A key finding is that auditing based on maximum suspicion scores during resampling significantly improves safety without sacrificing usefulness.

Novelty
7.5
Reliability
8.0
arxiv/2605.26047
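The max-score auditing rule mentioned in the entry above can be sketched in a few lines. The function name `should_audit`, the example scores, and the 0.8 threshold are hypothetical and serve only to show why the maximum matters.

```python
def should_audit(suspicion_scores, threshold):
    """Flag a task for audit if any resampled attempt looks suspicious.

    Using the maximum means one suspicious sample is enough to trigger
    an audit, even when later samples look benign.
    """
    return max(suspicion_scores) >= threshold


samples = [0.2, 0.9, 0.1]  # hypothetical monitor scores for three resamples
audit_on_max = should_audit(samples, threshold=0.8)
audit_on_last = samples[-1] >= 0.8  # what scoring only the final attempt gives
```

Scoring only the final attempt would miss the suspicious second sample here, which illustrates how retrying can lower the visible suspicion score.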
PASS ✓

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

2026.05.25 agents, community code

Parth Darshan, Abhishek Divekar

This paper explores the challenges of customizing large language models for specific tasks using textual gradient methods. A key result is that combining multiple task instructions into a single prompt can significantly degrade performance, highlighting the need for careful design in multi-objective optimization.

Novelty
6.5
Reliability
7.0
arxiv/2605.26046
PASS ✓

L2IR: Revealing Latent Intent in Graph Fraud Detection

2026.05.25 data

Jinsheng Guo, Zhenhao Weng, Yibo Liu, et al.

The paper presents L2IR, a framework that leverages large language models to reveal latent intent in graph fraud detection. By distinguishing between supportive and misleading connections, L2IR improves detection performance significantly, achieving an AUPRC increase of up to 8.27%. This method shows promise for enhancing existing GNN-based detectors.

Novelty
8.0
Reliability
7.5
arxiv/2605.26040
PASS ✓

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

2026.05.25 reasoning

Xinrui Shi, Kai Liu, Ziqing Zhang, et al.

This paper presents DRBench, a new benchmark for evaluating dense-scene reasoning in vision-language models, and DRScaffold, a fine-tuning framework that enhances grounded reasoning. The key result shows that a smaller model trained with structured supervision can outperform a larger frozen model, highlighting the effectiveness of the proposed approach.

Novelty
8.0
Reliability
8.0
arxiv/2605.26038
PASS ✓

Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

2026.05.25 agents, code

Tianda Sun, Dimitar Kazakov

This paper explores the challenges of using a minimal knowledge-graph tool API in reinforcement learning. A key result is that the tool-grounded answer rate improves initially but then collapses, highlighting the importance of interface feedback in the learning process.

Novelty
7.0
Reliability
7.5
arxiv/2605.26037
PASS ✓

CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities

2026.05.25 data, code

Junyuan Liu, Xinglei Wang, Zichao Zeng, et al.

CityRep is a new benchmark for evaluating urban representation learning that mitigates spatial leakage and supports fair comparisons across different cities and tasks. The key finding is that performance varies significantly based on the evaluation split used, highlighting the importance of rigorous benchmarking in this field.

Novelty
8.0
Reliability
8.0
arxiv/2605.26036
PASS ✓

Length Generalization with Log-Depth Recurrent Units

2026.05.25 data, code

Charles Pert, Dalal Alrajeh, Alessandra Russo

The paper presents MLP-LDRU, a novel architecture that addresses length generalization in neural networks. It achieves high accuracy on various regular-language tasks, outperforming existing recurrent and attention-based models. This advancement could improve performance on tasks where inputs are longer than those seen during training.

Novelty
8.0
Reliability
8.0
arxiv/2605.26035
PASS ✓

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

2026.05.25 vision, community code

Zixin Jessie Chen, Zhuo Chen, Archer Wang, et al.

The SKILD model introduces a unified framework for image generation and super-resolution, achieving impressive results on CIFAR-10 and ImageNet. It operates without task-specific architectures or retraining, making it a versatile tool for image processing.

Novelty
8.5
Reliability
8.0
arxiv/2605.26032
PASS ✓

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

2026.05.25 agents, code

Junlin Yang, Dylan Zhang, Xiangchen Song, et al.

CausaLab is a new environment for testing how well LLMs can understand and predict causal relationships. A key finding is that while GPT-5.2-high achieves high task accuracy, it struggles with causal understanding, highlighting the need for better intervention strategies.

Novelty
8.0
Reliability
7.5
arxiv/2605.26029
PASS ✓

Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service

2026.05.25 data

Christoffer Loeffler, Tomás Rey Pizarro, Daniel Ignacio Miranda Vásquez, et al.

This paper presents a framework for automatically detecting abusive clauses in Chilean Terms of Service, leveraging retrieval-augmented generation techniques. A key result shows that this approach allows local models to perform comparably to larger cloud-based systems while being more cost-effective.

Novelty
7.5
Reliability
8.0
arxiv/2605.26019
PASS ✓

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

2026.05.25 reasoning, code

Yiming Liang, Yixiao Chen, Yiyang Zhou, et al.

The STORM framework enhances video reasoning by internalizing the reasoning process through latent trajectories instead of relying on external tools or textual chains. This approach significantly improves accuracy while reducing inference time. The key result shows that STORM outperforms existing methods in both efficiency and effectiveness.

Novelty
8.0
Reliability
7.5
arxiv/2605.26014
PASS ✓

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

2026.05.25 agents, code

Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, et al.

AdvantageFlow is a new reinforcement learning algorithm that optimizes a forward-process prediction loss for flow models. It stabilizes the optimization problem through rollout policy regularization, leading to improved performance in image generation tasks. The key result shows that AdvantageFlow outperforms both Flow-GRPO and a state-of-the-art baseline.

Novelty
7.5
Reliability
8.0
arxiv/2605.26013
PASS ✓

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

2026.05.25 data, code

Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, et al.

This paper presents the first evaluation of transformer-based models for dementia detection in Filipino speech. The key finding is that bilingual fine-tuning significantly improves model performance, reaching a Macro-F1 score of 0.969 to 0.973, which underscores the need for linguistic coverage in training.

Novelty
8.0
Reliability
8.0
arxiv/2605.26007
PASS ✓

AI-Assisted Systematization for Evaluating GenAI Systems

2026.05.25 agents

Dhruv Agarwal, Emily Sheng, Chad Atalla, et al.

This paper addresses the challenge of evaluating generative AI systems by introducing AI-assisted systematization. It presents a structured representation of concepts and evaluates the quality of generated concept specs for hate-based rhetoric and digital empathy. The key result is that AI assistance can effectively support the systematization process, improving clarity in evaluation.

Novelty
7.0
Reliability
7.5
arxiv/2605.26001
PASS ✓

Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance

2026.05.25 · data · community code

Jose Blanchet, Peter Glynn, Wenhao Yang

This paper introduces a model-agnostic method for creating confidence regions from stochastic gradient descent (SGD) trajectories, addressing challenges in statistical inference when gradients have infinite variance. The key result is that the proposed method is straightforward to implement and provides asymptotically valid confidence regions in both finite- and infinite-variance scenarios.

Novelty
7.0
Reliability
8.0
arxiv/2605.26000
PASS ✓

Causal methods for LLM development and evaluation

2026.05.25 · alignment

Dennis Frauen, Marie Brockschmidt, Konstantin Hess, et al.

This paper argues for the integration of causal methods in the development and evaluation of large language models. It highlights how these methods can address confounding factors and improve the reliability of LLMs. The key result is that causal methods can enhance the understanding of interventions in LLM training and evaluation.

Novelty
8.0
Reliability
7.0
arxiv/2605.25998
PASS ✓

Deployment-complete benchmarking

2026.05.25 · infra · code

El Mustapha Mansouri, Keigo Arai

This paper presents a novel approach to benchmarking that emphasizes the importance of deployment actions over mere scores. A key finding is that traditional benchmarks often fail to provide sufficient evidence for deployment decisions, highlighting the need for more comprehensive evaluation methods.

Novelty
7.0
Reliability
8.0
arxiv/2605.25997
PASS ✓

Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

2026.05.25 · infra · code

Inés Gonzalez-Pepe, Hiba Akhaddar, Tristan Glatard, et al.

Fuzzy PyTorch is a new framework that allows for efficient evaluation of numerical variability in deep learning models. It integrates stochastic arithmetic into PyTorch, achieving significant runtime reductions while maintaining model performance. This tool is particularly valuable for researchers and practitioners looking to manage floating-point uncertainty effectively.

Novelty
8.0
Reliability
8.0
arxiv/2605.25991
PASS ✓

What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

2026.05.25 · agents · code

Yuelyu Ji, Min Gu Kwak, Hang Zhang, et al.

This paper explores the integration of claim-level NLI checkers into retrieval-augmented reinforcement learning for medical applications. A key finding is that the output distribution of the NLI checker during training significantly influences the quality of the model, with moderate signals yielding better results than strong signals. This insight can help practitioners optimize their reward systems in RL settings.

Novelty
8.0
Reliability
7.5
arxiv/2605.25988
PASS ✓

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

2026.05.25 · reasoning · code

Weizhi Fei, Hang Yin, Zihao Wang, et al.

The NS3 framework offers a novel approach to answering complex queries over knowledge graphs by approximating joint rankings without exhaustive enumeration. It improves joint ranking performance while maintaining strong accuracy on marginal queries. This advancement is particularly valuable for practitioners dealing with multi-variable queries in knowledge representation.

Novelty
8.0
Reliability
8.0
arxiv/2605.25985
PASS ✓

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

2026.05.25 · agents · code

Michael Orme, Yanchao Yu, Zhiyuan Tan

SafeCtrl-RL is a novel framework for ensuring safe behavior in large language models during inference. It allows for adaptive safety regulation without the need for retraining, improving both safety and response quality. The key result is that it consistently outperforms existing prompt-based optimization methods.

Novelty
8.0
Reliability
7.5
arxiv/2605.25984
PASS ✓

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

2026.05.25 · agents · code

Liyun Zhang, Jiayi Guo

This paper investigates how different types of perturbations affect the reasoning of large language models. It finds that meaning-bearing perturbations lead to greater inconsistencies in answers compared to presentation perturbations. This insight could inform future model training and evaluation strategies.

Novelty
7.5
Reliability
8.0
arxiv/2605.25981
PASS ✓

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

2026.05.22 · agents · code

Yifan Yang, Ziyang Gong, Weiquan Huang, et al.

SkillOpt is a novel optimizer for agent skills that improves performance by applying a controlled text-space optimization approach. It significantly enhances the accuracy of various models in different execution environments, demonstrating its effectiveness across multiple benchmarks.

Novelty
8.0
Reliability
8.0
arxiv/2605.23904
PASS ✓

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

2026.05.22 · scaling

Xu Ouyang, Deyi Liu, Yuhang Cai, et al.

The paper presents the Shannon Scaling Law, which models LLM training as information transmission, capturing the effects of noise on performance. A key result is that failing to maintain a sufficient signal-to-noise ratio leads to performance degradation, which is effectively predicted by this new framework.

Novelty
8.0
Reliability
8.0
arxiv/2605.23901
PASS ✓

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

2026.05.22 · agents · code

Zisu Huang, Jingwen Xu, Yifan Yang, et al.

This paper investigates the lifecycle of skills in language agents, focusing on their extraction and consumption. A key finding is that model-generated skills generally improve performance but can lead to negative transfer, highlighting the complexity of skill utility across different models. The authors propose a meta-skill to enhance skill extraction and reduce negative transfer.

Novelty
8.0
Reliability
7.0
arxiv/2605.23899
PASS ✓

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

2026.05.22 · reasoning

Jianshu Zhang, Yijiang Li, Huifeixin Chen, et al.

This paper investigates how well Vision-Language Models understand numerical outputs in spatial contexts. The key finding is that these models often fail to ground numerical values in spatial meaning, performing close to random guessing. Improvements through tuning were noted, but explicit reasoning provided only marginal benefits.

Novelty
7.5
Reliability
6.5
arxiv/2605.23898
PASS ✓

ETCHR: Editing To Clarify and Harness Reasoning

2026.05.22 · reasoning · code

Beichen Zhang, Yuhong Liu, Jinsong Li, et al.

The paper presents ETCHR, a novel image editing model designed to enhance visual reasoning in multimodal large language models. It improves reasoning accuracy significantly across various tasks, achieving notable performance gains with different models.

Novelty
8.0
Reliability
7.0
arxiv/2605.23897
PASS ✓

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

2026.05.22 · scaling

Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, et al.

Complete-muE is a framework that enables efficient hyperparameter transfer from dense models to Mixture-of-Experts (MoE) models. It allows for stable hyperparameter optimization across different model architectures, significantly speeding up convergence without extensive hyperparameter searches. The key result is that hyperparameters tuned on a single dense model can be effectively transferred to all MoE configurations.

Novelty
8.0
Reliability
7.5
arxiv/2605.23893
PASS ✓

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

2026.05.22 · vision · code

Shuhong Zheng, Michael Oechsle, Erik Sandström, et al.

This paper presents a two-stage token selection framework that enhances the efficiency of visual geometry transformers for 3D reconstruction. By reducing the number of tokens each query interacts with, the method accelerates processing by over 85% while maintaining or improving performance. This advancement could significantly impact future applications in the field.

Novelty
8.0
Reliability
7.0
arxiv/2605.23892
PASS ✓

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

2026.05.22 · data

Joydeep Chandra

CHRONOS is a new architecture designed to improve recall and privacy in temporal knowledge-graph data marketplaces. It achieves a high recall rate of 0.937 while maintaining competitive query performance and privacy guarantees. This makes it a promising solution for managing evolving data and privacy constraints.

Novelty
8.0
Reliability
7.5
arxiv/2605.23887
PASS ✓

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

2026.05.22 · data

Anastasiia Sedova, Natalie Schluter, Skyler Seto, et al.

The LINK method enhances cross-lingual knowledge transfer by using lexical substitutions in high-resource training data. This approach requires only a bilingual vocabulary and leads to significant improvements in downstream tasks, achieving up to a 2x speedup in training time for equivalent performance.

Novelty
7.5
Reliability
8.0
arxiv/2605.23885
PASS ✓

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

2026.05.22 · vision

Rim Assouel, Amir Bar, Michal Drozdzal, et al.

This paper introduces Procedurally Generated Tasks (PGT) to enhance fine-grained visual understanding in Multimodal Large Language Models. The key result is that instruction tuning with PGT data improves performance by up to 20% on the What'sUp benchmark, indicating that better supervision can address spatial reasoning deficits.

Novelty
8.0
Reliability
7.5
arxiv/2605.23883
PASS ✓

On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy

2026.05.22 · data

Aratrika Mustafi, Soumya Mukherjee

This paper introduces a perturbation theory for spherical Hellinger-Kantorovich gradient flows, allowing for the comparison of flows from different potentials. A key result is the establishment of uniform bounds for log-likelihood ratios and divergences, which can be applied to enhance sampling methods in differential privacy.

Novelty
7.5
Reliability
8.0
arxiv/2605.23879
PASS ✓

Training-Free Looped Transformers

2026.05.22 · scaling

Lizhang Chen, Jonathan Li, Chen Liang, et al.

This paper presents a training-free method for enhancing transformer models by applying a looping strategy at inference time. The key result is a significant performance improvement on various benchmarks, including a 2.64 percentage point increase on MMLU-Pro for Qwen3-4B-Instruct.

Novelty
8.0
Reliability
7.0
arxiv/2605.23872
PASS ✓

Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

2026.05.22 · scaling

Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur

This paper presents a new gradient flow for optimizing matrix-valued parameters using a regularized version of the Muon optimizer. The key result is the establishment of a damped Hamiltonian dynamics that ensures energy dissipation and convergence rates under certain conditions, which could enhance training in neural networks.

Novelty
8.0
Reliability
7.0
arxiv/2605.23871
PASS ✓

Human Decision-Making with Persuasive and Narrative LLM Explanations

2026.05.22 · reasoning

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, et al.

This study investigates how LLM-generated narrative explanations affect human decision-making. The key finding is that the persuasiveness of these narratives does not significantly improve decision accuracy compared to AI predictions alone, and may even lengthen response times.

Novelty
6.0
Reliability
7.0
arxiv/2605.23867
PASS ✓

Leveraging Foundation Models for Causal Generative Modeling

2026.05.22 · reasoning

Aneesh Komanduri, Xintao Wu

FM-CGM is a new framework that enables visual causal reasoning by integrating pretrained foundation models. It allows for zero-shot causal discovery and counterfactual generation, making it valuable for applications requiring reliable causal inference. A key result is its ability to identify plausible causal structures effectively.

Novelty
8.0
Reliability
7.0
arxiv/2605.23861
PASS ✓

Strong Teacher Not Needed? On Distillation in LLM Pretraining

2026.05.22 · scaling · code

Taiming Lu, Zhuang Liu

This study reveals that even weaker teachers can enhance larger student models when using a proper mix of losses. It also shows that stronger teachers do not always yield better results, as excessive parameters or training can diminish distillation benefits. Importantly, distillation is found to improve generalization more effectively than in-domain fitting.

Novelty
7.5
Reliability
7.0
arxiv/2605.23857
PASS ✓

Entrywise Error Bounds for Spectral Ranking with Semi-Random Adversaries

2026.05.22 · data

Dongmin Lee, Anuran Makur, Japneet Singh

This work explores how the performance of spectral algorithms for BTL estimation can be affected by adversarial sampling. The key finding is that by reweighting observed edges, the performance can be improved to match that of uniformly sampled graphs. This insight is crucial for practitioners dealing with biased data in ranking tasks.

Novelty
7.5
Reliability
7.0
arxiv/2605.23854
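
As background for the summary above, BTL refers to the Bradley-Terry-Luce model, in which each item has a positive score and the probability that one item beats another is its share of the two scores. The Python sketch below shows only that standard win probability; the function name and example scores are illustrative assumptions, and it does not implement the paper's spectral estimator or its edge reweighting.

```python
def btl_win_probability(w_i: float, w_j: float) -> float:
    """Probability that item i beats item j under the Bradley-Terry-Luce model."""
    if w_i <= 0 or w_j <= 0:
        raise ValueError("BTL scores must be strictly positive")
    return w_i / (w_i + w_j)


# Hypothetical scores: item i is three times as strong as item j.
print(btl_win_probability(3.0, 1.0))
```

Ranking algorithms for this model try to recover the scores from observed comparison outcomes, which is why the choice of which pairs get compared matters.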
PASS ✓

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

2026.05.22 · vision · code

Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, et al.

ToolMerge is a new keyframe retrieval method that leverages LLMs to improve the selection process for long-video question answering. It effectively decomposes queries into tool calls and merges their results, showing a notable 5% improvement in caption retrieval over existing methods. This approach enhances the ability to provide verifiable visual evidence for various types of queries.

Novelty
8.0
Reliability
7.0
arxiv/2605.23826
PASS ✓

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

2026.05.22 · alignment · code

Stuart Bladon, Brinnae Bent

The research indicates that geopolitical biases in language models are primarily influenced by post-training rather than pre-training. Notably, the model from Alibaba showed a significant shift in bias towards China after post-training, emphasizing the importance of oversight in model alignment processes.

Novelty
8.0
Reliability
7.5
arxiv/2605.23825
PASS ✓

Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

2026.05.22 · reasoning

Andres Nava, Matthieu Wyart

This paper presents a theory that explains how the relationship between general and specific concepts is geometrically represented in language models. The key finding is that the structure of word embeddings reflects a hierarchical organization that mirrors taxonomic relationships, which can be observed in both word2vec and Gemma 2B embeddings.

Novelty
8.0
Reliability
7.5
arxiv/2605.23821
PASS ✓

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

2026.05.22 · vision

Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, et al.

This study investigates how human-like visual representations can be better understood through a balance of discriminative and generative learning. The key finding is that human alignment is maximized at intermediate points of this continuum, suggesting that a hybrid approach yields better results in vision tasks.

Novelty
8.0
Reliability
7.0
arxiv/2605.23819
PASS ✓

Advanced AI Service Provisioning in O-RAN through LLM Engine Integration

2026.05.22 · agents

Seyed Bagher Hashemi Natanzi, Pranshav Gajja, Bo Tang, et al.

This paper introduces a Dual-Brain architecture that leverages LLMs for orchestrating data collection and deployment in O-RAN systems, while an automated ML engine trains classifiers on demand. The key result is the ability to streamline the development of AI applications for real-time RAN control, enhancing efficiency.

Novelty
8.0
Reliability
7.0
arxiv/2605.23809
PASS ✓

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models

2026.05.22 · vision · code

Bo Peng, Jie Lu, Guangquan Zhang, et al.

This paper presents a new approach to out-of-distribution detection using pre-trained vision-language models. The key result is that the authors' method for debiasing negative label mining significantly improves OOD detection performance across various setups.

Novelty
8.0
Reliability
7.0
arxiv/2605.23797
PASS ✓

Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

2026.05.22 · multimodal · code

Haoyuan Wang, Xiaohao Liu, Jiajie Su, et al.

This paper addresses the challenge of updating knowledge in multimodal large language models without losing existing capabilities. The authors propose new techniques to enhance the generalization of knowledge edits, demonstrating that their methods can effectively maintain consistent predictions across semantically similar inputs. A key result is the introduction of adversarial variants that improve robustness in knowledge editing.

Novelty
8.0
Reliability
7.0
arxiv/2605.23780
PASS ✓

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

2026.05.21 · scaling

Nick Merrill, Jaeho Lee, Ezra Karger

The paper reveals that larger language models perform worse on forecasting tasks involving superlinear growth and tail risks, particularly in the upper tail of distributions. This inverse scaling effect suggests that more capable models may misestimate extreme outcomes while maintaining lower-tail accuracy. The authors recommend using continuous accuracy measures for better evaluation of LLM forecasting.

Novelty
8.0
Reliability
7.0
arxiv/2605.22672
PASS ✓

Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates

2026.05.19 · data

Saurabh Bhandari, Parveen Bhatti, Brian C. -H. Chiu, et al.

The spBART model effectively combines interpretable low-dimensional covariates with complex high-dimensional predictors. It successfully identifies important genomic loci and achieves a high out-of-sample discrimination rate (AUC = 0.96) in multiple myeloma studies.

Novelty
8.0
Reliability
7.0
arxiv/2605.20143
PASS ✓

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

2026.05.18 · data · code

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, et al.

This study benchmarks five commercial ASR systems on code-switching speech in Arabic, Persian, and German. The key finding is that ElevenLabs Scribe v2 outperforms the others, with the lowest WER and highest BERTScore, highlighting significant quality differences among ASR systems.

Novelty
7.5
Reliability
8.0
arxiv/2605.19069
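
For context on the WER figure mentioned in the summary above, word error rate is the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The Python sketch below is a minimal illustration; the function name is hypothetical, and whitespace tokenization with no text normalization is an assumption that may differ from what the study used, especially for mixed-script code-switched text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the output contains many inserted words, so it is an error rate rather than a bounded accuracy score.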
PASS ✓

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

2026.05.18 · agents

Junyao Yang, Chen Qian, Kun Wang, et al.

This paper presents a new approach to improving reasoning in Large Reasoning Models that exploits a correlation between token entropy and logit gradients. The key result is that the proposed method, CorR-PO, consistently outperforms existing techniques, indicating that stronger entropy inversions lead to better reasoning performance.

Novelty
8.0
Reliability
7.5
arxiv/2605.17770
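
The token entropy mentioned in the summary above is usually taken to mean the Shannon entropy of the model's next-token distribution. The Python sketch below computes it from a list of raw logits; the function name and example logits are illustrative assumptions, and nothing here reproduces the paper's CorR-PO method or its gradient analysis.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits: a flat distribution is maximally uncertain,
# a sharply peaked one has entropy close to zero.
print(token_entropy([0.0, 0.0, 0.0, 0.0]))
print(token_entropy([100.0, 0.0, 0.0]))
```

High entropy at a position means the model is spreading probability over many candidate tokens, which is the quantity such analyses track across a reasoning trace.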
PASS ✓

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

2026.05.13 · reasoning · code

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, et al.

This paper presents a framework for interpreting EEG foundation models by extracting sparse feature dictionaries and grounding them in clinical taxonomies. A key result is the identification of operational regimes that reveal critical representational failures, impacting clinical trust in model predictions.

Novelty
8.0
Reliability
7.0
arxiv/2605.13930
PASS ✓

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

2026.04.28 · agents

Ben Knight, Wm. Matthew Kennedy, Danielle Carvalho, et al.

The paper highlights how AI language learning tools can provide misleading feedback that reinforces misconceptions. It introduces L2-Bench, a benchmark for assessing AI feedback quality across six critical dimensions. The key result is the identification of 'explainability pitfalls' that can harm learning outcomes.

Novelty
8.0
Reliability
7.0
arxiv/2604.26145
PASS ✓

SUDP: Secret-Use Delegation Protocol for Agentic Systems

2026.04.27 · agents · code

Xiaohang Yu, Hejia Geng, Xinmeng Zeng, et al.

The paper addresses the security risks associated with agentic systems using user secrets by formalizing the Agent Secret Use (ASU) problem. It proposes the Secret-Use Delegation Protocol (SUDP), which allows secure operations without granting reusable authority to untrusted requesters. This approach ensures that user-authorized actions are performed safely and effectively.

Novelty
8.0
Reliability
8.0
arxiv/2604.24920
PASS ✓

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian

2026.04.18 · data

Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, et al.

RoIt-XMASA is a new multilingual dataset for sentiment analysis that includes 36,000 labeled reviews in Italian and Romanian. The proposed adversarial training framework improves sentiment discrimination while maintaining language and domain invariance, achieving a notable F1-score of 66.23% with XLM-R.

Novelty
7.5
Reliability
7.0
arxiv/2604.17134
PASS ✓

Safe Reinforcement Learning with Preference-based Constraint Inference

2026.03.24 · agents

Chenglin Li, Grant Ruan, Hua Geng

This study presents a new approach called Preference-based Constrained Reinforcement Learning (PbCRL) that effectively infers safety constraints from human preferences. A key result is that PbCRL achieves better alignment with true safety requirements while outperforming existing methods in both safety and reward metrics.

Novelty
8.0
Reliability
7.5
arxiv/2603.23565
PASS ✓

Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

2026.03.08 · vision

Zongyu Guo, Jiajun He, Zhaoyang Jia, et al.

This paper presents a novel visual representation framework that encodes signals as functions, allowing for efficient video compression. The key result is the ability to hash an 81-frame video into a compact vector while enabling control over compression performance.

Novelty
8.0
Reliability
7.0
arxiv/2603.07615
PASS ✓

Entropy-Aware On-Policy Distillation of Language Models

2026.03.07 · alignment · code

Woogyeol Jin, Taywon Min, Yongjin Yang, et al.

The paper presents Entropy-Aware On-Policy Distillation, which improves knowledge transfer between language models by balancing precision and diversity. The key result shows significant accuracy gains across various benchmarks, indicating that accounting for teacher uncertainty enhances student-teacher alignment.

Novelty
8.0
Reliability
7.5
arxiv/2603.07079
PASS ✓

Certified Per-Instance Unlearning Using Individual Sensitivity Bounds

2026.02.17 · DI-ENS · data

Hanna Benarroch, Jamal Atif, Olivier Cappé

This work presents a new method for certified machine unlearning that uses adaptive noise calibration based on individual data point contributions. The key result is that this approach allows for certified unlearning with significantly less noise injection compared to traditional methods, improving practical applicability. The findings are supported by both theoretical analysis and experimental results.

Novelty
8.0
Reliability
7.0
arxiv/2602.15602
PASS ✓

Linear Regression with Unknown Truncation Beyond Gaussian Features

2026.02.13 · data

Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, et al.

This paper presents a novel algorithm for truncated linear regression that operates efficiently even when the survival set is unknown. It achieves a polynomial runtime with respect to the number of dimensions and desired accuracy, making it more practical for real-world applications. The approach also contributes to positive-only PAC learning, which could be beneficial for future research.

Novelty
8.0
Reliability
7.0
arxiv/2602.12534
PASS ✓

Cascaded Transfer: Learning Many Tasks under Budget Constraints

2026.01.29 · CB · scaling · code

Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, et al.

Cascaded Transfer Learning (CTL) allows for efficient learning across multiple related tasks by organizing them hierarchically. The approach minimizes transfer errors and maximizes accuracy within a constrained training budget, showing significant improvements in performance, especially under tight budgets.

Novelty
8.0
Reliability
7.0
arxiv/2601.21513
PASS ✓

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

2026.01.07 · agents · code

Weijie Shi, Yanxi Chen, Zexi Li, et al.

R$^3$L improves reinforcement learning by synthesizing high-quality trajectories through a reflect-then-retry approach. This method enhances exploration and exploitation by using language feedback to correct errors and optimize training stability. The key result shows a 5% to 52% relative improvement over existing methods.

Novelty
8.0
Reliability
7.0
arxiv/2601.03715
PASS ✓

On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

2025.12.22 · scaling

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, et al.

This paper presents a new framework for establishing generalization bounds in multitask deep neural networks. By using operator-theoretic techniques and a tailored Sobolev space, the authors achieve tighter bounds that are effective even in single output scenarios. This approach enhances theoretical understanding and offers flexibility in multitask deep learning applications.

Novelty
8.0
Reliability
7.0
arxiv/2512.19199
PASS ✓

Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

2025.12.22 · scaling

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, et al.

This paper develops new generalization bounds for vector-valued neural networks, enhancing multi-task learning through a novel framework. The key result is the introduction of sketching techniques that improve computational efficiency while providing performance guarantees for various applications. This work significantly advances understanding of generalization in deep learning architectures.

Novelty
8.0
Reliability
7.0
arxiv/2512.19184
PASS ✓

Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework

2025.12.12 · data

M. Gorpinich, B. Moya, S. Rodriguez, et al.

This paper presents a hybrid twin approach that uses Graph Neural Networks to model the ignorance in physics-based simulations. The key result is that the GNN effectively captures missing physics and improves simulation accuracy while reducing data requirements, making it practical for real-world applications.

Novelty
8.0
Reliability
7.5
arxiv/2512.15767
PASS ✓

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

2025.12.08 · vision

Bo Gao, Jingcheng Tong, Xingsheng Chen, et al.

DFIR-DETR improves small object detection by addressing issues in attention mechanisms and feature upsampling. It achieves a mean Average Precision (mAP50) of 92.9% on NEU-DET and 51.6% on VisDrone with a compact model size of 11.7M parameters. This demonstrates effective performance across different detection scenarios.

Novelty
8.0
Reliability
7.0
arxiv/2512.07078
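
The mAP50 metric quoted in the summary above averages precision over classes while counting a predicted box as correct only if its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The Python sketch below shows that overlap test alone; the function names and the `(x1, y1, x2, y2)` corner format are assumptions for illustration, and the full mAP computation (score ranking, per-class precision-recall curves) is omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred_box, gt_box):
    """True when a prediction overlaps the ground truth enough to count at IoU 0.5."""
    return iou(pred_box, gt_box) >= 0.5


# Hypothetical boxes: the prediction covers 60% of the union, so it counts.
print(matches_at_50((0, 0, 10, 10), (0, 0, 10, 6)))
```

For small objects a shift of a few pixels can push IoU below the threshold, which is one reason small-object detection scores tend to be lower on this metric.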
PASS ✓

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

2025.11.19 · infra · code

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, et al.

DCC is a new ML compiler that optimizes data rearrangements and compute code for PIM devices, significantly improving performance. It achieves up to 13.17x speedup on specific PIM architectures compared to GPU-only execution, which is crucial for builders focused on maximizing efficiency in ML applications.

Novelty
8.0
Reliability
8.0
arxiv/2511.15503
PASS ✓

Are Targeted Data Poisoning Attacks as Effective as We Think?

2025.09.08 · data

William Xu, Chenyu Zhang, Yihan Wang, et al.

The paper presents a novel approach to identify the easiest and hardest samples to poison in targeted data poisoning attacks. By leveraging clean model information, it enables better evaluation of attack effectiveness and proactive defenses against vulnerabilities. A key result is the reliable stratification of samples by poisoning vulnerability.

Novelty
8.0
Reliability
7.5
arxiv/2509.06896
PASS ✓

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

2025.08.19 · reasoning · code

Daniel Daza, Alberto Bernardi, Luca Costabello, et al.

This paper presents a new approach to query answering in knowledge graphs that incorporates soft constraints, allowing users to express preferences. The key result is that the proposed methods maintain robust performance while adding minimal overhead, enabling more flexible interactions with graph databases.

Novelty
8.0
Reliability
7.0
arxiv/2508.13663