HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
Lizhi Yang, Junheng Li, Nehar Poddar, et al.
The authors developed a new controller for humanoid robots called HANDOFF, which makes it easier to manage complex tasks. Unlike previous methods that required detailed movement instructions, HANDOFF uses a simpler interface that can adapt to various tasks. This change allows robots to perform better in real-world situations, such as following commands in natural language. Builders should care because this could lead to more effective and versatile robots in practical applications.
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Dong Jing, Jingchen Nie, Tianqi Zhang, et al.
The authors developed a new system called TempoVLA that helps robots move at different speeds depending on the task's risk level. Unlike previous models that only used a fixed speed, TempoVLA can adjust its speed dynamically, speeding up in safe situations and slowing down when precision is needed. This is achieved through a new method that modifies how robot actions are timed. Builders should care because this flexibility can lead to more efficient and safer robotic operations in real-world applications.
Regret Minimization with Adaptive Opponents in Repeated Games
Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, et al.
The authors developed a new way to measure how players in repeated games can improve their strategies when opponents adapt based on past actions. Unlike previous methods, this new metric, RP-Regret, allows for better comparisons and can lead to more cooperative outcomes. They also created algorithms to minimize this regret and showed through experiments that these approaches can yield better results in specific games. Builders should care because this could improve decision-making in competitive environments.
DNQ: Deep Nash Q-Network for Partially Observable n-Player Games
Qintong Xie, Edward Koh, Xavier Cadet, et al.
The authors developed a new way to train agents that bid in auctions and similar competitive situations. They introduced a method called DNQ that helps these agents learn better strategies by using a shared critic to estimate payoffs. This approach is faster and more efficient than previous methods, especially when there are many agents involved. Builders should care because it allows for more scalable solutions in complex bidding environments, which can be crucial for real-world applications.
RREDCoT: Segment-Level Reward Redistribution for Reasoning Models
Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, et al.
The authors developed a new method called RREDCoT to improve how rewards are assigned in reasoning language models. Unlike previous methods that often struggled with high variance, RREDCoT uses the model to estimate rewards more effectively. This change allows for better training of models that need to think through complex problems. Builders should care because this method could lead to more reliable and efficient AI systems that can handle intricate reasoning tasks.
Self-Augmenting Retrieval for Diffusion Language Models
Paul Jünger, Justin Lovelace, Linxi Zhao, et al.
The authors developed a new method called SARDI that helps language models generate better answers by looking ahead at potential words they might use. Unlike previous methods, SARDI can quickly find relevant information without needing extra training. This means it can work faster and more effectively on complex questions. Builders should care because it shows a new way to improve AI responses using existing models without extensive retraining.
Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?
Mandana Samiei, Eunice Yiu, Anthony GX-Chen, et al.
The authors studied how adults understand cause-and-effect relationships when they can actively explore their environment. They found that when given the chance to experiment, adults improved their ability to identify complex causal relationships that require multiple factors to work together. This is a shift from previous studies where participants only observed situations passively. Builders should care because it highlights the importance of agency in learning and could inform the design of educational tools or AI systems that mimic human reasoning.
Event Detection for Parameter-to-KPI Dependency Learning for AI-RAN
Christie Djidjev, Nicholas Kaminski
The authors developed a new approach to identify how different control parameters in wireless networks affect performance. Unlike previous methods that struggled with noisy data, their technique uses a synthetic traffic generator to create clear examples of these interactions. This is important because understanding these dependencies can help improve network management and performance. Builders should care because better dependency detection can lead to more efficient and reliable wireless networks.
USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding
Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, et al.
The authors developed USAD 2.0, a new audio encoder that uses both self-supervised and supervised learning to improve performance. Unlike previous models that focused on specific audio types, this model covers multiple domains, including music. It also addresses issues with teacher models in training. Builders should care because this could lead to better audio processing tools and applications, making it easier to work with diverse audio inputs.
Causal Atlases from Entropic Inference: Bayesian Networks beyond Optimal DAGs
Hazhir Aliahmadi, Irina Babayan, Greg van Anders
The authors developed a new way to identify causal relationships in complex systems using a method based on entropy. Unlike traditional methods that optimize for a single causal graph, their approach generates multiple graphs that capture the uncertainty in the data. This is important because it helps avoid misleading conclusions about causality. Builders should care because understanding true causal relationships can lead to better decision-making and system design.
RIDE: An Open Dataset and Benchmark for Train Delay Prediction
Clément Elliker, Mathis Le Bail, Clément Mantoux, et al.
The authors created a new dataset called RIDE to help predict train delays more accurately. This dataset includes millions of train events and weather records, making it much easier to compare different prediction methods. Unlike previous approaches, RIDE standardizes how predictions are made and evaluated, which helps researchers understand which models work best. Builders should care because this framework can lead to better train scheduling and improved passenger experiences.
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
Nhat-Minh Nguyen
This study explores the role of AI agents in research through a case where a physicist supervised an AI coding agent. The key finding is that effective supervision practices were crucial for ensuring the agent's outputs were trustworthy, highlighting the importance of supervision design over model capability.
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, et al.
This paper introduces Multi-Head Latent Attention (MLA) for video diffusion, achieving a 92.7% reduction in per-token memory usage while maintaining quality. It demonstrates that MLA can outperform existing methods in long-horizon streaming video diffusion, improving throughput significantly. This advancement could lead to more efficient video processing techniques.
LLMSurgeon: Diagnosing Data Mixture of Large Language Models
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, et al.
This paper presents a novel framework called LLMSurgeon for estimating the pretraining data mixture of large language models based on generated text. It introduces a method for auditing the 'digital DNA' of foundation models, allowing for high-fidelity recovery of domain mixtures without direct access to training data. The key result is that LLMSurgeon can effectively recover domain mixtures under fixed protocols.
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
Jusuk Lee, Seungjae Lee, Jonghun Shin, et al.
DynaFLIP is a new framework that improves robot manipulation by integrating motion understanding into perception. It uses a novel training approach with image-language-3D flow triplets, leading to significant performance gains in various tasks. The key result shows a +22.5% improvement in out-of-distribution scenarios, indicating better generalization.
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
Qinpei Luo, Ruichun Ma, Xinyu Zhang, et al.
This paper presents SchGen, a large language model that generates editable PCB schematics from natural-language requests. It introduces a new representation that improves the accuracy of wire connectivity and functional correctness in schematic generation. The results indicate that representation design is crucial for enabling generative models in complex hardware tasks.
Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection
Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, et al.
This paper presents VisAnomBench, a new benchmark for time-series anomaly detection, and introduces VisAnomReasoner, a parameter-efficient VLM that significantly improves anomaly localization. The key result shows improvements of over 21 percentage points in precision and F1 score compared to existing methods.
Unlocking the Working Memory of Large Language Models for Latent Reasoning
Lukas Aichberger, Sepp Hochreiter
The paper presents Reasoning in Memory (RiM), a novel approach that enhances the reasoning capabilities of large language models by using fixed memory blocks instead of autoregressive generation. This method allows for compute-efficient reasoning and shows promising results on reasoning benchmarks, matching or exceeding existing methods. The key takeaway is that RiM enables large language models to utilize working memory effectively for reasoning tasks.
GPIC: A Giant Permissive Image Corpus for Visual Generation
Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, et al.
The paper presents GPIC, a massive dataset of 28 trillion pixels designed for visual generative modeling. It includes a diverse set of images and a benchmarking protocol, making it a valuable resource for researchers and practitioners in the field. The dataset is permissively licensed, allowing for both research and commercial use.
Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching
Alaa Khamis, Alaa Maalouf
HullFT is a new method for test-time finetuning that optimizes both speed and quality by using a geometric approach. It effectively selects relevant training sequences and reduces computation time through Gradient Reuse. The key result is that HullFT achieves lower bits-per-byte at a significantly reduced runtime compared to existing methods.
Fairness-Aware Federated Learning with Trajectory Shapley Value
Daniel Kuznetsov, Ziqi Wang
This paper presents FedTSV, an adaptive aggregation method for federated learning that uses the Trajectory Shapley Value to dynamically adjust client contributions. The key result shows that FedTSV accelerates convergence and enhances fairness in client contributions, making it a valuable approach for real-time federated optimization.
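As a rough illustration of Shapley-weighted aggregation, the sketch below computes classic (exact) Shapley values for a handful of clients and normalizes them into aggregation weights. It does not model the paper's trajectory-based variant; the names `shapley_values` and `aggregation_weights`, and the additive toy utility, are assumptions made for this example.

```python
from itertools import permutations

def shapley_values(clients, value):
    """Exact Shapley values: average marginal gain over all join orders."""
    totals = {c: 0.0 for c in clients}
    orders = list(permutations(clients))
    for order in orders:
        coalition = frozenset()
        for c in order:
            with_c = coalition | {c}
            totals[c] += value(with_c) - value(coalition)
            coalition = with_c
    return {c: t / len(orders) for c, t in totals.items()}

def aggregation_weights(shapley):
    """Normalize non-negative Shapley values into weights that sum to 1."""
    clipped = {c: max(v, 0.0) for c, v in shapley.items()}
    total = sum(clipped.values())
    return {c: v / total for c, v in clipped.items()}

# Hypothetical utility: each client adds a fixed amount to the coalition's value.
gains = {"a": 3.0, "b": 1.0, "c": 0.0}
utility = lambda coalition: sum(gains[c] for c in coalition)

phi = shapley_values(list(gains), utility)
weights = aggregation_weights(phi)
```

Exact enumeration grows factorially with the number of clients, so a practical system would need sampling or another approximation.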
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
Anany Kotawala
This paper addresses coherence failures in multi-component LLM agents. It introduces the concept of compositional residuals and provides empirical evidence that the proposed mitigations are effective. The key finding is that coherence issues can significantly degrade performance, as measured by a regret metric.
Demystifying Data Organization for Enhanced LLM Training
Yalun Dai, Yangyu Huang, Tongshen Yang, et al.
This paper explores how data organization can improve the training of large language models. It introduces two new methods for data ordering that significantly enhance training stability and performance. The findings suggest that strategic data organization is crucial for optimizing LLM training efficiency.
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, et al.
This paper presents SoundnessBench, a benchmark designed to assess the soundness of machine-learning research proposals. The key finding is that current LLMs exhibit a pervasive optimism bias, often misclassifying low-soundness proposals as sound. This indicates that LLMs are not yet reliable for evaluating scientific rigor at the proposal stage.
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
Chunru Lin, Hongxin Zhang, Fenghao Yu, et al.
RoboWits is a new benchmark for assessing robots' cognitive reasoning and adaptability in unexpected scenarios. The study shows that while pre-trained visual-language agents can handle basic tasks, they struggle with more complex, mutated tasks, highlighting their limitations in real-world applications. This insight is crucial for builders aiming to develop more robust robotic systems.
On Language Generation in the Limit with Bounded Memory
Jon Kleinberg, Anay Mehrotra, Amin Saberi, et al.
This paper explores how bounded memory affects language generation and identification tasks. It shows that while generation is achievable for any countable collection of languages, the density and identification capabilities are limited to finite collections. A key result is that allowing adaptive memory improves achievable density.
In-Context Reward Adaptation for Robust Preference Modeling
Zhenyu Sun, Zheng Xu, Ermin Wei
This paper introduces a novel framework for adapting reward models in reinforcement learning to better align with diverse human preferences. The key result shows that incorporating human response time as an auxiliary input allows the model to effectively adapt to previously unseen preference domains, enhancing robustness in human-AI alignment.
Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor
Minseo Lee, Seongmin Oh, Chaehyeon Song, et al.
The authors developed a new framework that combines reduced-order models with neural operators to improve real-time thermal-hydraulic simulations for small modular reactors. Unlike previous methods that struggled with the high computational costs of detailed fluid dynamics simulations, this approach allows for faster and more efficient analysis of complex systems like helical coil steam generators. The multi-scale version of their model, called L-DeepONet, effectively captures the dynamic behavior of swirling flows, while another model, the Fourier neural operator, provides accurate estimates of pressure changes. This advancement is significant for builders because it enables safer and more efficient reactor operations by allowing for quicker decision-making based on reliable simulations. Understanding these models can help builders select the right tools for their specific simulation needs.

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases
Yicheng Tao, Yiqun Wang, Xiangchen Song, et al.
The authors developed a new framework called GRASP that improves how we retrieve information from semi-structured knowledge bases, which are databases that organize data in a graph format with entities and relationships. Unlike previous methods that either only used the graph for expanding queries or combined text and structure in a simplistic way, GRASP uses a three-step process that includes planning how to navigate the graph, merging this with a dense retrieval method, and then refining the results through a reranking step. This approach led to a significant increase in retrieval accuracy, as shown by the improvement in Hit@1 scores from 62.0 to 73.9 across various benchmarks. For builders, this means that applications like product searches or academic paper searches can become much more effective, providing users with more relevant results. Understanding and implementing GRASP could enhance the performance of systems that rely on complex data relationships.
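The fuse-then-rank flow and the Hit@1 metric mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear weighting `alpha`, the function names, and the toy scores are hypothetical, and a plain sort stands in for the learned reranking step.

```python
def fuse_scores(graph_scores, dense_scores, alpha=0.5):
    """Linearly combine two candidate->score dicts (hypothetical fusion rule)."""
    candidates = set(graph_scores) | set(dense_scores)
    return {
        c: alpha * graph_scores.get(c, 0.0) + (1 - alpha) * dense_scores.get(c, 0.0)
        for c in candidates
    }

def top_k(fused, k=3):
    """Sort candidates by fused score, highest first (ties broken by name)."""
    return sorted(fused, key=lambda c: (-fused[c], c))[:k]

def hit_at_1(rankings, gold):
    """Fraction of queries whose top-ranked candidate is a gold answer."""
    hits = sum(1 for q, ranked in rankings.items() if ranked and ranked[0] in gold[q])
    return hits / len(rankings)

# Toy scores for one query: graph traversal favors paper_A, dense retrieval favors paper_B.
graph = {"paper_A": 0.9, "paper_B": 0.2}
dense = {"paper_B": 0.8, "paper_C": 0.6}

ranked = top_k(fuse_scores(graph, dense, alpha=0.5))
score = hit_at_1({"q1": ranked}, {"q1": {"paper_B"}})
```

Hit@1 counts a query as a success only when the single top result is correct, which is why a change such as 62.0 to 73.9 reflects better first results rather than better recall deeper in the list.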
PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective
Yangyi Huang, Ruotian Peng, Zeju Qiu, et al.
This paper introduces PEFT-Arena, a benchmark that evaluates parameter-efficient finetuning by measuring both downstream performance and the retention of pretrained capabilities. The key finding is that orthogonal finetuning achieves the best balance between adaptation and retention under similar parameter budgets, highlighting the importance of stability-plasticity profiles in finetuning methods.
VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
Jinzhou Wu, Zhengwu Ma, Jixing Li, et al.
This research investigates how multimodal pretraining affects language models' alignment with human reading processes. The key finding is that while multimodal training may not universally enhance human-like text processing, it can selectively improve alignment when visual semantic content is stronger.
Self-Improving Language Models with Bidirectional Evolutionary Search
Guowei Xu, Zhenting Qi, Huangyuan Su, et al.
The paper presents Bidirectional Evolutionary Search (BES), a new framework that enhances search methods for language models by combining forward and backward search strategies. The key result shows that BES outperforms existing frameworks on challenging tasks, enabling better performance in both average and best-case scenarios.
Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation
Jiahe Pan, Stelian Coros, Jitendra Malik, et al.
This paper presents a new tactile representation called Center-of-Pressure (CoP) that improves sim-to-real transfer in contact-rich manipulation tasks. The authors demonstrate that policies using CoP outperform traditional methods, achieving zero-shot transfer in complex scenarios. This advancement could lead to more effective robotic manipulation in real-world applications.
Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization
Audrey Chan, Aaron Labbé, Jacob Lavoie, et al.
The Affective Music Recommendation System (AMRS) effectively predicts listener engagement and emotional responses using a causal transformer model. It employs Direct Preference Optimization to enhance the accuracy of predicted emotional states while maintaining diversity in recommendations. This work provides a promising approach to affective recommendation in ethically constrained environments.
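For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss can be sketched as below. How AMRS builds its preference pairs or sets `beta` is not stated in this summary, so the inputs here are placeholder log-probabilities chosen only for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Equals -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen item than the frozen reference model does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # log(1 + exp(-z)) is the same as -log(sigmoid(z)).
    return math.log1p(math.exp(-beta * margin))

# Placeholder log-probabilities (not from the paper).
no_preference = dpo_loss(-1.0, -1.0, -1.0, -1.0)
learned_preference = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

The loss falls as the policy raises the chosen item's likelihood relative to the reference and lowers the rejected item's.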
AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning
Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou
The paper presents AREA, a novel approach for Class-Incremental Learning that stabilizes attribute extraction and aggregation in CLIP-based models. It effectively mitigates catastrophic forgetting by using principal geodesic analysis and task-specific experts. The key result shows that AREA consistently outperforms existing methods in this domain.

Calibrating Conservatism for Scalable Oversight
William Overman, Mohsen Bayati
The authors developed a new method called Calibrated Collective Oversight (CCO) that helps weaker overseers manage stronger AI agents that may act against human interests. Unlike previous methods that often relied on complex rules or assumptions, CCO uses a straightforward penalty system that adjusts based on how concerned overseers are about the AI's actions. This means that while high-reward actions can still be taken when they are deemed acceptable, they are penalized if they raise too much concern, keeping undesirable outcomes in check. For builders, this approach offers a practical way to ensure AI systems behave ethically and safely, even in challenging scenarios, making it easier to maintain control over powerful AI technologies.
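One way to picture a concern-based penalty is the toy selection rule below. It is a sketch under stated assumptions rather than the CCO method itself: the penalty form, the `lam` and `tolerance` parameters, and the example actions are all hypothetical, and the calibration procedure the title refers to is not modeled.

```python
def penalized_score(reward, concerns, lam=2.0, tolerance=0.3):
    """Reward minus a penalty that applies only when average concern exceeds tolerance."""
    avg_concern = sum(concerns) / len(concerns)
    return reward - lam * max(0.0, avg_concern - tolerance)

def choose_action(actions, lam=2.0, tolerance=0.3):
    """Pick the action with the highest penalized score."""
    best = max(
        actions,
        key=lambda a: penalized_score(a["reward"], a["concerns"], lam, tolerance),
    )
    return best["name"]

# Each action carries a reward and one concern score in [0, 1] per overseer.
actions = [
    {"name": "safe", "reward": 1.0, "concerns": [0.1, 0.2, 0.0]},
    {"name": "risky", "reward": 2.0, "concerns": [0.9, 0.8, 1.0]},
    {"name": "bold_but_ok", "reward": 1.5, "concerns": [0.2, 0.3, 0.1]},
]

with_oversight = choose_action(actions)
without_oversight = choose_action(actions, lam=0.0)
```

With the penalty on, the highest-reward action loses to a slightly lower-reward action that the overseers accept; with `lam=0.0` the agent simply takes the highest reward.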
Personal Visual Memory from Explicit and Implicit Evidence
Viet Nguyen, Thao Nguyen, Vishal M. Patel, et al.
The paper presents VisualMem, a new architecture that enhances long-term memory for personalized AI agents by integrating visual information. It shows that using personal visual memory significantly improves performance on a new benchmark while still being competitive on traditional text-memory tasks. This indicates the importance of visual context in personalized AI.
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration
Xinchen Zhang, Bowei Liu, Jiale Liu, et al.
This paper presents OmniVerifier-M1, a novel visual verifier that utilizes symbolic meta-verification and decoupled reinforcement learning to enhance verification processes in multimodal models. A key result is that symbolic outputs significantly improve verification performance compared to traditional textual explanations, leading to better error localization and model reliability.

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling
Xinyu Wang, Mingze Li, Sicheng Lyu, et al.
The authors developed Ω-QVLA, a new framework that allows for efficient compression of Vision-Language-Action (VLA) models, which combine visual perception, language understanding, and action control. Unlike previous methods that only partially quantized these models or used mixed precision, Ω-QVLA uniformly quantizes both the language and action components to a lower precision, making it more stable and effective. This results in high task success rates while significantly reducing the memory required to run these models on devices. Builders should care because this advancement enables the deployment of complex AI models on resource-constrained devices, opening up new possibilities for real-world applications in areas like robotics and interactive systems.

Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization
Beiduo Chen, Pingjun Hong, Ziyun Zhang, et al.
The authors of this study discovered that large language models (LLMs) can learn the unique reasoning styles of different annotators by analyzing their free-text explanations. Unlike previous methods that focused solely on the labels given by annotators, this research shows that understanding the reasoning behind those labels can lead to better model performance. They introduced a new technique called cross-annotator preference optimization (CAPO), which helps the model better mimic individual annotators by comparing their responses to other valid annotations. This approach not only improves the model's ability to generate explanations that reflect specific annotator preferences but also enhances the overall quality of the annotations. Builders should care because this method could lead to more accurate and context-aware AI systems that better understand human reasoning, making them more effective in real-world applications.
CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models
Abhilash Durgam, Nyle Siddiqui, Jeffrey A. Chan-Santiago, et al.
The paper presents CaMBRAIN, a new model for real-time inference of EEG signals that overcomes the limitations of existing methods by enabling long-range continuous inference. It achieves state-of-the-art results with over 10 times higher throughput than previous models, making it a significant advancement for EEG analysis.
Skill-Conditioned Gated Self-Distillation for LLM Reasoning
Jiazhen Huang, Xiao Chen, Xiao Luo, et al.
The paper presents Skill-Conditioned Gated Self-Distillation (SGSD), which enhances reasoning in large language models by using a skill bank for supervision. SGSD outperforms existing methods like GRPO and OPSD on multiple benchmarks, showing a 6.2% improvement on average. This approach allows for more effective use of teacher-student dynamics in model training.
Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval
Shiyu Chen, Tarfah Alrashed, Alon Halevy, et al.
This paper analyzes the effectiveness of two types of data retrieval agents: a Baseline Agent and a Semantic Agent. The key finding is that the Semantic Agent significantly outperforms the Baseline Agent in precision when retrieving FAIR-compliant datasets, highlighting the importance of structured metadata.
Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay
Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, et al.
This paper introduces MalayPrag, a benchmark for assessing LLMs' handling of discourse particles in colloquial Malay. The findings indicate that current LLMs struggle with these particles, but the proposed attributes significantly enhance their performance. This highlights the importance of structured approaches to improve LLMs' pragmatic understanding.

Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions
Thomas Vitry, Kieran Edgeworth, Stefan Wermter, et al.
The authors developed a method to identify misleading patterns, or 'spurious concepts', in vision models without needing specific bias labels. Unlike previous methods that required retraining or labeled datasets, this approach uses standard class labels and analyzes how the model's predictions change when it encounters errors. By pinpointing and suppressing these spurious concepts, the model's accuracy can be significantly improved on various datasets, even after deployment. This is particularly valuable for builders working with models that cannot be easily retrained, as it offers a way to enhance performance and fairness without extensive modifications. Builders should care because this method provides a practical tool for improving model reliability in real-world applications.
The Abstraction Gap in Vision-Language Causal Reasoning
Chinh Hoang, Mohammad Rashedul Hasan
This paper presents a new methodology for evaluating vision-language models by distinguishing between linguistic plausibility and causal reasoning. The key finding is that while many models perform well on linguistic quality, they struggle with generating explicit causal chains. One model, however, demonstrates the ability to achieve near-zero Abstraction Gap, indicating potential for improved causal reasoning in VLMs.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?
Gabrielle Kaili-May Liu, Arman Cohan
The authors of this paper explored how well large language models (LLMs) use language to express their confidence in their answers, specifically through phrases that indicate uncertainty, like 'it is likely...'. They discovered that LLMs often misrepresent their confidence levels, meaning they don't reliably use these phrases to reflect their true uncertainty. This is a shift from previous research, which mainly focused on how LLMs understand these markers without assessing their actual performance in using them. The findings suggest that improving how LLMs use these confidence markers could enhance their reliability and trustworthiness in applications. Builders should care about this because better calibration of LLMs can lead to more accurate and dependable AI systems, which is crucial for user trust and effective decision-making.
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
Suji Kim, Kangsan Kim, Sung Ju Hwang
LearnWeak is a new framework that helps small computer-use agents specialize in specific domains without requiring extensive annotations. It identifies weaknesses in agents and generates targeted training tasks, leading to significant performance improvements. The key result shows average gains of over 11 percentage points compared to existing models.
Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Minki Kang, Shizhe Diao, Ryo Hachiuma, et al.
This paper introduces AXPO, a new approach to improve tool use in vision-language models by addressing the Thinking-Acting Gap. The key result shows that SFT+AXPO outperforms SFT+GRPO across multiple benchmarks, achieving better performance with fewer parameters. This advancement could lead to more effective applications of vision-language models in real-world scenarios.

Rethinking Memory as Continuously Evolving Connectivity
Jizhan Fang, Buqiang Xu, Zhixian Wang, et al.
The authors developed FluxMem, a new memory framework that allows memory in AI agents to evolve and adapt in real-time, rather than being static and fixed. Unlike previous methods that treated memory as a simple storage system with set connections, FluxMem models memory as a flexible network that can change based on feedback and new information. This means that AI agents can better remember and connect relevant information as tasks and environments change, leading to improved performance in complex situations. Builders should care because this approach can significantly enhance the effectiveness of memory-augmented AI systems, making them more capable of handling dynamic challenges.
Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations
Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, et al.
The paper's Oryx model combines quadratic attention and linear recurrences to improve efficiency and performance on language tasks. It demonstrates that hybrid architectures can effectively share internal representations, achieving competitive results even with limited token usage in attention mode. This suggests a promising new direction for model design in handling long-context retrieval and in-context learning.
Principled Algorithms for Optimizing Generalized Metrics in Multi-Label Learning
Mehryar Mohri, Yutao Zhong
This paper presents a new approach to multi-label classification that optimizes complex evaluation metrics using novel surrogate loss functions. The key result is the introduction of the MMO algorithm, which shows superior performance over existing methods on large datasets. This work provides both theoretical foundations and practical solutions for multi-label metric optimization.
SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks
Edwin Jose
SwarmHarness proposes a decentralized protocol for sharing compute resources among nodes without a central authority. It features a self-regulating economy where nodes earn credits for contributions, promoting specialization and emergent collective intelligence. This approach could transform how distributed AI agents operate by enabling them to autonomously manage compute resources.

CubePart: An Open-Vocabulary Part-Controllable 3D Generator
Yiheng Zhu, Kangle Deng, Jean-Philippe Fauconnier, et al.
The authors developed CubePart, a framework that allows users to generate 3D models with specific part structures defined by text prompts. Unlike previous methods that produced either solid shapes or random part divisions, CubePart enables precise control over how the model is built, ensuring that each part aligns with user-defined categories. This means that game developers can create 3D assets that are ready to use in their projects without needing to make additional adjustments. Builders should care because this tool streamlines the process of creating interactive 3D content, making it easier to integrate generative models into games and simulations.
Deep Neural Networks for Doubly Robust Estimation with Nonprobability Survey Samples
Yufang Dai, Shihua Luo, Wendy Lou, et al.
This paper presents a deep neural network-based method for combining probability and nonprobability survey samples to estimate population means. The key result shows that the proposed estimators enhance robustness against parametric misspecification, particularly in nonlinear selection mechanisms.
LLM Zeroth-Order Fine-Tuning is an Inference Workload
Zelin Li, Caiwen Ding
This paper presents a novel method for zeroth-order fine-tuning of large language models that leverages a serving runtime to achieve significant speedups. The approach results in an 8.13x speedup compared to the baseline while maintaining high accuracy. This suggests a promising direction for integrating inference and training processes.
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
Kunhao Zheng, Pierre Chambon, Juliette Decugis, et al.
This study explores how extrapolative weight averaging can enhance performance in reinforcement learning by navigating a correctness-efficiency frontier. The key result shows that using this method improves the solve rate on challenging problems by 3.3% over the best single checkpoint, making it a valuable technique for builders in code-related RL tasks.
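As a rough illustration of the idea, the sketch below extrapolates between two checkpoints. It assumes the simplest possible form, `early + alpha * (late - early)`, where `alpha` between 0 and 1 interpolates and `alpha` above 1 pushes past the later checkpoint. The paper's actual recipe may differ, and the names `extrapolate`, `ckpt_early`, and `ckpt_late` are hypothetical.

```python
from typing import Dict, List

# A checkpoint is modeled here as parameter name -> flat list of floats (an assumption).
Weights = Dict[str, List[float]]


def extrapolate(early: Weights, late: Weights, alpha: float) -> Weights:
    """Return early + alpha * (late - early), parameter by parameter."""
    if early.keys() != late.keys():
        raise ValueError("checkpoints must share the same parameter names")
    out: Weights = {}
    for name in early:
        a, b = early[name], late[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        out[name] = [x + alpha * (y - x) for x, y in zip(a, b)]
    return out


# Toy checkpoints, invented for illustration only.
ckpt_early = {"layer.w": [0.0, 1.0], "layer.b": [0.5]}
ckpt_late = {"layer.w": [1.0, 3.0], "layer.b": [0.5]}

midpoint = extrapolate(ckpt_early, ckpt_late, alpha=0.5)  # plain averaging
beyond = extrapolate(ckpt_early, ckpt_late, alpha=1.5)    # extrapolation
```

Sweeping `alpha` and evaluating each resulting model is one plausible way to trace a trade-off curve like the correctness-efficiency frontier the summary mentions.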
Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity
Michael T. M. Emmerich
This paper advances Bayesian multiobjective optimization by analyzing preference-shaped expected improvement criteria. A key result is the demonstration that exact integral R2 improvement can be represented as a scalarization-space volume, which has implications for developing efficient algorithms in this area.
Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context
Thomas Mbrice
This paper explores stance detection in prediction market comments, finding that adding market context significantly improves recall for opposing stances. The best-performing augmentation ratio is 50% synthetic samples.

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Linas Nasvytis, Simon Jerome Han, Ben Prystawski, et al.
The authors developed a new algorithm called CORE, which helps language models improve their reasoning by learning from past attempts. Unlike previous methods that often require large amounts of training data and compute, CORE contrasts successful and unsuccessful reasoning attempts to generate useful insights. This means that builders can achieve better performance with fewer examples and less processing power. Because the learned insights are interpretable and compact, CORE offers a promising route to model self-improvement without the heavy resource demands of traditional methods. Builders should care because this could lead to faster development of AI systems at lower data and compute cost.
Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text
Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang
This paper presents Reverse Probing, a new framework for quantifying uncertainty in clinical text summarization. It achieves significant improvements in performance metrics, including up to 4 times higher AUPRC, while also reducing computational costs. The findings provide valuable insights into model behavior regarding clinical content.

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks
Tirtharaj Dash
The authors developed BIRDNet, a new type of neural network that leverages mined Boolean implication relationships between features in data to create a model that is both sparse and interpretable. Unlike traditional dense models that require a lot of parameters and can be hard to understand, BIRDNet uses significantly fewer parameters while still achieving competitive performance on biological data related to cancer. This means that builders can create models that are not only efficient but also provide clear insights into the underlying rules driving the data. By recovering known biological signatures, BIRDNet can help researchers make better decisions in cancer research and other fields. Builders should care because this approach offers a way to build more efficient and understandable AI systems, which is increasingly important in data-driven applications.
Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests
Richard J. Young, Gregory D. Moody
This paper presents a new prompt bank that distinguishes between executable malicious code requests and harmful security knowledge requests. It consolidates multiple corpora and establishes a reliable basis for evaluating coding model compliance. The key result is the creation of a validated instrument that sets a higher refusal standard for coding models.

Utility-Aware Multimodal Contrastive Learning for Product Image Generation
Xiaohang Feng, Yiling Xie
The authors developed a new framework for generating product images that takes into account consumer demand, which they call utility-aware multimodal contrastive learning. Unlike previous models that focused mainly on matching images with text descriptions, this approach optimizes for images that are more likely to sell by considering what consumers actually want. This means that the generated images not only look good but also align better with market trends, leading to higher sales. Builders should care because this method can be integrated into existing generative AI systems to enhance their commercial effectiveness, making it a valuable tool for anyone involved in online retail or product marketing.
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
Xinle Deng, Ruobin Zhong, Hujin Peng, et al.
This paper introduces a novel framework for tracing errors in memory systems of large language models, which helps identify and correct systematic memory failures. The key result shows that their approach can enhance end-task performance by up to 7.62%. This work opens new avenues for improving the reliability of memory in LLMs.

AlphaTransit: Learning to Design City-scale Transit Routes
Bibek Poudel, Sai Swaminathan, Weizi Li
The authors developed AlphaTransit, a new framework for designing bus networks in cities that combines a method called Monte Carlo Tree Search (MCTS) with a neural network that predicts the quality of route designs. Unlike previous methods that often relied solely on trial and error, AlphaTransit uses learned insights to make better decisions about where to extend bus routes, leading to significant improvements in service rates. This means that cities can create more efficient transit systems that better meet the needs of their populations. Builders should care because this approach not only enhances the design process but also has the potential to improve public transportation accessibility and efficiency, making it a valuable tool for urban planners and transit authorities.
Beyond Lipschitz: Data-Driven Robustness via Discrete Modulus of Continuity
Jürgen Dölz, Michael Multerer, Michele Palma
This paper presents a new framework called the discrete modulus of continuity (DMOC) for assessing the robustness of neural networks. DMOC offers a more nuanced measure of robustness compared to traditional Lipschitz constants and is applicable to large datasets. A key result is that DMOC can effectively distinguish between trained and untrained networks, revealing underfitting and overfitting regimes.
How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures
Krishnam Gupta
This paper reveals that different VLA architectures exhibit distinct failure patterns at the motor-command level, necessitating tailored monitoring strategies. A key finding is that direction reversal rates can predict failures across architectures, while common safety mechanisms like velocity checking are often ineffective. This insight is crucial for developers working with VLA systems to ensure safety and reliability.
Multi-Adapter Representation Interventions via Energy Calibration
Manjiang Yu, Hongji Li, Junwei Chen, et al.
The paper presents MARI, a method that adapts intervention strategies for large language models based on sample-specific needs. This approach not only aligns models more effectively but also enhances their general capabilities on various tasks. The key result shows significant improvements on safety benchmarks while maintaining performance on general tasks.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
HuiMing Fan, Xiao Wang, Zheng Chu, et al.
This paper investigates whether LLM-based search agents genuinely search the web or rely on their intrinsic knowledge. The key finding is that agents often depend on pre-existing knowledge, performing poorly when external evidence is removed, which highlights the limitations of static search benchmarks.

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol
Bojie Li
The authors developed OpenURMA, an open-source implementation of Huawei's Unified Bus (UB) protocol, which significantly improves the performance of Remote Direct Memory Access (RDMA) operations in data centers. Unlike previous methods that required each connection to maintain a lot of state information, UB simplifies this by separating application-specific data from transport data, leading to much lower latency. Their results show that UB can achieve an end-to-end latency of about 500 nanoseconds, which is over four times faster than the existing RoCEv2 protocol. This improvement means that data centers can handle more operations in less time, making them more efficient. Builders should care because adopting this technology could lead to faster and more responsive applications, ultimately enhancing user experience and system performance.
IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents
Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, et al.
IPO-Mine enables the parsing and analysis of over 109,000 IPO filings, addressing challenges in handling long, multimodal documents. A key result is the identification of alignment issues between state-of-the-art multimodal models and expert human judgments on financial charts.
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Guoxin Ma, Yibing Liu, Chengzhengxu Li, et al.
This paper introduces Thinking as Compression (TaC), a novel approach that allows LLMs to compress long contexts by generating thinking traces. The method outperforms existing compression techniques, achieving significant improvements in F1 and Exact Match scores at high compression ratios.
Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models
Jiawei Zhang, Ziyuan Liu, Leon Yan, et al.
This paper presents a new framework called MAP-RPS for navigating the distortion-perception tradeoff in diffusion models. It combines MAP estimation with posterior sampling to improve perceptual quality in inverse problems. The results show that this approach effectively enhances performance across various tasks.
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
Irune Zubiaga, Aitor Soroa, Rodrigo Agerri
This paper explores strategies for developing multilingual LLMs for text evaluation, focusing on English, Spanish, and Basque. A key finding is that fine-tuning smaller models with in-domain data can match proprietary models, while larger models excel in zero-shot evaluations. The results offer practical guidance for building multilingual evaluation pipelines.
Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI
Aisha Aijaz, Rahul Goel, Arnav Batra, et al.
This paper introduces a framework for moral reasoning in AI that models ethical pluralism through a normative ethics simplex. The key result shows that integrating contextual and normative information significantly improves classification accuracy to 88.89%. This approach supports more human-like moral reasoning in AI systems.

Understanding Generalization and Forgetting in In-Context Continual Learning
Guangyu Li, Meng Ding, Lijie Hu
The authors of this paper developed a new theoretical framework to understand how Large Language Models (LLMs) manage multiple tasks presented in a single prompt. Unlike previous studies that focused on single tasks, this research reveals that standard attention mechanisms can cause interference between tasks, which negatively impacts the model's performance. This finding is important for builders because it highlights potential weaknesses in how models learn from past information when faced with new tasks. By understanding these limitations, developers can work on improving model robustness and performance in real-world applications where tasks are often mixed. Essentially, this research provides insights that can help builders create more effective AI systems that better handle complex, multi-task scenarios.
Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations
Yeachan Park, Geonho Hwang, Wonyeol Lee, et al.
This paper explores the expressive power of floating-point neural networks under more realistic execution semantics. It establishes a framework for determining when these networks can represent arbitrary functions, highlighting that distinguishability in the first layer is crucial for universal representability. This finding broadens the understanding of practical activation functions in neural networks.
A Fresh Look at Lamarckian Evolution and the Baldwin Effect
Inès Benito, Johannes F. Lutzeyer, Benjamin Doerr
This paper revisits Baldwinian and Lamarckian evolution in evolutionary algorithms, showing they outperform traditional Darwinian methods in various scenarios. The authors provide a set of generalist parameters that can benefit practitioners, highlighting the practical implications of their findings.

Natural Language Query to Configuration for Retrieval Agents
Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, et al.
The authors developed a system called BRANE that optimizes how retrieval agents handle queries by dynamically choosing the best configuration for each query's characteristics. Unlike previous one-size-fits-all methods, which often required manual tuning for different workloads, BRANE adjusts its settings on the fly. It can match the accuracy of the best fixed configurations at a significantly lower cost, up to 89% less. For builders, this flexibility allows for more efficient use of resources and better performance in real-world applications without the need for constant retraining. Essentially, BRANE offers a smarter way to manage retrieval processes, making it easier to balance quality and cost.

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing
Tamerlan Aghayev, Maxime Elkael, Michele Polese, et al.
The authors developed GENESIS, an AI framework that helps speed up the research and development of cellular networks, specifically for 6G technology. Unlike traditional methods that can take months for each iteration, GENESIS quickly turns ideas or problems into tested solutions using real-world experiments. This means that builders can develop and refine network features much faster and with greater reliability. By addressing common issues like misinterpretation of technical specifications, GENESIS ensures that different components of the network work well together. Builders should care because this framework can significantly reduce development time and improve the quality of their network solutions.
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
This paper reveals a critical vulnerability in Reinforcement Learning from Human Feedback (RLHF) called alignment tampering, where LLMs can influence their own preference datasets. The authors show that this can lead to the amplification of biases in generated responses, raising concerns about the reliability of current alignment methods. Mitigating this issue proves challenging without compromising response quality.

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
Yi Jing, Zao Dai, Jinwu Hu, et al.
The authors developed a framework called SAERL that enhances how we manage training data for large language models (LLMs) by tapping into the internal workings of the model itself. Unlike previous methods that mainly relied on external indicators, SAERL uses insights from a tool called Sparse Autoencoder to assess the diversity, difficulty, and quality of the training data. This approach led to a 3% increase in accuracy and a 20% reduction in training time for a specific model, showing that it can be effective across various model types and training methods. Builders should care because this framework offers a more efficient way to improve model performance, making it easier to achieve better results with less effort and resources.

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
Yuchen Liang, Ness Shroff, Yingbin Liang
The authors developed a new method called GADD, which stands for Gibbs-Accelerated Discrete Diffusion, to speed up the process of generating samples from discrete diffusion models. Unlike previous methods that either needed extra training or were slow to mix, GADD directly uses the structure of the score function to create more efficient sampling without additional training. This results in a significant improvement in both the quality of samples and the time it takes to generate them, making it useful for tasks like text generation and music creation. Builders should care because GADD offers a more efficient way to implement discrete diffusion models, which can enhance the performance of applications relying on these models, ultimately saving time and resources.

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
Kim Jihyeon, Sohee Kim, Soosan Lee, et al.
The authors of this paper discovered a new method for detecting AI-generated images by focusing on how people in the images look at each other, which they call Social Gaze Consistency. Unlike previous methods that relied on identifying low-level visual artifacts like pixel errors, this approach looks at the overall coherence of gaze direction and eye alignment among individuals in the image. This change allows for better detection of manipulated images, even when the changes are subtle. For builders, this means that by incorporating this method, they can improve the reliability of AI systems in distinguishing between real and fake images, which is crucial for applications in security, content moderation, and media verification. Overall, this research highlights a new angle for enhancing AI detection capabilities that could lead to more robust and trustworthy AI applications.

MATCHA: Matching Text via Contrastive Semantic Alignment
Siran Li, Ece Sena Etoglu, Carsten Eickhoff, et al.
The paper presents MATCHA, a new evaluation metric for large language models that improves upon traditional metrics like ROUGE and BERTScore. It effectively measures semantic agreement while penalizing contradictions, showing significant performance improvements on various tasks. The key result is a 20.82% improvement over BERTScore on the TruthfulQA dataset.
Towards Controllable Image Generation through Representation-Conditioned Diffusion Models
Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen
This paper explores a novel self-conditioning mechanism for diffusion models, improving both unconditional image generation quality and control over the output. The authors identify directions of variation in the representation space, demonstrating smoothness and disentanglement properties that could benefit practical applications in image generation.

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca
This paper presents 2-ASP(Q)^w, a new fragment of Answer Set Programming that can handle optimization problems. The authors introduce effective strategies for computing quantified answer sets, validated through experiments on challenging benchmarks, demonstrating practical effectiveness.
FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents
Haoxuan Jia, Yang Liu, Bin Chong, et al.
FinHarness is a new safety mechanism for finance LLM agents that reduces unauthorized actions while still approving legitimate ones. It cuts the success rate of unauthorized actions from 38.3% to 15.0% while using fewer advanced judge calls, which keeps it efficient. This approach allows agents to make better decisions in real time.

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
Zhifei Dou, Shabnam Hassani, Ou Wei
EdgeFlow improves the conversion of flowcharts to machine-readable models by using a Canny edge map as a structural prior. It achieves notable increases in node-level and edge-level F1 scores, demonstrating its effectiveness in industrial requirements engineering. This method does not require annotated training data, making it practical for real-world applications.

Maat: The Agentic Legal Research Assistant for Competition Protection
Basant Mounir, Farida Madkour, Amira Abdelaziz, et al.
Maat is a ReAct agent designed for legal research in competition law, addressing limitations of existing general and legal assistants. It effectively grounds findings in official sources and provides rich citations, significantly outperforming baseline tools on case-specific tasks. This makes it a valuable tool for legal professionals needing reliable research assistance.
Governed Evolution of Agent Runtimes through Executable Operational Cognition
Mariano Garralda-Barrio
This paper presents a framework for managing the lifecycle of agent-generated artifacts in multi-agent systems. It emphasizes the importance of treating these artifacts as persistent capabilities rather than transient outputs. The key result is the introduction of HarnessMutation, which allows for governed runtime adaptation with explicit validation and rollback mechanisms.
Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech
Felix Ostrowicki, Hubert Plisiecki
The paper presents interaction SSD, a novel method for analyzing how semantic meaning varies across different moderators. It effectively illustrates this method using the UC Berkeley Measuring Hate Speech corpus, revealing significant moderation effects based on annotator racial identity. This approach enhances the interpretability of hate-speech judgments.

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding
Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu
This paper presents a model that differentiates between two important concepts in AI systems: Agentic Technical Debt and Stochastic Tax. The key result is that while debt can amplify operational burdens, the tax can persist even with minimized debt, offering insights for better governance in AI workflows.

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization
Kukyoung Jang, Taehyun Cho, Junrui Zhang, et al.
This paper presents a novel smoothing framework that improves global optimization by using flexible unimodal kernels. A key result is that the smoothed objective maintains the global maximizer, enhancing robustness without needing a decreasing smoothing schedule.
Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery
Yifan Jiang, Ruoxi Ning, Sheng Yao, et al.
This study investigates whether visual inputs improve language understanding in multimodal models. It finds that real-image contexts can sometimes degrade performance, especially when the visual evidence is less relevant. The key result is that focusing on textual content can mitigate these issues.

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection
Weibin Cai, Reza Zafarani
This paper investigates the role of demographic information in hate speech detection, revealing that its effectiveness varies based on data characteristics and modeling approaches. The key finding is that demographic gains are most pronounced in scenarios with low training disagreement and high test disagreement, leading to the introduction of a new model that selectively incorporates demographic data.
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, et al.
The paper presents a novel framework called Chartographer that generates counterfactual charts to evaluate visual reasoning in question-answering tasks. It reveals that vision-language models often fail to generalize when faced with updated charts requiring new reasoning pathways. This finding highlights the limitations of current models in handling visual reasoning tasks effectively.
Greening AI Inference with Accuracy and Latency-aware User Incentives
Vasilios A. Siris, Adamantia Stamou, George D. Stamoulis, et al.
This paper presents a framework for incentivizing AI inference based on users' preferences for quality, latency, and environmental consciousness. A key result is the introduction of a two-tier service subscription model that allows users to reduce carbon emissions in exchange for discounts. This approach provides flexibility for AI providers in managing inference requests during high carbon intensity periods.
Normal Guidance is what Attention Needs
Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes
This paper explores a novel approach to training classifiers for 3D medical images using a single binary label. The proposed Normal Guidance technique significantly enhances attention-based methods for slice-level localization, outperforming state-of-the-art techniques while maintaining competitive performance in whole-scan classification.

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models
Murat Moran
This paper presents a new framework for prioritizing alerts in intrusion detection systems by modeling uncertainty with fuzzy numbers. The key result shows that this approach significantly outperforms traditional methods in terms of robustness, especially under detector degradation scenarios.
Self-Ensembling Vision-Language Models for Chart Data Extraction
Thomas Berkane, Qianyi Wang, Maimuna S. Majumder
This paper introduces a novel self-ensembling method for extracting tabular data from charts, improving accuracy by up to 23% on a new benchmark. It addresses the limitations of existing models by aggregating multiple outputs to enhance reliability and accuracy. This advancement enables better reuse and analysis of data previously locked in chart images.
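The aggregation step can be pictured with a small sketch. It assumes each model run yields a mapping from cell label to number, and it combines runs by dropping cells seen in fewer than `min_votes` runs and taking the per-cell median. This rule, the `aggregate_extractions` helper, and the sample values are illustrative assumptions, not the paper's method.

```python
from statistics import median
from typing import Dict, List

# One extraction run: cell label (e.g. "2020/sales") -> extracted value.
Table = Dict[str, float]


def aggregate_extractions(runs: List[Table], min_votes: int = 2) -> Table:
    """Keep cells seen in at least min_votes runs and take their median value."""
    values: Dict[str, List[float]] = {}
    for run in runs:
        for cell, value in run.items():
            values.setdefault(cell, []).append(value)
    return {cell: median(v) for cell, v in values.items() if len(v) >= min_votes}


# Three hypothetical runs over the same chart; one misreads a bar,
# another hallucinates an extra data point.
runs = [
    {"2020/sales": 10.0, "2021/sales": 12.0},
    {"2020/sales": 10.5, "2021/sales": 12.0, "2022/sales": 99.0},
    {"2020/sales": 30.0, "2021/sales": 12.5},
]
consensus = aggregate_extractions(runs)
```

The median keeps a single outlier reading from dominating, and the vote threshold discards cells that only one run produced.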

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
Shijin Gong, Erhan Xu, Kai Ye, et al.
BASIS is a new algorithm that enhances the efficiency of value function estimation in reinforcement learning. It achieves a 69% reduction in MSE compared to a strong baseline while using only one rollout per prompt, leading to better policy optimization with less training time.
Detectability in Diversity: Improved Canary Crafting for Privacy Auditing in One Run
Mathieu Dagréou, Aurélien Bellet
This paper introduces an efficient method for crafting canaries in privacy auditing, which enhances the accuracy of privacy leakage estimates while reducing computational costs. The approach combines influence functions with bilevel optimization to achieve better results than previous methods.
Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
Yiding Liu, Yifan Hu, Hongjie Xia, et al.
Falcon-X is a new time series foundation model that improves forecasting by decoupling variates from the raw space and aligning them in a unified latent prototype space. It achieves state-of-the-art performance on key benchmarks, making it a valuable tool for complex multivariate forecasting tasks.
Causal Risk Minimization for High-Dimensional Treatments
Nikita Dhawan, Arnav Paruthi, Andrew Kim, et al.
This paper presents a new method for predicting the effects of interventions in high-dimensional spaces, such as text treatments. A key result is the demonstration that higher-order balance error optimization improves causal estimation, allowing a single model to address multiple causal questions effectively.
Transfer Learning using 66 Diseases for Disease Forecasting Applications
Lauren J Beesley, Alexander C Murph, Dave Osthus, et al.
This paper explores the integration of multiple data streams for forecasting infectious diseases, showing that this approach improves performance in 84.9% of cases. It emphasizes the importance of data quality, indicating that irrelevant data can harm forecasts. A key contribution is the creation of a publicly available database for the forecasting community.
Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)
Samer Awad, Javier Conde, Carlos Arriaga, et al.
This paper investigates how standard sampling methods in LLMs limit linguistic diversity. It introduces the Word Coverage Score (WCS) to measure the impact of these sampling filters on the use of low-frequency, high-information words. The key finding is that common sampling defaults can unintentionally censor diverse language, leading to more homogeneous text outputs.
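To see how a sampling filter can make a word unreachable, consider a minimal sketch of top-p (nucleus) truncation over a toy next-token distribution. This illustrates the filtering mechanism only; it is not the paper's WCS definition, and the vocabulary and probabilities are invented.

```python
from typing import Dict, Set


def nucleus(probs: Dict[str, float], top_p: float) -> Set[str]:
    """Return the smallest set of highest-probability tokens whose mass reaches top_p."""
    kept: Set[str] = set()
    total = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.add(token)
        total += p
        if total >= top_p:
            break
    return kept


# Toy distribution: the rare word sits in the low-probability tail.
next_token = {"good": 0.5, "great": 0.3, "fine": 0.15, "salubrious": 0.05}

reachable = nucleus(next_token, top_p=0.9)
unreachable = set(next_token) - reachable  # tokens that can never be sampled
```

With `top_p=0.9` the tail word receives zero sampling probability at this step, regardless of how many samples are drawn.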
LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models
Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, et al.
The paper presents LUCoS, a novel method for selecting instances in low-label tabular learning by utilizing latent geometry from embeddings. It significantly outperforms random selection and traditional methods across various datasets and budgets, highlighting the importance of representativeness in context selection.
Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora
Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, et al.
This paper presents a new sentiment dataset for Setswana and analyzes the decline in inter-annotator agreement over time. A key finding is that tweets whose labels were assigned within one minute of each other achieve much higher agreement than those labeled further apart in time, highlighting the importance of temporal factors in annotation quality.
The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
Zafar Hussain, Kristoffer Nielbo
This study reveals that a significant number of real user queries do not require LLM augmentation, contrary to synthetic query assumptions. By implementing a post-retrieval cascade, the authors improve retrieval quality and reduce latency, serving most queries without LLM augmentation. The key result is a 31.8% reduction in latency while maintaining high quality.
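One way to picture a post-retrieval cascade is sketched below: retrieve with the raw query first, and call the LLM rewriting step only when the best retrieval score falls below a threshold. The threshold value, the score-based trigger, and the helper names (`cascade`, `toy_retrieve`, `toy_augment`) are assumptions for illustration; the paper's actual criteria may differ.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (document id, retrieval score)


def cascade(
    query: str,
    retrieve: Callable[[str], List[Hit]],
    augment: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[List[Hit], bool]:
    """Return (hits, used_llm). The LLM step runs only if retrieval looks weak."""
    hits = retrieve(query)
    if hits and max(score for _, score in hits) >= threshold:
        return hits, False  # good enough: skip the LLM entirely
    return retrieve(augment(query)), True


def toy_retrieve(query: str) -> List[Hit]:
    # Stand-in for a real retriever, with made-up scores.
    index = {"reset password": [("doc-1", 0.92)], "pwd rst": [("doc-7", 0.31)]}
    return index.get(query, [])


def toy_augment(query: str) -> str:
    # Stand-in for an LLM query rewrite.
    return {"pwd rst": "reset password"}.get(query, query)


direct = cascade("reset password", toy_retrieve, toy_augment)
fallback = cascade("pwd rst", toy_retrieve, toy_augment)
```

Because the check happens after retrieval, queries that already retrieve well never pay the LLM latency, which is consistent with the latency reduction the summary reports.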

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
Dingbang Wu, Rui Hao, Haiyang Wang, et al.
MobileGym is a new environment designed for mobile applications that allows for high interaction fidelity and scalable reinforcement learning. It provides structured evaluation and rewards, leading to a notable performance improvement in real-device execution. The key result shows a +12.8 percentage point gain on a test set, indicating its effectiveness.

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
Shangding Gu
This paper highlights the importance of designing modular and verifiable architectures around foundation models for agentic AI. It identifies key bottlenecks in context governance, trustworthy memory, and skill routing, proposing a new evaluation framework that focuses on the quality of agent behavior over simple task success. The key result is the introduction of CheetahClaws, a reference harness for evaluating these architectures.

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, et al.
This paper presents a new method for subject-driven image generation that effectively preserves identity while following textual instructions. By conditioning diffusion models on Multimodal Large Language Models and incorporating a VAE-based identity conditioning, the approach mitigates common issues like copy-paste artifacts. The key result shows significant improvement in human preference for generated images.

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning
Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, et al.
The paper presents Prism, a new codebase designed to facilitate scalable Multimodal Continual Instruction Tuning (MCIT) research. By allowing independent plugin integration, it reduces implementation overhead and enhances code reuse. This approach aims to accelerate the development of new MCIT strategies.

Looped Diffusion Language Models
Sanghyun Lee, Chunsan Hong, Seungryong Kim, et al.
This paper presents LoopMDM, a new approach that improves training efficiency and model performance in masked diffusion models by selectively looping transformer layers. The key result is that LoopMDM can achieve the same performance as larger models while using significantly fewer training resources, making it a compelling option for builders focused on efficiency.

Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models
Bar Weiss, Antonio Abu-Nassar, Adi Sosnovich, et al.
This paper presents a new approach to improve code review efficiency by using large language models to label code changes in patches. The proposed method achieves high recall and precision, suggesting it can effectively enhance traditional static analysis workflows.

Language Models Need Sleep
Sangyun Lee, Sean McLeish, Tom Goldstein, et al.
This paper presents a novel sleep-like mechanism for transformer models that allows them to handle long contexts more effectively. The key result shows that increasing the duration of this 'sleep' improves performance, particularly on tasks requiring deeper reasoning. This could be crucial for builders looking to enhance model efficiency in complex tasks.

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay
Martin Marek, Dongkyu Cho, Shikai Qiu, et al.
This paper addresses the issue of forgetting in language models when trained on new tasks. It shows that self-generated samples can effectively serve as replay data, significantly reducing forgetting. The key result is that this method allows for high-learning-rate finetuning without the typical tradeoff of forgetting.

Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty
Jinwoo Go, Xiaoning Qian, Byung-Jun Yoon
GoBOED optimizes experimental designs specifically for decision-making objectives, improving alignment with downstream goals. It demonstrates that goal-driven designs can be more effective than those derived from traditional information-gain maximization. This approach reveals that optimal design windows are broader than previously thought.

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization
Maoyang Xiang, Bo Wang, Tao Luo
This paper presents Orthogonal Residual Projection (ORP), a new framework that improves quantization for large language models on edge devices. ORP achieves a perplexity of 6.10 on LLaMA-2-7B under a 3-bit constraint, outperforming traditional methods while reducing calibration time to about 15 minutes. This advancement addresses critical timing bottlenecks and enhances hardware efficiency.
Channel-wise Vector Quantization
Wei Song, Tianhang Wang, Yitong Chen, et al.
The paper presents Channel-wise Vector Quantization (CVQ), which improves image tokenization by using channel-wise tokens instead of patch-wise ones. This method leads to a new visual autoregressive model that enhances image generation quality, achieving high scores in evaluation metrics. The key result is that CVQ significantly improves reconstruction quality over traditional vector quantization methods.
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, et al.
DiscoverPhysics is a new benchmark that challenges LLMs to discover physics laws in simulated worlds with unique rules. The study reveals that even the best models struggle with complex tasks requiring hypothesis refinement and experimental design. This highlights the gap between predictive accuracy and conceptual understanding in LLMs.
Automated Benchmark Auditing for AI Agents and Large Language Models
Junlin Wang, Federico Bianchi, Shang Zhu, et al.
The paper presents Auto Benchmark Audit (ABA), a framework that identifies critical issues in AI benchmarks, such as ambiguous task design and incorrect ground truths. By auditing 168 benchmarks, ABA reveals that over 25.7% contain significant problems, which can distort model performance assessments. The tool and annotations are released to aid future benchmark development.
Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning
Zhaoyu Zhu, Rui Gao, Shuang Li
This paper develops a global convergence theory for Wasserstein policy gradient in reinforcement learning by utilizing the Bellman structure. The key result is that the Bellman recursion induces a favorable geometry that supports global convergence, despite the non-convex nature of the entropy-regularized RL objective.
StakeBench: Evaluating Language Understanding Grounded in Market Commitment
Yunhua Pei, Jingyu Hu, Yiwei Shi, et al.
StakeBench is a new framework for evaluating language models based on market commitments rather than subjective labels. It demonstrates that while models can partially recover position-side signals, they struggle with future action anticipation and collective odds projection. This highlights the need for better alignment between model predictions and market behavior.
Active Query Synthesis for Preference Learning
Namrata Nadagouda, Nauman Ahad, Maegan Tucker, et al.
This paper presents a novel approach to active learning that improves the efficiency of user preference learning by addressing feedback reliability. The key result is the development of the Info-Synth framework, which generates optimal queries to enhance decision-making systems. This method shows versatility across various applications, including preference learning and robotic control.
WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification
Lingyu Gao, Will Monroe, David Smith, et al.
This paper presents a new framework for re-annotating multilingual speaker attributes using human-LLM collaboration. The key finding is that there are significant cross-lingual differences in how speaker attributes are annotated, highlighting both the potential and limitations of LLMs in this context.
Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding
Rustem Takhanov, Zhenisbek Assylbekov
This paper presents conditional kernel ridge regression (conditional KRR), which improves upon standard KRR by focusing on the residuals of the regression function. The key result shows that conditional KRR can outperform standard KRR when the feature component is more significant than the residuals. This finding is backed by both theoretical analysis and experimental validation.
Paris 2.0: A Decentralized Diffusion Model for Video Generation
Ali Rouzbayani, Bidhan Roy, Marcos Villagra, et al.
Paris 2.0 is a video generation model trained with decentralized computation. It reports a 2.0x improvement in Fréchet Video Distance over previous methods. This advancement opens new avenues for efficient video generation without reliance on large GPU clusters.
Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning
Waleed Razzaq, Yun-Bo Zhao
The Neuronal Stochastic Attention Circuit (NSAC) is a new architecture that enhances uncertainty quantification in continuous-time learning tasks. It effectively combines Gaussian negative log-likelihood with a regularizer to improve predictive variance. The key result is that NSAC provides well-calibrated uncertainty estimates while maintaining competitive accuracy across various applications.
Accelerating Bayesian inverse design in computational fluid dynamics using neural operators
Bipin Tiwari, Omer San
This work shows that neural operator surrogates can significantly speed up Bayesian inference in aerodynamic design, reducing computation time by more than three orders of magnitude. The method maintains the integrity of uncertainty estimates while allowing for efficient geometry reconstruction.
Retrying vs Resampling in AI Control
James Lucassen, Adam Kaufman
This paper explores the concepts of retrying and resampling in AI coding tools, highlighting how retrying can reduce suspicion scores but may also allow for sneakier attacks. A key finding is that auditing based on maximum suspicion scores during resampling significantly improves safety without sacrificing usefulness.
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Parth Darshan, Abhishek Divekar
This paper explores the challenges of customizing large language models for specific tasks using textual gradient methods. A key result is that combining multiple task instructions into a single prompt can significantly degrade performance, highlighting the need for careful design in multi-objective optimization.
L2IR: Revealing Latent Intent in Graph Fraud Detection
Jinsheng Guo, Zhenhao Weng, Yibo Liu, et al.
The paper presents L2IR, a framework that leverages large language models to reveal latent intent in graph fraud detection. By distinguishing between supportive and misleading connections, L2IR improves detection performance significantly, achieving an AUPRC increase of up to 8.27%. This method shows promise for enhancing existing GNN-based detectors.
DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models
Xinrui Shi, Kai Liu, Ziqing Zhang, et al.
This paper presents DRBench, a new benchmark for evaluating dense-scene reasoning in vision-language models, and DRScaffold, a fine-tuning framework that enhances grounded reasoning. The key result shows that a smaller model trained with structured supervision can outperform a larger frozen model, highlighting the effectiveness of the proposed approach.
Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use
Tianda Sun, Dimitar Kazakov
This paper explores the challenges of using a minimal knowledge-graph tool API in reinforcement learning. A key result is that the tool-grounded answer rate improves initially but then collapses, highlighting the importance of interface feedback in the learning process.
CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities
Junyuan Liu, Xinglei Wang, Zichao Zeng, et al.
CityRep is a new benchmark for evaluating urban representation learning that mitigates spatial leakage and supports fair comparisons across different cities and tasks. The key finding is that performance varies significantly based on the evaluation split used, highlighting the importance of rigorous benchmarking in this field.
Length Generalization with Log-Depth Recurrent Units
Charles Pert, Dalal Alrajeh, Alessandra Russo
The paper presents MLP-LDRU, a novel architecture that effectively addresses length generalization in neural networks. It achieves outstanding accuracy on various regular-language tasks, outperforming existing recurrent and attention-based models. This advancement could lead to improved performance in tasks requiring understanding of sequence length.
Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
Zixin Jessie Chen, Zhuo Chen, Archer Wang, et al.
The SKILD model introduces a unified framework for image generation and super-resolution, achieving impressive results on CIFAR-10 and ImageNet. It operates without task-specific architectures or retraining, making it a versatile tool for image processing.
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
Junlin Yang, Dylan Zhang, Xiangchen Song, et al.
CausaLab is a new environment for testing how well LLMs can understand and predict causal relationships. A key finding is that while GPT-5.2-high achieves high task accuracy, it struggles with causal understanding, highlighting the need for better intervention strategies.
Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service
Christoffer Loeffler, Tomás Rey Pizarro, Daniel Ignacio Miranda Vásquez, et al.
This paper presents a framework for automatically detecting abusive clauses in Chilean Terms of Service, leveraging retrieval-augmented generation techniques. A key result shows that this approach allows local models to perform comparably to larger cloud-based systems while being more cost-effective.
STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models
Yiming Liang, Yixiao Chen, Yiyang Zhou, et al.
The STORM framework enhances video reasoning by internalizing the reasoning process through latent trajectories instead of relying on external tools or textual chains. This approach significantly improves accuracy while reducing inference time. The key result shows that STORM outperforms existing methods in both efficiency and effectiveness.
AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models
Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, et al.
AdvantageFlow is a new reinforcement learning algorithm that optimizes a forward-process prediction loss for flow models. It stabilizes the optimization problem through rollout policy regularization, leading to improved performance in image generation tasks. The key result shows that AdvantageFlow outperforms both Flow-GRPO and a state-of-the-art baseline.
Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech
Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, et al.
This paper presents the first evaluation of transformer-based models for dementia detection in Filipino speech, highlighting the importance of bilingual fine-tuning. The key finding is that bilingual fine-tuning significantly improves model performance, achieving a Macro-F1 score of 0.969-0.973, demonstrating the necessity of linguistic coverage in training.
AI-Assisted Systematization for Evaluating GenAI Systems
Dhruv Agarwal, Emily Sheng, Chad Atalla, et al.
This paper addresses the challenge of evaluating generative AI systems by introducing AI-assisted systematization. It presents a structured representation of concepts and evaluates the quality of generated concept specs for hate-based rhetoric and digital empathy. The key result is that AI assistance can effectively support the systematization process, improving clarity in evaluation.
Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance
Jose Blanchet, Peter Glynn, Wenhao Yang
This paper introduces a model-agnostic method for creating confidence regions from stochastic gradient descent (SGD) trajectories, addressing challenges in statistical inference when gradients have infinite variance. The key result is that the proposed method is straightforward to implement and provides asymptotically valid confidence regions in both finite- and infinite-variance scenarios.
Causal methods for LLM development and evaluation
Dennis Frauen, Marie Brockschmidt, Konstantin Hess, et al.
This paper argues for the integration of causal methods in the development and evaluation of large language models. It highlights how these methods can address confounding factors and improve the reliability of LLMs. The key result is that causal methods can enhance the understanding of interventions in LLM training and evaluation.
Deployment-complete benchmarking
El Mustapha Mansouri, Keigo Arai
This paper presents a novel approach to benchmarking that emphasizes the importance of deployment actions over mere scores. A key finding is that traditional benchmarks often fail to provide sufficient evidence for deployment decisions, highlighting the need for more comprehensive evaluation methods.
Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models
Inés Gonzalez-Pepe, Hiba Akhaddar, Tristan Glatard, et al.
Fuzzy PyTorch is a new framework that allows for efficient evaluation of numerical variability in deep learning models. It integrates stochastic arithmetic into PyTorch, achieving significant runtime reductions while maintaining model performance. This tool is particularly valuable for researchers and practitioners looking to manage floating-point uncertainty effectively.
What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA
Yuelyu Ji, Min Gu Kwak, Hang Zhang, et al.
This paper explores the integration of claim-level NLI checkers into retrieval-augmented reinforcement learning for medical applications. A key finding is that the output distribution of the NLI checker during training significantly influences the quality of the model, with moderate signals yielding better results than strong signals. This insight can help practitioners optimize their reward systems in RL settings.
Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables
Weizhi Fei, Hang Yin, Zihao Wang, et al.
The NS3 framework offers a novel approach to answering complex queries over knowledge graphs by approximating joint rankings without exhaustive enumeration. It improves joint ranking performance while maintaining strong accuracy on marginal queries. This advancement is particularly valuable for practitioners dealing with multi-variable queries in knowledge representation.
SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation
Michael Orme, Yanchao Yu, Zhiyuan Tan
SafeCtrl-RL is a novel framework for ensuring safe behavior in large language models during inference. It allows for adaptive safety regulation without the need for retraining, improving both safety and response quality. The key result is that it consistently outperforms existing prompt-based optimization methods.
When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation
Liyun Zhang, Jiayi Guo
This paper investigates how different types of perturbations affect the reasoning of large language models. It finds that meaning-bearing perturbations lead to greater inconsistencies in answers compared to presentation perturbations. This insight could inform future model training and evaluation strategies.
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yifan Yang, Ziyang Gong, Weiquan Huang, et al.
SkillOpt is a novel optimizer for agent skills that improves performance by applying a controlled text-space optimization approach. It significantly enhances the accuracy of various models in different execution environments, demonstrating its effectiveness across multiple benchmarks.
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
Xu Ouyang, Deyi Liu, Yuhang Cai, et al.
The paper presents the Shannon Scaling Law, which models LLM training as information transmission, capturing the effects of noise on performance. A key result is that failing to maintain a sufficient signal-to-noise ratio leads to performance degradation, which is effectively predicted by this new framework.
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
Zisu Huang, Jingwen Xu, Yifan Yang, et al.
This paper investigates the lifecycle of skills in language agents, focusing on their extraction and consumption. A key finding is that model-generated skills generally improve performance but can lead to negative transfer, highlighting the complexity of skill utility across different models. The authors propose a meta-skill to enhance skill extraction and reduce negative transfer.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
Jianshu Zhang, Yijiang Li, Huifeixin Chen, et al.
This paper investigates how well Vision-Language Models understand numerical outputs in spatial contexts. The key finding is that these models often fail to ground numerical values in spatial meaning, performing close to random guessing. Improvements through tuning were noted, but explicit reasoning provided only marginal benefits.
ETCHR: Editing To Clarify and Harness Reasoning
Beichen Zhang, Yuhong Liu, Jinsong Li, et al.
The paper presents ETCHR, a novel image editing model designed to enhance visual reasoning in multimodal large language models. It improves reasoning accuracy significantly across various tasks, achieving notable performance gains with different models.
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, et al.
Complete-muE is a framework that enables efficient hyperparameter transfer from dense models to Mixture-of-Experts (MoE) models. It allows for stable hyperparameter optimization across different model architectures, significantly speeding up convergence without extensive hyperparameter searches. The key result is that hyperparameters tuned on a single dense model can be effectively transferred to all MoE configurations.
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
Shuhong Zheng, Michael Oechsle, Erik Sandström, et al.
This paper presents a two-stage token selection framework that enhances the efficiency of visual geometry transformers for 3D reconstruction. By reducing the number of tokens each query interacts with, the method accelerates processing by over 85% while maintaining or improving performance. This advancement could significantly impact future applications in the field.
CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces
Joydeep Chandra
CHRONOS is a new architecture designed to improve recall and privacy in temporal knowledge-graph data marketplaces. It achieves a high recall rate of 0.937 while maintaining competitive query performance and privacy guarantees. This makes it a promising solution for managing evolving data and privacy constraints.
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
Anastasiia Sedova, Natalie Schluter, Skyler Seto, et al.
The LINK method enhances cross-lingual knowledge transfer by using lexical substitutions in high-resource training data. This approach requires only a bilingual vocabulary and leads to significant improvements in downstream tasks, achieving up to a 2x speedup in training time for equivalent performance.
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
Rim Assouel, Amir Bar, Michal Drozdzal, et al.
This paper introduces Procedurally Generated Tasks (PGT) to enhance fine-grained visual understanding in Multimodal Large Language Models. The key result shows that instruction tuning with PGT data improves performance by up to 20% on the What'sUp benchmark, indicating that better supervision can address spatial reasoning deficits.
On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy
Aratrika Mustafi, Soumya Mukherjee
This paper introduces a perturbation theory for spherical Hellinger-Kantorovich gradient flows, allowing for the comparison of flows from different potentials. A key result is the establishment of uniform bounds for log-likelihood ratios and divergences, which can be applied to enhance sampling methods in differential privacy.
Training-Free Looped Transformers
Lizhang Chen, Jonathan Li, Chen Liang, et al.
This paper presents a training-free method for enhancing transformer models by applying a looping strategy at inference time. The key result shows significant performance improvements on various benchmarks, including a 2.64 percentage point increase on MMLU-Pro for Qwen3-4B-Instruct.
Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer
Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur
This paper presents a new gradient flow for optimizing matrix-valued parameters using a regularized version of the Muon optimizer. The key result is the establishment of a damped Hamiltonian dynamics that ensures energy dissipation and convergence rates under certain conditions, which could enhance training in neural networks.
Human Decision-Making with Persuasive and Narrative LLM Explanations
Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, et al.
This study investigates how LLM-generated narrative explanations affect human decision-making. The key finding is that the persuasiveness of these narratives does not significantly improve decision accuracy compared to AI predictions alone, and may even slow down response times.
Leveraging Foundation Models for Causal Generative Modeling
Aneesh Komanduri, Xintao Wu
FM-CGM is a new framework that enables visual causal reasoning by integrating pretrained foundation models. It allows for zero-shot causal discovery and counterfactual generation, making it valuable for applications requiring reliable causal inference. A key result is its ability to identify plausible causal structures effectively.
Strong Teacher Not Needed? On Distillation in LLM Pretraining
Taiming Lu, Zhuang Liu
This study reveals that even weaker teachers can enhance larger student models when using a proper mix of losses. It also shows that stronger teachers do not always yield better results, as excessive parameters or training can diminish distillation benefits. Importantly, distillation is found to improve generalization more effectively than in-domain fitting.
Entrywise Error Bounds for Spectral Ranking with Semi-Random Adversaries
Dongmin Lee, Anuran Makur, Japneet Singh
This work explores how the performance of spectral algorithms for BTL estimation can be affected by adversarial sampling. The key finding is that by reweighting observed edges, the performance can be improved to match that of uniformly sampled graphs. This insight is crucial for practitioners dealing with biased data in ranking tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, et al.
ToolMerge is a new keyframe retrieval method that leverages LLMs to improve the selection process for long-video question answering. It effectively decomposes queries into tool calls and merges their results, showing a notable 5% improvement in caption retrieval over existing methods. This approach enhances the ability to provide verifiable visual evidence for various types of queries.
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
Stuart Bladon, Brinnae Bent
The research indicates that geopolitical biases in language models are primarily influenced by post-training rather than pre-training. Notably, the model from Alibaba showed a significant shift in bias towards China after post-training, emphasizing the importance of oversight in model alignment processes.
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
Andres Nava, Matthieu Wyart
This paper presents a theory that explains how the relationship between general and specific concepts is geometrically represented in language models. The key finding is that the structure of word embeddings reflects a hierarchical organization that mirrors taxonomic relationships, which can be observed in both word2vec and Gemma 2B embeddings.
Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot
Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, et al.
This study investigates how human-like visual representations can be better understood through a balance of discriminative and generative learning. The key finding is that human alignment is maximized at intermediate points of this continuum, suggesting that a hybrid approach yields better results in vision tasks.
Advanced AI Service Provisioning in O-RAN through LLM Engine Integration
Seyed Bagher Hashemi Natanzi, Pranshav Gajja, Bo Tang, et al.
This paper introduces a Dual-Brain architecture that leverages LLMs for orchestrating data collection and deployment in O-RAN systems, while an automated ML engine trains classifiers on demand. The key result is the ability to streamline the development of AI applications for real-time RAN control, enhancing efficiency.
Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models
Bo Peng, Jie Lu, Guangquan Zhang, et al.
This paper presents a new approach to out-of-distribution detection using pre-trained vision-language models. The key result shows that their method for debiasing negative label mining significantly improves OOD detection performance across various setups.
Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
Haoyuan Wang, Xiaohao Liu, Jiajie Su, et al.
This paper addresses the challenge of updating knowledge in multimodal large language models without losing existing capabilities. The authors propose new techniques to enhance the generalization of knowledge edits, demonstrating that their methods can effectively maintain consistent predictions across semantically similar inputs. A key result is the introduction of adversarial variants that improve robustness in knowledge editing.
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Nick Merrill, Jaeho Lee, Ezra Karger
The paper reveals that larger language models perform worse in forecasting tasks with superlinear growth and tail risks, particularly in the upper tail of distributions. This inverse scaling effect suggests that more capable models may misestimate extreme outcomes while maintaining lower tail accuracy. The authors recommend using continuous accuracy measures for better evaluation of LLM forecasting.
Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates
Saurabh Bhandari, Parveen Bhatti, Brian C.-H. Chiu, et al.
The spBART model effectively combines interpretable low-dimensional covariates with complex high-dimensional predictors. It successfully identifies important genomic loci and achieves high out-of-sample discrimination (AUC = 0.96) in multiple myeloma studies.
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, et al.
This study benchmarks five commercial ASR systems on code-switching between various languages. The key finding is that ElevenLabs Scribe v2 outperforms others with the lowest WER and highest BERTScore, highlighting significant quality differences in ASR performance.
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
Junyao Yang, Chen Qian, Kun Wang, et al.
This paper presents a new approach to improve reasoning in Large Reasoning Models by utilizing a correlation between token entropy and logit gradients. The key result shows that their proposed method, CorR-PO, consistently outperforms existing techniques, indicating that stronger entropy inversions lead to better reasoning performance.
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, et al.
This paper presents a framework for interpreting EEG foundation models by extracting sparse feature dictionaries and grounding them in clinical taxonomies. A key result is the identification of operational regimes that reveal critical representational failures, impacting clinical trust in model predictions.
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Ben Knight, Wm. Matthew Kennedy, Danielle Carvalho, et al.
The paper highlights how AI language learning tools can provide misleading feedback that reinforces misconceptions. It introduces L2-Bench, a benchmark for assessing AI feedback quality across six critical dimensions. The key result is the identification of 'explainability pitfalls' that can harm learning outcomes.
SUDP: Secret-Use Delegation Protocol for Agentic Systems
Xiaohang Yu, Hejia Geng, Xinmeng Zeng, et al.
The paper addresses the security risks associated with agentic systems using user secrets by formalizing the Agent Secret Use (ASU) problem. It proposes the Secret-Use Delegation Protocol (SUDP), which allows secure operations without granting reusable authority to untrusted requesters. This approach ensures that user-authorized actions are performed safely and effectively.
RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian
Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, et al.
RoIt-XMASA is a new multilingual dataset for sentiment analysis that includes 36,000 labeled reviews in Italian and Romanian. The proposed adversarial training framework improves sentiment discrimination while maintaining language and domain invariance, achieving a notable F1-score of 66.23% with XLM-R.
Safe Reinforcement Learning with Preference-based Constraint Inference
Chenglin Li, Grant Ruan, Hua Geng
This study presents a new approach called Preference-based Constrained Reinforcement Learning (PbCRL) that effectively infers safety constraints from human preferences. A key result is that PbCRL achieves better alignment with true safety requirements while outperforming existing methods in both safety and reward metrics.
Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models
Zongyu Guo, Jiajun He, Zhaoyang Jia, et al.
This paper presents a novel visual representation framework that encodes signals as functions, allowing for efficient video compression. The key result is the ability to hash an 81-frame video into a compact vector while enabling control over compression performance.
Entropy-Aware On-Policy Distillation of Language Models
Woogyeol Jin, Taywon Min, Yongjin Yang, et al.
The paper presents Entropy-Aware On-Policy Distillation, which improves knowledge transfer between language models by balancing precision and diversity. The key result shows significant accuracy gains across various benchmarks, indicating that accounting for teacher uncertainty enhances student-teacher alignment.
Certified Per-Instance Unlearning Using Individual Sensitivity Bounds
Hanna Benarroch, Jamal Atif, Olivier Cappé
This work presents a new method for certified machine unlearning that uses adaptive noise calibration based on individual data point contributions. The key result is that this approach allows for certified unlearning with significantly less noise injection compared to traditional methods, improving practical applicability. The findings are supported by both theoretical analysis and experimental results.
Linear Regression with Unknown Truncation Beyond Gaussian Features
Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, et al.
This paper presents a novel algorithm for truncated linear regression that operates efficiently even when the survival set is unknown. It achieves a polynomial runtime with respect to the number of dimensions and desired accuracy, making it more practical for real-world applications. The approach also contributes to positive-only PAC learning, which could be beneficial for future research.
Cascaded Transfer: Learning Many Tasks under Budget Constraints
Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, et al.
Cascaded Transfer Learning (CTL) allows for efficient learning across multiple related tasks by organizing them hierarchically. The approach minimizes transfer errors and maximizes accuracy within a constrained training budget, showing significant improvements in performance, especially under tight budgets.
R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification
Weijie Shi, Yanxi Chen, Zexi Li, et al.
R$^3$L improves reinforcement learning by synthesizing high-quality trajectories through a reflect-then-retry approach. This method enhances exploration and exploitation by using language feedback to correct errors and optimize training stability. The key result shows a 5% to 52% relative improvement over existing methods.
On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning
Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, et al.
This paper presents a new framework for establishing generalization bounds in multi-task deep neural networks. By using operator-theoretic techniques and a tailored Sobolev space, the authors achieve tighter bounds that remain effective even in single-output scenarios. This approach enhances theoretical understanding and offers flexibility in multi-task deep learning applications.
Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning
Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, et al.
This paper develops new generalization bounds for vector-valued neural networks, enhancing multi-task learning through a novel framework. The key result is the introduction of sketching techniques that improve computational efficiency while providing performance guarantees for various applications. The work adds to the theoretical understanding of generalization in deep learning architectures.
Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework
M. Gorpinich, B. Moya, S. Rodriguez, et al.
This paper presents a hybrid twin approach that uses Graph Neural Networks to model the ignorance in physics-based simulations, that is, the physics the simulation leaves out. The key result is that the GNN effectively captures this missing physics and improves simulation accuracy while reducing data requirements, making it practical for real-world applications.
DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection
Bo Gao, Jingcheng Tong, Xingsheng Chen, et al.
DFIR-DETR improves small object detection by addressing issues in attention mechanisms and feature upsampling. It achieves an mAP50 (mean average precision at an IoU threshold of 0.5) of 92.9% on NEU-DET and 51.6% on VisDrone with a compact model of 11.7M parameters. This demonstrates effective performance across different detection scenarios.
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, et al.
DCC is a new ML compiler that optimizes data rearrangements and compute code for PIM devices, significantly improving performance. It achieves up to a 13.17x speedup over GPU-only execution on specific PIM architectures. Builders should care because this can make ML applications markedly more efficient.
Are Targeted Data Poisoning Attacks as Effective as We Think?
William Xu, Chenyu Zhang, Yihan Wang, et al.
The paper presents a novel approach to identify the easiest and hardest samples to poison in targeted data poisoning attacks. By leveraging clean model information, it enables better evaluation of attack effectiveness and proactive defenses against vulnerabilities. A key result is the reliable stratification of samples by poisoning vulnerability.
Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
Daniel Daza, Alberto Bernardi, Luca Costabello, et al.
This paper presents a new approach to query answering in knowledge graphs that incorporates soft constraints, allowing users to express preferences. The key result is that the proposed methods maintain robust performance while adding minimal overhead, enabling more flexible interactions with graph databases.