GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing
GENESIS is an AI agent framework that streamlines cellular R&D by converting intents into validated solutions. It shortens R&D processes while addressing the challenges specific to Radio Access Networks. The key result is faster, more reliable development cycles in a field where traditional methods are slow.
EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
EdgeFlow improves the conversion of flowcharts to machine-readable models by using a Canny edge map as a structural prior. It achieves notable increases in node-level and edge-level F1 scores, demonstrating its effectiveness in industrial requirements engineering. This method does not require annotated training data, making it practical for real-world applications.
BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
BASIS is a new algorithm that enhances the efficiency of value function estimation in reinforcement learning. It achieves a 69% reduction in MSE compared to a strong baseline while using only one rollout per prompt, leading to better policy optimization with less training time.
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
MobileGym is a simulation platform for mobile GUI agent research that supports high interaction fidelity and scalable reinforcement learning. It provides structured evaluation and rewards, leading to a performance improvement in real-device execution. The key result is a +12.8 percentage point gain on a test set.
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper highlights the importance of designing modular and verifiable architectures around foundation models for agentic AI. It identifies key bottlenecks in context governance, trustworthy memory, and skill routing, proposing a new evaluation framework that focuses on the quality of agent behavior over simple task success. The key result is the introduction of CheetahClaws, a reference harness for evaluating these architectures.
Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning
The paper presents Prism, a new codebase designed to facilitate scalable Multimodal Continual Instruction Tuning (MCIT) research. By allowing independent plugin integration, it reduces implementation overhead and enhances code reuse. This approach aims to accelerate the development of new MCIT strategies.
Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models
This paper presents a new approach to improve code review efficiency by using large language models to label code changes in patches. The proposed method achieves high recall and precision, suggesting it can effectively enhance traditional static analysis workflows.
Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay
This paper addresses the issue of forgetting in language models when trained on new tasks. It shows that self-generated samples can effectively serve as replay data, significantly reducing forgetting. The key result is that this method allows for high-learning-rate finetuning without the typical tradeoff of forgetting.
Active Query Synthesis for Preference Learning
This paper presents a novel approach to active learning that improves the efficiency of user preference learning by addressing feedback reliability. The key result is the development of the Info-Synth framework, which generates optimal queries to enhance decision-making systems. This method shows versatility across various applications, including preference learning and robotic control.
WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification
This paper presents a new framework for re-annotating multilingual speaker attributes using human-LLM collaboration. The key finding is that there are significant cross-lingual differences in how speaker attributes are annotated, highlighting both the potential and limitations of LLMs in this context.
Retrying vs Resampling in AI Control
This paper explores the concepts of retrying and resampling in AI coding tools, highlighting how retrying can reduce suspicion scores but may also allow for sneakier attacks. A key finding is that auditing based on maximum suspicion scores during resampling significantly improves safety without sacrificing usefulness.
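The auditing rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the 0-to-1 score scale, and the threshold value are all assumptions made for the example.

```python
from statistics import mean
from typing import List


def audit_by_max_suspicion(scores: List[float], threshold: float) -> bool:
    """Flag a step for audit if ANY resample looks suspicious enough.

    `scores` holds one monitor score per resample of the same step
    (hypothetical 0-to-1 scale; higher means more suspicious).
    """
    if not scores:
        raise ValueError("need at least one resampled score")
    return max(scores) >= threshold


def audit_by_mean_suspicion(scores: List[float], threshold: float) -> bool:
    """Comparison rule: flag only if the AVERAGE score is high enough."""
    if not scores:
        raise ValueError("need at least one resampled score")
    return mean(scores) >= threshold


# Made-up scores for three resamples of one step.
resampled_scores = [0.12, 0.81, 0.30]
flag_max = audit_by_max_suspicion(resampled_scores, threshold=0.75)
flag_mean = audit_by_mean_suspicion(resampled_scores, threshold=0.75)
```

With these made-up scores the mean stays below the threshold, so a mean-based rule would miss the one suspicious resample that the max-based rule flags.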
Causal methods for LLM development and evaluation
This paper argues for the integration of causal methods in the development and evaluation of large language models. It highlights how these methods can address confounding factors and improve the reliability of LLMs. The key result is that causal methods can enhance the understanding of interventions in LLM training and evaluation.
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
SkillOpt is a novel optimizer for agent skills that improves performance by applying a controlled text-space optimization approach. It significantly enhances the accuracy of various models in different execution environments, demonstrating its effectiveness across multiple benchmarks.
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
The paper presents the Shannon Scaling Law, which models LLM training as information transmission, capturing the effects of noise on performance. A key result is that failing to maintain a sufficient signal-to-noise ratio leads to performance degradation, which is effectively predicted by this new framework.
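For background, the classical Shannon-Hartley capacity formula, C = B log2(1 + SNR), shows why a falling signal-to-noise ratio caps the information a channel can carry. The sketch below computes only that textbook formula. It is not the paper's Shannon Scaling Law, whose exact form is not given here, and the function name is hypothetical.

```python
import math


def shannon_capacity(bandwidth: float, snr: float) -> float:
    """Textbook Shannon-Hartley capacity: C = B * log2(1 + SNR).

    `snr` is a linear power ratio (not decibels). This is background
    only, not the scaling law proposed in the paper.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if snr < 0:
        raise ValueError("snr must be non-negative")
    return bandwidth * math.log2(1.0 + snr)


# Capacity shrinks as noise grows relative to signal.
high_snr_capacity = shannon_capacity(bandwidth=1.0, snr=3.0)
low_snr_capacity = shannon_capacity(bandwidth=1.0, snr=1.0)
```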
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
This paper investigates the lifecycle of skills in language agents, focusing on their extraction and consumption. A key finding is that model-generated skills generally improve performance but can lead to negative transfer, highlighting the complexity of skill utility across different models. The authors propose a meta-skill to enhance skill extraction and reduce negative transfer.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
This paper investigates how well Vision-Language Models understand numerical outputs in spatial contexts. The key finding is that these models often fail to ground numerical values in spatial meaning, performing close to random guessing. Tuning improves performance, but explicit reasoning yields only marginal gains.
ETCHR: Editing To Clarify and Harness Reasoning
The paper presents ETCHR, a novel image editing model designed to enhance visual reasoning in multimodal large language models. It significantly improves reasoning accuracy across various tasks and models.
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Complete-muE is a framework that enables efficient hyperparameter transfer from dense models to Mixture-of-Experts (MoE) models. It allows for stable hyperparameter optimization across different model architectures, significantly speeding up convergence without extensive hyperparameter searches. The key result is that hyperparameters tuned on a single dense model can be effectively transferred to all MoE configurations.
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
This paper presents a two-stage token selection framework that makes visual geometry transformers more efficient for 3D reconstruction. By reducing the number of tokens each query interacts with, the method accelerates processing by over 85% while maintaining or improving performance.
CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces
CHRONOS is a new architecture designed to improve recall and privacy in temporal knowledge-graph data marketplaces. It achieves a high recall rate of 0.937 while maintaining competitive query performance and privacy guarantees. This makes it a promising solution for managing evolving data and privacy constraints.
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
The LINK method enhances cross-lingual knowledge transfer by using lexical substitutions in high-resource training data. This approach requires only a bilingual vocabulary and leads to significant improvements in downstream tasks, achieving up to a 2x speedup in training time for equivalent performance.
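The idea of lexical substitution with a bilingual vocabulary can be illustrated roughly as follows. This is a sketch under stated assumptions, not the LINK procedure itself: the function name, the substitution rate, the lowercase lookup, and the toy English-to-German vocabulary are all invented for the example.

```python
import random
from typing import Dict, List


def substitute_lexicon(
    tokens: List[str],
    bilingual_vocab: Dict[str, str],
    rate: float,
    rng: random.Random,
) -> List[str]:
    """Replace in-vocabulary tokens with their translation.

    Each token found in `bilingual_vocab` (looked up in lowercase) is
    swapped with probability `rate`; all other tokens pass through.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    out = []
    for tok in tokens:
        translation = bilingual_vocab.get(tok.lower())
        if translation is not None and rng.random() < rate:
            out.append(translation)
        else:
            out.append(tok)
    return out


# Toy, hypothetical vocabulary and sentence.
vocab = {"cat": "Katze", "house": "Haus"}
sentence = "The cat sat in the house".split()
mixed = substitute_lexicon(sentence, vocab, rate=1.0, rng=random.Random(0))
```

A rate below 1.0 leaves some source-language words in place, so the same sentence can yield several mixed-language variants across random seeds.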
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
This paper introduces Procedurally Generated Tasks (PGT) to enhance fine-grained visual understanding in Multimodal Large Language Models. The key result shows that instruction tuning with PGT data improves performance by up to +20% on the What'sUp benchmark, indicating that better supervision can address spatial reasoning deficits.
On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy
This paper introduces a perturbation theory for spherical Hellinger-Kantorovich gradient flows, allowing for the comparison of flows from different potentials. A key result is the establishment of uniform bounds for log-likelihood ratios and divergences, which can be applied to enhance sampling methods in differential privacy.
Training-Free Looped Transformers
This paper presents a training-free method for enhancing transformer models by applying a looping strategy at inference time. The key result shows significant performance improvements on various benchmarks, including a +2.64 percentage point increase on MMLU-Pro for Qwen3-4B-Instruct.
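In general terms, inference-time looping means running some block of layers more than once without changing any weights. The sketch below shows that control flow on toy "layers" that are plain Python functions. Which layers the paper loops, and how many times, is not stated here, so `loop_start`, `loop_end`, and `n_loops` are hypothetical parameters.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def forward_with_loop(
    layers: Sequence[Layer],
    x: List[float],
    loop_start: int,
    loop_end: int,
    n_loops: int,
) -> List[float]:
    """Run layers[loop_start:loop_end] `n_loops` times; others once."""
    if not 0 <= loop_start < loop_end <= len(layers):
        raise ValueError("invalid loop range")
    if n_loops < 1:
        raise ValueError("n_loops must be at least 1")
    for layer in layers[:loop_start]:        # prefix, applied once
        x = layer(x)
    for _ in range(n_loops):                 # looped block, weights reused
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:          # suffix, applied once
        x = layer(x)
    return x


# Toy stand-ins for transformer layers.
def add_one(x: List[float]) -> List[float]:
    return [v + 1.0 for v in x]


def double(x: List[float]) -> List[float]:
    return [v * 2.0 for v in x]


def minus_three(x: List[float]) -> List[float]:
    return [v - 3.0 for v in x]


toy_layers = [add_one, double, minus_three]
plain = forward_with_loop(toy_layers, [1.0], 1, 2, n_loops=1)
looped = forward_with_loop(toy_layers, [1.0], 1, 2, n_loops=3)
```

Setting `n_loops=1` reproduces the ordinary forward pass, which makes it easy to compare looped and unlooped outputs on the same input.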
Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer
This paper presents a new gradient flow for optimizing matrix-valued parameters using a regularized version of the Muon optimizer. The key result is the establishment of a damped Hamiltonian dynamics that ensures energy dissipation and convergence rates under certain conditions, which could enhance training in neural networks.
Human Decision-Making with Persuasive and Narrative LLM Explanations
This study investigates how LLM-generated narrative explanations affect human decision-making. The key finding is that the persuasiveness of these narratives does not significantly improve decision accuracy compared to AI predictions alone, and may even slow down response times.
Leveraging Foundation Models for Causal Generative Modeling
FM-CGM is a new framework that enables visual causal reasoning by integrating pretrained foundation models. It allows for zero-shot causal discovery and counterfactual generation, making it valuable for applications requiring reliable causal inference. A key result is its ability to identify plausible causal structures effectively.
Strong Teacher Not Needed? On Distillation in LLM Pretraining
This study reveals that even weaker teachers can enhance larger student models when using a proper mix of losses. It also shows that stronger teachers do not always yield better results, as excessive parameters or training can diminish distillation benefits. Importantly, distillation is found to improve generalization more effectively than in-domain fitting.
Entrywise Error Bounds for Spectral Ranking with Semi-Random Adversaries
This work explores how the performance of spectral algorithms for BTL estimation can be affected by adversarial sampling. The key finding is that by reweighting observed edges, the performance can be improved to match that of uniformly sampled graphs. This insight is crucial for practitioners dealing with biased data in ranking tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge is a new keyframe retrieval method that leverages LLMs to improve the selection process for long-video question answering. It effectively decomposes queries into tool calls and merges their results, showing a notable 5% improvement in caption retrieval over existing methods. This approach enhances the ability to provide verifiable visual evidence for various types of queries.
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
The research indicates that geopolitical biases in language models are primarily influenced by post-training rather than pre-training. Notably, the model from Alibaba showed a significant shift in bias towards China after post-training, emphasizing the importance of oversight in model alignment processes.
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
This paper presents a theory that explains how the relationship between general and specific concepts is geometrically represented in language models. The key finding is that the structure of word embeddings reflects a hierarchical organization that mirrors taxonomic relationships, which can be observed in both word2vec and Gemma 2B embeddings.
Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot
This study investigates how human-like visual representations can be better understood through a balance of discriminative and generative learning. The key finding is that human alignment is maximized at intermediate points of this continuum, suggesting that a hybrid approach yields better results in vision tasks.
Advanced AI Service Provisioning in O-RAN through LLM Engine Integration
This paper introduces a Dual-Brain architecture that leverages LLMs for orchestrating data collection and deployment in O-RAN systems, while an automated ML engine trains classifiers on demand. The key result is the ability to streamline the development of AI applications for real-time RAN control, enhancing efficiency.
Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models
This paper presents a new approach to out-of-distribution (OOD) detection using pre-trained vision-language models. The key result is that the proposed method for debiasing negative label mining significantly improves OOD detection performance across various setups.
Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
This paper addresses the challenge of updating knowledge in multimodal large language models without losing existing capabilities. The authors propose new techniques to enhance the generalization of knowledge edits, demonstrating that their methods can effectively maintain consistent predictions across semantically similar inputs. A key result is the introduction of adversarial variants that improve robustness in knowledge editing.
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
The paper reveals that larger language models perform worse in forecasting tasks with superlinear growth and tail risks, particularly in the upper tail of distributions. This inverse scaling effect suggests that more capable models may misestimate extreme outcomes while maintaining lower tail accuracy. The authors recommend using continuous accuracy measures for better evaluation of LLM forecasting.
Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates
The spBART model combines interpretable low-dimensional covariates with complex high-dimensional predictors. It identifies important genomic loci and achieves high out-of-sample discrimination (AUC = 0.96) in multiple myeloma studies.
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
This study benchmarks five commercial ASR systems on code-switching speech involving Arabic, Persian, and German. The key finding is that ElevenLabs Scribe v2 outperforms the others, with the lowest word error rate (WER) and highest BERTScore, highlighting significant quality differences among commercial ASR systems.
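For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition with whitespace tokenization. The study's own text normalization and scoring tools are not described here, so this is not a reproduction of its setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# Toy example: one substitution and one deletion over four words.
example_wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.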
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
This paper presents a new approach to improve reasoning in Large Reasoning Models by exploiting a correlation between token entropy and logit gradients. The key result is that the proposed method, CorR-PO, consistently outperforms existing techniques, indicating that stronger entropy inversions lead to better reasoning performance.
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
This paper presents a framework for interpreting EEG foundation models by extracting sparse feature dictionaries and grounding them in clinical taxonomies. A key result is the identification of operational regimes that reveal critical representational failures, impacting clinical trust in model predictions.
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
The paper highlights how AI language learning tools can provide misleading feedback that reinforces misconceptions. It introduces L2-Bench, a benchmark for assessing AI feedback quality across six critical dimensions. The key result is the identification of 'explainability pitfalls' that can harm learning outcomes.
SUDP: Secret-Use Delegation Protocol for Agentic Systems
The paper addresses the security risks associated with agentic systems using user secrets by formalizing the Agent Secret Use (ASU) problem. It proposes the Secret-Use Delegation Protocol (SUDP), which allows secure operations without granting reusable authority to untrusted requesters. This approach ensures that user-authorized actions are performed safely and effectively.
RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian
RoIt-XMASA is a new multilingual dataset for sentiment analysis that includes 36,000 labeled reviews in Italian and Romanian. The proposed adversarial training framework improves sentiment discrimination while maintaining language and domain invariance, achieving a notable F1-score of 66.23% with XLM-R.
Safe Reinforcement Learning with Preference-based Constraint Inference
This study presents a new approach called Preference-based Constrained Reinforcement Learning (PbCRL) that effectively infers safety constraints from human preferences. A key result is that PbCRL achieves better alignment with true safety requirements while outperforming existing methods in both safety and reward metrics.
Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models
This paper presents a novel visual representation framework that encodes signals as functions, allowing for efficient video compression. The key result is the ability to hash an 81-frame video into a compact vector while enabling control over compression performance.
Entropy-Aware On-Policy Distillation of Language Models
The paper presents Entropy-Aware On-Policy Distillation, which improves knowledge transfer between language models by balancing precision and diversity. The key result shows significant accuracy gains across various benchmarks, indicating that accounting for teacher uncertainty enhances student-teacher alignment.
Certified Per-Instance Unlearning Using Individual Sensitivity Bounds
This work presents a new method for certified machine unlearning that uses adaptive noise calibration based on individual data point contributions. The key result is that this approach allows for certified unlearning with significantly less noise injection compared to traditional methods, improving practical applicability. The findings are supported by both theoretical analysis and experimental results.
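To see why per-instance calibration can reduce noise, consider the classical Gaussian mechanism, where the noise scale grows linearly with the sensitivity bound. The sketch below uses that textbook formula, sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, valid for epsilon in (0, 1). It is an illustration only: the paper's actual calibration and certification are not given here, and the sensitivity values are made up.

```python
import math


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Classical Gaussian-mechanism noise scale (requires 0 < epsilon < 1)."""
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this formula assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Hypothetical bounds: a worst-case bound over all points versus a
# tighter bound for one specific point being unlearned.
worst_case_sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
per_instance_sigma = gaussian_sigma(sensitivity=0.2, epsilon=0.5, delta=1e-5)
```

Under this formula a point whose individual bound is five times tighter than the worst case needs five times less noise for the same (epsilon, delta) target.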
Linear Regression with Unknown Truncation Beyond Gaussian Features
This paper presents a novel algorithm for truncated linear regression that operates efficiently even when the survival set is unknown. It achieves a polynomial runtime with respect to the number of dimensions and desired accuracy, making it more practical for real-world applications. The approach also contributes to positive-only PAC learning, which could be beneficial for future research.
R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification
R$^3$L improves reinforcement learning by synthesizing high-quality trajectories through a reflect-then-retry approach. This method enhances exploration and exploitation by using language feedback to correct errors and optimize training stability. The key result shows a 5% to 52% relative improvement over existing methods.
On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning
This paper presents a new framework for establishing generalization bounds in multitask deep neural networks. By using operator-theoretic techniques and a tailored Sobolev space, the authors achieve tighter bounds that are effective even in single output scenarios. This approach enhances theoretical understanding and offers flexibility in multitask deep learning applications.
Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning
This paper develops new generalization bounds for vector-valued neural networks, enhancing multi-task learning through a novel framework. The key result is the introduction of sketching techniques that improve computational efficiency while providing performance guarantees for various applications. This work significantly advances understanding of generalization in deep learning architectures.
DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection
DFIR-DETR improves small object detection by addressing issues in attention mechanisms and feature upsampling. It achieves a mean Average Precision (mAP50) of 92.9% on NEU-DET and 51.6% on VisDrone with a compact model size of 11.7M parameters. This demonstrates effective performance across different detection scenarios.
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
DCC is a new ML compiler that optimizes data rearrangements and compute code for processing-in-memory (PIM) devices. It achieves up to a 13.17x speedup on specific PIM architectures compared to GPU-only execution.
Are Targeted Data Poisoning Attacks as Effective as We Think?
The paper presents a novel approach to identify the easiest and hardest samples to poison in targeted data poisoning attacks. By leveraging clean model information, it enables better evaluation of attack effectiveness and proactive defenses against vulnerabilities. A key result is the reliable stratification of samples by poisoning vulnerability.
Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
This paper presents a new approach to query answering in knowledge graphs that incorporates soft constraints, allowing users to express preferences. The key result is that the proposed methods maintain robust performance while adding minimal overhead, enabling more flexible interactions with graph databases.
STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction
STM3 captures complex long-term spatio-temporal dependencies with a mixture of multiscale Mamba models. It outperforms the second-best model on the PEMSD8 dataset by 7.1% in MAE.
Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Melt Pool Dynamics in Laser Powder Bed Fusion
The FEA-PINN framework significantly reduces computational costs while maintaining accuracy comparable to traditional FEA in simulating melt pool dynamics in LPBF. It effectively tracks material status during laser melting and incorporates various physical phenomena.
Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
The paper presents S^2-Bench, a benchmark for evaluating LLMs in generating diverse molecular candidates from natural language prompts. It includes tasks that test molecule editing, optimization, and customization, demonstrating that Llama3.1-8B can outperform leading models like GPT-4o. This shift in focus enhances the capabilities of LLMs in molecular discovery.
Nonlinear Transformations Against Unlearnable Datasets
This research introduces a nonlinear transformation framework that allows deep neural networks to learn from data previously deemed unlearnable. The approach shows improvements in accuracy ranging from 0.34% to 249.59% on unlearnable CIFAR10 datasets, indicating that current protection methods may be insufficient.