Salesforce AI Research will present 21 accepted papers at ICLR 2026, the Fourteenth International Conference on Learning Representations. The conference runs April 23–27 at the Riocentro Convention and Event Center in Rio de Janeiro, Brazil.
Our accepted authors will share their work through lightning talks, poster sessions, and workshops throughout the week.
This year's research reflects the problems we think matter most for enterprise AI: agents that act reliably in complex environments, evaluation frameworks that expose real failure modes, stronger reasoning, and systems that stay efficient and trustworthy at scale.
Workshop Paper
In addition to our main conference acceptances, our work on agent identity failures was accepted to the Agents in the Wild: Safety, Security, and Beyond workshop at ICLR 2026.
ECHOING: Identity Failures When LLM Agents Talk to Each Other Paper
When LLM agents interact autonomously, they can abandon their assigned roles and mirror their conversational partner. We call this 'echoing.' Across 2,500+ conversations, echoing rates reached as high as 70% with leading model providers, yet 93% of affected conversations still registered as successful by standard metrics. Reasoning models offered minimal improvement, and structured responses reduced but did not eliminate the problem.
Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese
Main Conference Papers
Agent Architectures and GUI Agents
Our agent work this year spans test-time scaling, tool learning, computer-use benchmarks, and multi-agent coordination.
GTA1: GUI Test-time Scaling Agent Paper
GTA1 introduces test-time scaling for GUI agents, using multiple candidate action proposals and RL-based grounding to achieve state-of-the-art performance on autonomous task completion across platforms.
Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li
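The sampling-and-selection loop behind this kind of test-time scaling can be sketched as follows. Everything here is a toy stand-in: GTA1 samples action proposals from an LLM and selects among them with a judge model, whereas this sketch draws from a fixed action list and uses a simple majority vote.

```python
import random
from collections import Counter

def propose_actions(observation, k=8, seed=0):
    # Stand-in for sampling k candidate GUI actions from a policy model
    # conditioned on the current screen observation.
    rng = random.Random(seed)
    toy_actions = ["click(search_box)", "type('invoice')", "click(submit)"]
    return [rng.choice(toy_actions) for _ in range(k)]

def select_action(candidates):
    # Majority vote over proposals (a simple self-consistency rule;
    # GTA1 scores candidates with a trained judge instead).
    return Counter(candidates).most_common(1)[0][0]

candidates = propose_actions("screenshot of settings page", k=8)
action = select_action(candidates)
print(action)
```

The key idea the sketch preserves is that extra test-time compute goes into proposing and filtering candidate actions rather than into a larger model.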
WALT: Web Agents that Learn Tools Paper
WALT reverse-engineers website functionality into reusable tools like search, filter, and create. This shifts from fragile step-by-step interactions to reliable tool invocation, with higher success rates and fewer steps on VisualWebArena and WebArena.
Authors: Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu
SCUBA: Salesforce Computer Use Benchmark Paper
SCUBA benchmarks computer-use agents on 300 real Salesforce CRM tasks across admin, sales, and service workflows. Open-source agents achieve less than 5% success versus 39% for closed-source models in zero-shot settings, improving to 50% with demonstrations while reducing time and costs by 13–16%.
Authors: Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu
CoAct-1: Computer-using Multi-agent System with Coding Actions Paper
CoAct-1 introduces a multi-agent system combining GUI control with programmatic execution. An Orchestrator delegates subtasks to GUI Operator or Programmer agents, achieving 60.76% success on OSWorld (a new state of the art) while reducing average steps from 15 to 10.15.
Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
Grounded Test-Time Adaptation for LLM Agents Paper
Parametric online adaptation aligns LLM agents to environment-specific formats, while non-parametric dynamics grounding learns causal state transitions through persona-driven exploration. Together they address syntactic and semantic mismatches, boosting WebArena multi-site success from 2% to 23%.
Authors: Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong
Reasoning and Evaluation
Advancing how LLMs reason, and how we measure that reasoning, is central to building enterprise AI that works. This cluster addresses test-time scaling, verification dynamics, evaluator training, and efficient reasoning under constraints.
Nudging the Boundaries of LLM Reasoning Paper
NuRL overcomes a central RL limitation by using self-generated hints to unlock learning from previously 'unsolvable' problems, raising performance ceilings where standard methods like GRPO plateau, with consistent improvements across six benchmarks and three models.
Authors: Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu
Variation in Verification: Understanding Verification Dynamics in Large Language Models Paper
This paper analyzes how LLM verifiers assess solution candidates in test-time scaling, finding that weak generators can match stronger ones post-verification. Verification effectiveness depends on problem difficulty, generator strength, and verifier capability, revealing when verifier scaling reaches its limits.
Authors: Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
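The generator–verifier interaction the paper studies follows the standard best-of-N pattern, sketched below. The generator and verifier here are toy lambdas so the loop is runnable; in the actual setup both are LLMs.

```python
def best_of_n(generate, verify, problem, n=4):
    # Sample n candidate solutions, then return the one the verifier
    # scores highest; verifier quality caps how much this helps.
    candidates = [generate(problem, seed=i) for i in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins: candidates are labeled strings, and the "verifier"
# simply scores each candidate by the seed embedded in its label.
generate = lambda problem, seed: f"solution_{seed}"
verify = lambda candidate: int(candidate.rsplit("_", 1)[1])

best = best_of_n(generate, verify, "prove n^2 >= n for n >= 1")
print(best)  # solution_3
```

The paper's finding maps onto this loop directly: a weak `generate` paired with a strong `verify` can match a stronger generator, but only up to limits set by problem difficulty.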
Foundational Automated Evaluators (FARE): Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper
FARE, trained on 2.5M samples, sets new standards for open-source evaluators. The 8B model rivals larger RL-trained models, while the 20B version surpasses 70B+ evaluators and achieves near-oracle reranking on MATH with 14.1% downstream RL improvements.
Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization Paper
Fine-tuned LLM judges struggle with future-proofing but handle backward compatibility well with DPO. Continual learning balances adaptation across response distributions, though all judges degrade on unseen questions as generators evolve.
Authors: Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty
Scalable Chain of Thoughts via Elastic Reasoning Paper
Elastic Reasoning separates chain-of-thought into thinking and solution phases with independent budgets, prioritizing solution completeness under constraints. The approach achieves strong performance with lower training costs and more concise reasoning across math and coding benchmarks.
Authors: Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
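The two-budget decoding idea can be illustrated with a toy generator. `toy_generate`, the tag strings, and the budget values are all illustrative stand-ins for a real LLM call, not the paper's implementation.

```python
def elastic_generate(generate_fn, prompt, think_budget, answer_budget):
    # Phase 1: thinking, hard-capped at its own token budget.
    thought = generate_fn(prompt + " <think>", max_tokens=think_budget)
    # Phase 2: the solution gets an independent budget, so a long
    # chain of thought can never squeeze out the final answer.
    answer = generate_fn(prompt + " <think>" + thought + " </think>",
                         max_tokens=answer_budget)
    return thought, answer

def toy_generate(text, max_tokens):
    # Emits one word per "token" until the cap is hit.
    return " ".join(["step"] * max_tokens)

thought, answer = elastic_generate(toy_generate, "2+2=?",
                                   think_budget=5, answer_budget=3)
print(len(thought.split()), len(answer.split()))  # 5 3
```

The point the sketch captures is that truncating only the thinking phase degrades gracefully, because the solution phase always runs to its own budget.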
Learning to Reason over Continuous Tokens with Reinforcement Learning (HyRea) Paper
HyRea dynamically switches between explicit and latent reasoning via entropy-guided cold-start and GRPO fine-tuning, reducing token usage to roughly 60% while maintaining competitive accuracy across mathematical reasoning benchmarks.
Authors: Yiran Zhao, Yuhui Xu, Doyen Sahoo, Caiming Xiong, Junnan Li
Enhancing LLM Alignment with References Paper
Reference-guided evaluation improves LLM-based evaluators and enables effective semi-self-improvement, achieving 73.1% on AlpacaEval and 58.7% on Arena-Hard with Llama-3-8B-Instruct, comparable to fine-tuned reward models.
Authors: Kejian Shi, Yixin Liu, PeiFeng Wang, Alexander Fabbri, Shafiq Rayhan Joty, Arman Cohan
Deep Research Reliability
As AI systems take on complex research and knowledge synthesis tasks, rigorous evaluation of their outputs becomes critical. These papers establish new frameworks for auditing deep research quality and measuring citation-grounded reliability.
DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence Paper
DeepTRACE audits generative search engines and deep research agents, finding that they produce overconfident, one-sided responses with 20–60% of statements unsupported by their own cited sources.
Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild Paper
LiveResearchBench introduces 100 expert-curated tasks requiring real-time web search, paired with DeepEval for assessing citation-grounded reports. Evaluation of 17 systems reveals specific strengths, failure modes, and the components needed for reliable deep research.
Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty
Knowledge Graphs and Retrieval
Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency Paper
SynthKG introduces ontology-free knowledge graph synthesis that distills into Distill-SynthKG for efficient single-step generation, surpassing models 8x larger in KG quality and outperforming baselines in retrieval and question answering with a novel graph-based RAG framework.
Authors: Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu
LLM Behavior and Robustness
Understanding how LLMs behave under varying conditions, from multi-turn dialogue to internal circuit mechanisms, shapes how we build more reliable systems.
LLMs Get Lost in Multi-Turn Conversation Paper
LLMs show a 39% performance drop in multi-turn versus single-turn conversations across six tasks. Analysis of 200,000+ simulated conversations reveals that models make premature assumptions and fail to recover when they take wrong turns.
Authors: Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition Paper
Circuit analysis of off-by-one addition reveals a function induction mechanism in which parallel attention heads emit distinct pieces of the +1 function. This reusable structure enables task-level generalization across shifted QA, base-8 addition, and other tasks.
Authors: Qinyuan Ye, Robin Jia, Xiang Ren
Efficiency and Scalability
Making models smaller, faster, and cheaper while preserving performance is critical for enterprise deployment at scale.
Entropy-Based Block Pruning for Efficient Large Language Models Paper
Entropy-based pruning outperforms cosine-similarity methods by leveraging entropy patterns across Transformer blocks (decreasing early, then increasing) as a more effective measure of information richness for reducing model size while preserving accuracy.
Authors: Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
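A minimal sketch of an entropy-based selection criterion, assuming each block is scored by the Shannon entropy of a normalized activation histogram; the histograms below are made up, and the paper's exact statistic may differ.

```python
import math

def entropy(probs):
    # Shannon entropy of a normalized distribution, in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_blocks_for_pruning(block_activation_probs):
    # Rank Transformer blocks by activation entropy: lower entropy is
    # taken as less information, so those blocks are pruned first.
    scores = [entropy(p) for p in block_activation_probs]
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Toy example: four blocks with made-up normalized histograms.
blocks = [
    [0.97, 0.01, 0.01, 0.01],  # very peaked -> low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform -> high entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
order = rank_blocks_for_pruning(blocks)
print(order)  # [0, 2, 3, 1]: block 0 is pruned first
```

Contrast this with cosine-similarity pruning, which removes blocks whose input and output representations barely change; the entropy view instead asks how much information a block's activations carry.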
OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs Paper
OFTSR achieves one-step image super-resolution with a tunable fidelity-realism trade-off by aligning student predictions to teacher-model sampling trajectories, reaching state-of-the-art performance on FFHQ, DIV2K, and ImageNet without multi-step overhead.
Authors: Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang
Scaling Reinforcement Learning
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels Paper
Webscale-RL introduces a scalable pipeline that converts pre-training documents into 1.2M verifiable QA pairs across 9+ domains. RL training on this dataset matches continual pre-training performance with 100x fewer tokens, suggesting RL can reach pre-training performance at a fraction of the data cost.
Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
Software Engineering
SweRank: Software Issue Localization with Code Ranking Paper
SweRank introduces an efficient retrieve-and-rerank framework for software issue localization, trained on the SweLoc dataset. It achieves state-of-the-art performance on SWE-Bench-Lite and LocBench while outperforming costly agent-based systems that rely on closed-source LLMs.
Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty
Visit Us at ICLR 2026
Our researchers will present throughout the conference. Stop by booth #203 or check our ICLR schedule for specific session times. We'll also share updates throughout the week on Bluesky and X.