Salesforce AI Research at ICLR 2026

Salesforce AI Research will present 21 accepted papers at ICLR 2026, the Fourteenth International Conference on Learning Representations. The conference runs April 23–27 at the Riocentro Convention and Event Center in Rio de Janeiro, Brazil.

Our accepted authors will share their work via lightning talks, poster sessions, and workshops throughout the week.

This year’s research reflects the problems we think matter most for enterprise AI: agents that act reliably in complex environments, evaluation frameworks that expose real failure modes, stronger reasoning, and systems that stay efficient and trustworthy at scale.

Workshop Paper

In addition to our main conference acceptances, our work on agent identity failures was accepted to the Agents in the Wild: Safety, Security, and Beyond workshop at ICLR 2026.

ECHOING: Identity Failures When LLM Agents Talk to Each Other Paper

When LLM agents interact autonomously, they can abandon their assigned roles and mirror their conversational partner. We call this ‘echoing.’ Across 2,500+ conversations, echoing rates reached as high as 70% with major model providers, yet 93% of affected conversations still registered as successful by standard metrics. Reasoning models offered minimal improvement, and structured responses reduced but didn’t eliminate the problem.

Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese

Main Conference Papers

Agent Architectures and GUI Agents

Our agent work this year spans test-time scaling, tool learning, computer-use benchmarks, and multi-agent coordination.

GTA1: GUI Test-time Scaling Agent Paper

GTA1 introduces test-time scaling for GUI agents, using multiple candidate action proposals and RL-based grounding to achieve state-of-the-art performance on autonomous task completion across platforms.

Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li
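One way to picture test-time scaling over candidate actions is best-of-N selection: sample several proposals, score each, and act on the highest-scoring one. The sketch below is illustrative only. `select_action`, `toy_propose`, and `toy_score` are hypothetical names, and the toy scorer stands in for whatever selection mechanism GTA1 actually uses, which this post does not describe.

```python
import random
from typing import Callable, List


def select_action(
    propose: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    task: str,
    n_candidates: int = 4,
    seed: int = 0,
) -> str:
    """Sample n candidate actions and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates: List[str] = [propose(task, rng) for _ in range(n_candidates)]
    # max() keeps the first candidate among ties, so selection is deterministic
    # for a fixed seed.
    return max(candidates, key=lambda action: score(task, action))


# Hypothetical proposer: picks a random GUI action string.
def toy_propose(task: str, rng: random.Random) -> str:
    return rng.choice(["click(search_box)", "scroll(down)", "type('hello')"])


# Hypothetical scorer: rewards clicking the search box on search tasks.
def toy_score(task: str, action: str) -> float:
    return 1.0 if "search" in task and "search_box" in action else 0.0


chosen = select_action(toy_propose, toy_score, "search for flights", n_candidates=8)
print(chosen)
```

Raising `n_candidates` spends more compute per step in exchange for a better chance that at least one proposal is a good one, which is the basic trade behind test-time scaling.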

WALT: Web Agents that Learn Tools Paper

WALT reverse-engineers website functionality into reusable tools like search, filter, and create. This shifts from fragile step-by-step interactions to reliable tool invocation with higher success rates and fewer steps on VisualWebArena and WebArena.

Authors: Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu

SCUBA: Salesforce Computer Use Benchmark Paper

SCUBA benchmarks computer-use agents on 300 real Salesforce CRM tasks across admin, sales, and service workflows. Open-source agents achieve less than 5% success versus 39% for closed-source models in zero-shot settings, improving to 50% with demonstrations while reducing time and costs by 13–16%.

Authors: Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu

CoAct-1: Computer-using Multi-agent System with Coding Actions Paper

CoAct-1 introduces a multi-agent system combining GUI control with programmatic execution. An Orchestrator delegates subtasks to GUI Operator or Programmer agents, achieving 60.76% success on OSWorld (a new state-of-the-art) while reducing average steps from 15 to 10.15.

Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong

Grounded Test-Time Adaptation for LLM Agents Paper

Parametric online adaptation aligns LLM agents to environment-specific formats, while non-parametric dynamics grounding learns causal state transitions through persona-driven exploration. Together they address syntactic and semantic mismatches and boost WebArena multi-site success from 2% to 23%.

Authors: Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong

Reasoning and Evaluation

Advancing how LLMs reason and how we measure that reasoning is central to building enterprise AI that works. This cluster addresses test-time scaling, verification dynamics, evaluator training, and efficient reasoning under constraints.

Nudging the Boundaries of LLM Reasoning Paper

NuRL overcomes a central RL limitation by using self-generated hints to unlock learning from previously ‘unsolvable’ problems, raising performance ceilings where standard methods like GRPO plateau, with consistent improvements across six benchmarks and three models.

Authors: Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu

Variation in Verification: Understanding Verification Dynamics in Large Language Models Paper

This paper analyzes how LLM verifiers assess solution candidates in test-time scaling, finding that weak generators can match stronger ones post-verification. Verification effectiveness depends on problem difficulty, generator strength, and verifier capability, revealing when verifier scaling reaches its limits.

Authors: Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty

Foundational Automatic Evaluators (FARE): Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper

FARE, trained on 2.5M samples, sets new standards for open-source evaluators. The 8B model rivals larger RL-trained models, while the 20B version surpasses 70B+ evaluators and achieves near-oracle reranking on MATH with 14.1% downstream RL improvements.

Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization Paper

Fine-tuned LLM judges struggle with future-proofing but handle backward compatibility well with DPO. Continual learning balances adaptation across response distributions, though all judges degrade on unseen questions as generators evolve.

Authors: Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty

Scalable Chain of Thoughts via Elastic Reasoning Paper

Elastic Reasoning separates chain-of-thought into thinking and solution phases with independent budgets, prioritizing solution completeness under constraints. The approach achieves strong performance with lower training costs and more concise reasoning across math and coding benchmarks.

Authors: Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
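To see why independent budgets protect the solution, compare a single shared token budget with separate thinking and solution budgets. This is a toy sketch over pre-tokenized lists, not the paper's decoding procedure. The function names and the budget values are assumptions chosen for illustration.

```python
from typing import List, Tuple

Tokens = List[str]


def shared_budget(thinking: Tokens, solution: Tokens, total: int) -> Tuple[Tokens, Tokens]:
    """One budget for both phases: long thinking can starve the solution."""
    kept_thinking = thinking[:total]
    remaining = max(0, total - len(kept_thinking))
    return kept_thinking, solution[:remaining]


def separate_budgets(
    thinking: Tokens, solution: Tokens, thinking_budget: int, solution_budget: int
) -> Tuple[Tokens, Tokens]:
    """Independent budgets: thinking is cut short, the solution keeps its own allowance."""
    return thinking[:thinking_budget], solution[:solution_budget]


thinking = [f"t{i}" for i in range(10)]  # 10 thinking tokens
solution = [f"s{i}" for i in range(3)]   # 3 solution tokens

shared = shared_budget(thinking, solution, total=8)
split = separate_budgets(thinking, solution, thinking_budget=5, solution_budget=3)

print(len(shared[1]), len(split[1]))
```

Both settings spend 8 tokens in total, but only the split version leaves room for the full solution. With the shared budget, thinking consumes everything and the solution is truncated to nothing.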

Learning to Reason over Continuous Tokens with Reinforcement Learning (HyRea) Paper

HyRea dynamically switches between explicit and latent reasoning via entropy-guided cold-start and GRPO fine-tuning, reducing token usage to roughly 60% while maintaining competitive accuracy across mathematical reasoning benchmarks.

Authors: Yiran Zhao, Yuhui Xu, Doyen Sahoo, Caiming Xiong, Junnan Li

Enhancing LLM Alignment with References Paper

Reference-guided evaluation improves LLM-based evaluators and enables effective semi-self-improvement, achieving 73.1% on AlpacaEval and 58.7% on Arena-Hard with Llama-3-8B-Instruct, comparable to fine-tuned reward models.

Authors: Kejian Shi, Yixin Liu, PeiFeng Wang, Alexander Fabbri, Shafiq Rayhan Joty, Arman Cohan

Deep Research Reliability

As AI systems take on complex research and knowledge synthesis tasks, rigorous evaluation of their outputs becomes critical. These papers establish new frameworks for auditing deep research quality and measuring citation-grounded reliability.

DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence Paper

DeepTRACE audits generative search engines and deep research agents, finding that they produce overconfident, one-sided responses with 20–60% of statements unsupported by their own cited sources.

Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu

LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild Paper

LiveResearchBench introduces 100 expert-curated tasks requiring real-time web search, paired with DeepEval for assessing citation-grounded reports. Evaluation of 17 systems reveals specific strengths, failure modes, and components needed for reliable deep research.

Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty

Knowledge Graphs and Retrieval

Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency Paper

SynthKG introduces ontology-free knowledge graph synthesis that distills into Distill-SynthKG for efficient single-step generation, surpassing models 8x larger in KG quality and outperforming baselines in retrieval and question-answering with a novel graph-based RAG framework.

Authors: Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu

LLM Behavior and Robustness

Understanding how LLMs behave under varying conditions, from multi-turn dialogue to internal circuit mechanisms, shapes how we build more reliable systems.

LLMs Get Lost in Multi-Turn Conversation Paper

LLMs show a 39% performance drop in multi-turn versus single-turn conversations across six tasks. Analysis of 200,000+ simulated conversations reveals that models make premature assumptions and fail to recover when they take wrong turns.

Authors: Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville

Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition Paper

Circuit analysis of off-by-one addition reveals a function induction mechanism where parallel attention heads emit distinct pieces of the +1 function. This reusable structure enables task-level generalization across shifted QA, base-8 addition, and other tasks.

Authors: Qinyuan Ye, Robin Jia, Xiang Ren

Efficiency and Scalability

Making models smaller, faster, and cheaper while preserving performance is critical for enterprise deployment at scale.

Entropy-Based Block Pruning for Efficient Large Language Models Paper

Entropy-based pruning outperforms cosine similarity methods by leveraging entropy patterns across Transformer blocks (decreasing early, then increasing) as a more effective measure of information richness for reducing model size while preserving accuracy.

Authors: Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
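As a rough illustration of using entropy as a pruning signal, the sketch below estimates each block's entropy from a histogram of its activations and marks the lowest-entropy blocks for removal. Everything here is an assumption for illustration: the histogram estimator, the function names `histogram_entropy` and `blocks_to_prune`, and the toy activations. The paper's actual entropy estimate and pruning rule are not described in this post.

```python
import math
from collections import Counter
from typing import List, Sequence


def histogram_entropy(values: Sequence[float], n_bins: int = 8) -> float:
    """Shannon entropy (in bits) of a histogram over the given values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant signal carries no information
    width = (hi - lo) / n_bins
    # Clamp the top edge into the last bin.
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def blocks_to_prune(block_activations: List[Sequence[float]], n_prune: int) -> List[int]:
    """Return indices of the n_prune blocks with the lowest estimated entropy."""
    entropies = [histogram_entropy(acts) for acts in block_activations]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i])
    return sorted(ranked[:n_prune])


# Hypothetical activations for three blocks.
activations = [
    [0.5, 0.5, 0.5, 0.5],      # constant: lowest entropy
    [0, 1, 2, 3, 4, 5, 6, 7],  # evenly spread: highest entropy
    [0, 0, 1, 1],              # two values: in between
]
print(blocks_to_prune(activations, n_prune=1))
```

The sketch treats low entropy as a sign that a block contributes little information, which is consistent with the description of entropy as a measure of information richness, but the choice of which blocks to drop in practice belongs to the paper.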

OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs Paper

OFTSR achieves one-step image super-resolution with a tunable fidelity-realism trade-off by aligning student predictions to teacher model sampling trajectories, reaching state-of-the-art performance on FFHQ, DIV2K, and ImageNet without multi-step overhead.

Authors: Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang

Scaling Reinforcement Learning

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels Paper

Webscale-RL introduces a scalable pipeline converting pre-training documents into 1.2M verifiable QA pairs across 9+ domains. RL training on this dataset achieves continual pre-training performance with 100x fewer tokens, suggesting RL can reach pre-training performance at a fraction of the data cost.

Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao

Software Engineering

SweRank: Software Issue Localization with Code Ranking Paper

SweRank introduces an efficient retrieve-and-rerank framework for software issue localization, trained on the SweLoc dataset. It achieves state-of-the-art performance on SWE-Bench-Lite and LocBench while outperforming costly agent-based systems that rely on closed-source LLMs.

Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty
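The retrieve-and-rerank pattern can be sketched in a few lines: a cheap first stage narrows the codebase to a shortlist, and a second stage reorders only that shortlist. The example below substitutes simple token-overlap scoring for both stages. SweRank itself uses trained retriever and reranker models, and the function names and toy codebase here are hypothetical.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(issue: str, functions: Dict[str, str], top_k: int) -> List[str]:
    """Stage 1: cheap shortlist by raw token overlap with the issue text."""
    query = tokenize(issue)
    ranked = sorted(
        functions.items(),
        key=lambda item: len(query & tokenize(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


def rerank(issue: str, candidates: List[str], functions: Dict[str, str]) -> List[str]:
    """Stage 2: reorder only the shortlist with a (stand-in) finer score."""
    query = tokenize(issue)

    def jaccard(name: str) -> float:
        doc = tokenize(functions[name])
        union = query | doc
        return len(query & doc) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)


# Hypothetical codebase: function name -> short description.
codebase = {
    "load_config": "parse config file and return settings",
    "parse_args": "parse command line args",
    "render_page": "render html page template",
}
issue = "crash when parsing empty config file"

shortlist = retrieve(issue, codebase, top_k=2)
localized = rerank(issue, shortlist, codebase)
print(localized)
```

The design point is that the expensive scorer only ever sees `top_k` candidates, so its cost stays fixed as the codebase grows.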

Visit Us at ICLR 2026

Our researchers will present throughout the conference. Stop by booth #203 or check our ICLR schedule for specific session times. We’ll also share updates throughout the week on Bluesky and X.
