Meet SCUBA: The Next Frontier in Enterprise-Agent Evaluation

On the planet of AI brokers that click on, scroll, execute and automate — we’re shifting quick from “simply perceive textual content” to “really use software program for you.” The brand new benchmark SCUBA tackles precisely that: how effectively can brokers do actual enterprise workflows contained in the Salesforce platform?

What makes SCUBA stand out:

It’s constructed across the precise workflows contained in the Salesforce platform.
It covers 300 process cases derived from actual person interviews (platform admins, gross sales reps, and repair brokers).
The duties check not simply “does the mannequin reply the query” however “can the mannequin use the UI, manipulate knowledge, set off workflows, troubleshoot points.”
It addresses a spot: present benchmarks typically deal with net navigation and software program manipulation — however enterprise-software “pc use” is tough to measure. SCUBA goals to fill that.

Key Takeaway: If you need brokers that don’t simply chat, however act in enterprise software program, it is a huge step.

The Enterprise Influence

Think about an AI assistant that may navigate your CRM, replace data, launch workflows, interpret dashboard failures, and assist your service staff get unstuck. That’s the imaginative and prescient this paper leans into.

Right here’s why it’s compelling:

Enterprise alignment: Many benchmarks are tutorial or consumer-web oriented. SCUBA places the highlight on business-critical environments (admin, gross sales, and repair).
Lifelike duties: By deriving duties from person interviews and real personas, it bridges the hole between “toy benchmark” and “stay person state of affairs.”
Measurable agent efficiency in context: It allows analysis of how effectively an agent operates inside software program techniques, not simply by way of textual content.
Roadmap for future AI assistants: As extra organizations undertake AI to automate software program use (not simply evaluation), benchmarks like this set expectations, spotlight challenges, and direct progress.

For companies like Salesforce (and their prospects) the implications are clear: higher agent tooling, fewer guide clicks, quicker concern decision, extra environment friendly gross sales/service groups. For the AI neighborhood: a brand new frontier of “process execution in UI” reasonably than “simply textual content reasoning”.

Key Insights:

1. Actual-world area shift is tough

The efficiency drop when shifting from the extra generic OSWorld benchmark (which covers desktop functions) to SCUBA (CRM, enterprise workflows) is important. The experiment reveals a chart of drop in success charges when shifting the benchmark.

2. Demonstrations Assist

Data articles and tutorials on how one can use salesforce platforms are simply accessible. One pure query is whether or not AI brokers can leverage this info successfully like people do. The experiment outcomes reveal that

Human demonstrations (exhibiting the agent how one can do the same process) improved efficiency throughout most brokers: larger success charges, decrease time, decrease token utilization （please see the technical report for extra particulars). However, some brokers didn’t profit as a lot.
Additionally, some ended up utilizing extra steps in demonstration-augmented mode (for instance as a consequence of discovering “shortcuts” that the human demo didn’t present). So the design of demonstrations nonetheless issues.

3. Price, latency, and sensible deployment matter

Success charge will not be the one metric; latency (time to finish duties) and price (API/token prices, variety of steps) are additionally reported. As an example, browser-use brokers had excessive success charges however larger latency (as a consequence of API service response time & multi-agent framework design).
Demonstration augmentation not solely improves success however can scale back time and prices (the paper reviews ~13% decrease time, ~16% decrease price within the demonstration-augmented setting).
For enterprise adoption, this issues: an agent that succeeds however is just too sluggish or too expensive could also be much less helpful in follow.

Implications for the Way forward for CRM Automation:

Ran Xu
Director, AI Analysis

Ran Xu acquired his Ph.D. in pc science from College at Buffalo from 2015. Presently, he leads a gaggle of remarkable pc imaginative and prescient and multimodal AI researchers at Salesforce to push the boundary of analysis and productive AI for CRM.

Extra by Ran

Zeyuan Chen
Senior Supervisor, Analysis

Zeyuan Chen is a Senior Supervisor of Analysis at Salesforce AI Analysis, the place he has been contributing since 2019. His work focuses on advancing pc imaginative and prescient, machine studying, multimodal AI, AI brokers, and workflow automation by means of code era and knowledge visualization. He holds a Bachelor’s
Learn Extra

Extra by Zeyuan

Source link

What's Hot

Intel and AMD unveil new x86 standard to make CPUs better at running AI models

Bank of Korea warns chip workers’ massive bonuses may be inflation concern

Can You Still Succeed With Weekend Trades?

Meet SCUBA: The Next Frontier in Enterprise-Agent Evaluation

Use Psychology of Color in Marketing to Boost Your Results

Every Team Feature in Buffer, Built for Agencies

Instacart Teases the 3 Ad Formats Coming to Its AI Shopping Assistant

Announcing the Final Batch of Speakers for MozCon NYC 2026

Intel and AMD unveil new x86 standard to make CPUs better at running AI models

Bank of Korea warns chip workers’ massive bonuses may be inflation concern

Can You Still Succeed With Weekend Trades?

10 Innovative Customer Engagement Ideas and Strategies to Boost Loyalty

The Weekly Notable Startup Funding Report: 6/22/26 – AlleyWatch

Use Psychology of Color in Marketing to Boost Your Results

What's Hot

Meet SCUBA: The Next Frontier in Enterprise-Agent Evaluation

What makes SCUBA stand out:

The Enterprise Influence

Key Insights:

1. Actual-world area shift is tough

2. Demonstrations Assist

3. Price, latency, and sensible deployment matter

Implications for the Way forward for CRM Automation:

Related Posts