
Salesforce research lays the foundations for more reliable enterprise AI agents


The value of AI agents, systems that can carry out tasks for humans, is evident: they promise productivity gains, especially for businesses. However, the inconsistent performance of large language models (LLMs) can hinder the effective deployment of agents. Salesforce's AI research team seeks to address that issue. 

Also: 60% of AI agents work in IT departments – here’s what they do every day

On Thursday, Salesforce launched its inaugural Salesforce AI Research in Review report, highlighting the tech company’s innovations, including new foundational developments and research papers from the past quarter. Salesforce hopes these pieces will help support the development of trustworthy and capable AI agents that can perform well in business environments. 

“At Salesforce, we call these ‘boring breakthroughs’ — not because they’re unremarkable, but because they’re quietly capable, reliably scalable, and built to endure,” said Silvio Savarese, Salesforce’s chief scientist and head of AI research. “They’re so seamless, some might take them for granted.”

Also: The 4 types of people interested in AI agents – and what businesses can learn from them

Let’s dive into some of the biggest breakthroughs and takeaways from the report.

The problem: Jagged intelligence 

If you have ever used AI models for everyday, simple tasks, you may be surprised at the rudimentary nature of some of their mistakes. Even more puzzling is that the same model that got your basic questions wrong can perform extremely well on benchmarks that test its capabilities in highly complex topics, such as math, STEM, and coding. This paradox is what Salesforce refers to as "jagged intelligence."

Salesforce notes that this "jaggedness," or the discrepancy between an LLM's raw intelligence and its real-world performance, is particularly challenging for enterprises that require consistent operational performance, especially in unpredictable environments. However, addressing the problem means first quantifying it, which highlights another issue. 


“Today’s AI is jagged, so we need to work on that – but how can we work on something without measuring it first?” said Shelby Heinecke, senior AI research manager at Salesforce. 

Also: Why neglecting AI ethics is such risky business – and how to do AI right

That is exactly the issue that Salesforce’s new SIMPLE benchmark is addressing. 

Benchmarks

Salesforce's SIMPLE public dataset features 225 reasoning questions that are straightforward for humans to answer, yet ones that jagged LLMs frequently get wrong, giving researchers a concrete way to quantify jaggedness. To give you an idea of just how basic the questions are, the dataset card on Hugging Face describes the problems as being "solvable by at least 10% of high schoolers given a pen, unlimited paper, and an hour of time." 
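If you want to probe a model of your own, the evaluation loop behind a dataset like this is simple. Here is a minimal Python sketch; the dataset ID, split, and column names are my assumptions, so check the dataset card on Hugging Face for the real identifiers.

```python
from datasets import load_dataset

# Hypothetical dataset ID, split, and column names -- consult the actual
# SIMPLE dataset card on Hugging Face for the real ones.
simple = load_dataset("Salesforce/SIMPLE", split="test")

def accuracy(answer_fn):
    """Score a model callable by exact match against reference answers."""
    correct = sum(
        answer_fn(row["question"]).strip() == row["answer"].strip()
        for row in simple
    )
    return correct / len(simple)
```

Exact-match scoring is the bluntest possible check; a real harness would normalize answers or use a grader. But the premise of a dataset this basic is that even a crude check exposes jagged behavior.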

Despite not testing for highly complex tasks, the SIMPLE benchmark should help people understand how a model reasons in real-world environments and applications, especially when developing Enterprise General Intelligence (EGI), Salesforce's term for competent AI systems that handle business applications reliably. 

Another benefit is that the benchmark should build trust among business leaders considering AI systems, such as AI agents, for their organizations, since they will have a much clearer picture of how consistently a model performs.

Another Salesforce benchmark, ContextualJudgeBench, takes a different approach: it evaluates the AI judges rather than the models themselves. Model benchmarks often rely on assessments made by other AI models, so ContextualJudgeBench focuses on the LLMs that evaluate other models, on the premise that if the evaluator is trustworthy, its evaluations will be too. The benchmark tests judges across more than 2,000 response pairs. 
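To make that concrete, here is a minimal sketch of the pairwise-judge setup such a benchmark measures: the judge LLM reads a context and two candidate responses and must pick the better one. The prompt wording, field names, and `call_judge` function are illustrative stand-ins, not ContextualJudgeBench's actual format.

```python
# The judge model under test is wrapped in `call_judge`, a hypothetical
# callable that takes a prompt string and returns the judge's reply.
PROMPT = (
    "Given the context, decide which response is better.\n"
    "Context: {context}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    'Answer with exactly "A" or "B".'
)

def judge_agreement(pairs, call_judge):
    """Fraction of response pairs where the judge matches the gold verdict."""
    hits = 0
    for pair in pairs:
        verdict = call_judge(PROMPT.format(
            context=pair["context"],
            response_a=pair["response_a"],
            response_b=pair["response_b"],
        ))
        hits += int(verdict.strip().upper().startswith(pair["gold"]))
    return hits / len(pairs)
```

A judge that scores well on checks like this earns more trust as an automated evaluator, which is exactly the chain of reliability the benchmark is after.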

CRMArena 

During the past quarter, Salesforce launched an agent benchmarking framework, CRMArena. The framework evaluates how well AI agents perform customer relationship management (CRM) tasks, such as summarizing sales emails and transcripts or making commerce recommendations. 

"These agents don't need to solve theorems, don't need to turn my prose into Shakespearean verses – [they] need to really focus on those critical enterprise needs across different industry verticals," said Savarese. 

Also: How an ‘internet of agents’ could help AIs connect and work together

CRMArena is meant to address the issue of organizations not knowing how well models perform at practical business tasks. Beyond comprehensive testing, the framework should help improve AI agents’ development and performance. 
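In spirit, that kind of testing boils down to task-level checks like the sketch below: hand the agent a realistic CRM task and verify the outcome. The `run_agent` callable and the task fields are hypothetical illustrations, not CRMArena's actual API.

```python
# Hypothetical harness: each task bundles an instruction, the CRM records
# it operates on, and a task-specific correctness check.
def score_crm_tasks(tasks, run_agent):
    """Fraction of CRM tasks (e.g., summarize a sales email thread,
    recommend a product) that the agent completes correctly."""
    passed = 0
    for task in tasks:
        output = run_agent(task["instruction"], task["records"])
        passed += int(task["check"](output))
    return passed / len(tasks)
```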

Other notable mentions

The full report includes further research to help improve AI model efficiency and reliability. Here’s a super-simplified summary of some of those highlights: 

  • SFR-Embedding: Salesforce enhanced its SFR-Embedding model, which converts text-based information into structured numerical representations for AI systems, such as agents. The company also added SFR-Embedding-Code, a specialized code-embedding family of models (a usage sketch follows this list). 
  • SFR-Guard: A family of models trained to evaluate AI agents' behavior in key trust areas, such as toxicity detection and prompt injection. 
  • xLAM: Salesforce updated its xLAM (Large Action Model) family with “multi-turn conversation support and a wider range of smaller models for increased accessibility.” 
  • TACO: This multimodal family of models generates chains of Thought-and-Action (CoTA) to tackle complex, multi-step problems. 
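Since embeddings underpin several of these releases, here is a minimal sketch of what an embedding model like SFR-Embedding does in practice: map text to vectors that an agent can search over. It assumes the sentence-transformers library, and the model ID is illustrative; check the SFR-Embedding release on Hugging Face for the actual checkpoints and recommended usage.

```python
from sentence_transformers import SentenceTransformer

# Illustrative model ID -- look up the released SFR-Embedding checkpoints.
model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")

docs = [
    "Q3 renewal pipeline grew 12% quarter over quarter.",
    "Customer reported a login outage on the support portal.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # one vector per doc
query_vec = model.encode("Which accounts had support issues?",
                         normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
scores = doc_vecs @ query_vec
print(docs[scores.argmax()])  # should surface the support-outage document
```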
