ChatGPT isn’t just for chatting anymore – now it will do your work for you

Elyse Betters Picaro / ZDNET

Not too long ago, I wrote that AI agents were the future of AI: tools that could carry out tasks for you, like ordering groceries or booking meetings. OpenAI’s latest launch makes that reality appear a bit closer.  

On Thursday, during a live stream, OpenAI launched a ChatGPT agent, which the company claims can handle complex tasks for you from start to finish. Some examples OpenAI provided were looking at your calendar and writing a briefing based on your upcoming events, or even planning and buying ingredients for a meal you were thinking of cooking. Let’s dive in. 

How it works

OpenAI’s most cutting-edge features, including Operator and deep research, gave the public a taste of the company’s agentic capabilities and now power this new agent mode. Operator, which launched in January, was created to interact directly with a web browser to carry out actions for you, while deep research is an agentic feature that can search the web for you and compose, in minutes, a detailed report that would otherwise take humans hours.

After noticing that many of the queries being fed to Operator were a better fit for deep research, OpenAI decided to combine the two in this new experience and add a few new tools.

For starters, the ChatGPT agent has several tools at its disposal: a visual browser that interacts with the web through a graphical user interface (GUI), a text-based browser, a terminal, and direct API access, according to OpenAI’s blog post. It also uses ChatGPT connectors, a feature that allows users to connect apps like Gmail and GitHub to ChatGPT so it can pull relevant information to fulfill their requests.

With all of those different sources of information, ChatGPT is able to reason through which one is best for the task at hand and pull information accordingly. The agent does this processing on its own virtual computer and distinguishes between reasoning and action based on human instruction, which allows it to retain context while pulling from multiple tools.

ChatGPT Agent is flexible and steerable. It allows you to interrupt a request mid-process and collaborate with it to give clearer instructions that better suit your desired outcome. Even though it will use the new information, it won’t lose track of the older instructions, allowing users to take advantage of added context. It will also ask you for further details and clarifications needed to carry out the task at hand.

What can you do with ChatGPT’s agent?

The possibilities are endless. You can automate tasks as simple as scheduling an appointment for yourself at your favorite salon, or as complex as updating a spreadsheet with new financial data while keeping the formatting you want.

During the live demo, the ChatGPT Agent was asked to look for a pair of black men’s dress shoes in size 9.5, start the process of creating and ordering merch from an image of a pet, handle some aspects of wedding planning, and even pull from Google Drive to create slides. 

If all goes according to plan, having AI book a trip for you or rearrange your meeting schedule could be made possible through OpenAI’s ChatGPT Agent (and competitors like it). Ultimately, only time and testing will tell how executable those functions are, but in theory, it should be as simple as conversationally asking for what you want done and letting the AI handle the rest.

Privacy and security

An AI that can access your personal information and take action for you naturally raises security and privacy concerns. OpenAI addresses these head-on, offering a whole page within the blog post dedicated to these concerns, in addition to the usual model card. OpenAI says it has added safeguards for challenges uncovered in the Operator research preview, such as handling sensitive information on the live web and limited terminal network access.

During the live stream, OpenAI stated that part of what makes the model so capable is that it can browse the internet, but that the internet is a “very scary place.” In particular, the company was most concerned about prompt injection, in which malicious instructions hidden in web content hijack the agent; for example, an agent using your credit card information to place an order on a website could fall victim to a scam. While ChatGPT agent was trained to help detect phishing attempts, the company still emphasized the risks to users.

OpenAI says it has also added safeguards that address the specific risks agents are exposed to. The company warns that even though the agent can do a range of complex tasks well, it can also make mistakes. One current limitation is slideshow creation, which OpenAI says is still in beta.

It is worth reviewing the blog post and model card to fully understand limitations and security risks.

Model benchmarks

As with all model releases, OpenAI tested its new agent against different benchmarks, or industry-standard evaluations. While most of the agent’s scores were impressive, one of the most notable was its performance on Humanity’s Last Exam (HLE), an evaluation that consists of 3,000 text and multi-modal questions on more than 100 subjects. According to OpenAI’s blog, the model behind ChatGPT agent scored 41.6, a new state-of-the-art mark.

The agent also performed well on FrontierMath, one of the hardest math benchmarks, achieving 27.4% accuracy, which outperforms previous OpenAI models. On SpreadsheetBench, which looks at how models edit spreadsheets derived from real-world scenarios, ChatGPT Agent beat existing models by a “significant margin,” according to the blog, and when editing spreadsheets directly, it outperformed Microsoft Copilot in Excel, scoring 45.5% compared to Copilot’s 20%.

Other evaluations cited in the blog post included an internal benchmark on first-to-third-year investment banking analyst modeling tasks, as well as BrowseComp, a benchmark that looks at how agents locate hard-to-find information on the web.

The model also performed competitively against humans. In an internal benchmark OpenAI designed to evaluate model performance on “complex, economically valuable knowledge-work tasks,” ChatGPT agent’s performance was comparable to or better than that of humans in about half the cases across different task completion times, outperforming o3 and o4-mini. ChatGPT agent also outperformed humans by a “significant margin” on DSBench, a benchmark that tests agents on realistic data science tasks.

Who can access the ChatGPT agent, and how?

OpenAI’s most cutting-edge features are typically limited to the highest-paying users at launch, but the company is making ChatGPT Agent available to Pro, Plus, and Team users. Pro users will get access by the end of the day, while Plus and Team users will have it within the next few days, and Enterprise and Education users within the coming weeks.

Pro users have the highest allowance, at 400 messages per month, while other paid users get 40 messages monthly and can add more through flexible credit-based options.

To activate the feature, users simply select “agent mode” from the tools dropdown during a conversation with the chatbot.

Source: Information Technologies - zdnet.com