ZDNET’s key takeaways
- Opus 4.5 failed half my coding tests, despite bold claims
- File handling glitches made basic plugin testing nearly impossible
- Two tests passed, but reliability issues still dominate the story
I’ve got to tell you: I’ve had fairly okay coding results with Claude’s lower-end Sonnet AI model. But for whatever reason, its high-end Opus model has never done well on my tests.
Usually, you expect the super-duper coding model to code better than the cheap seats, but with Opus, not so much.
Also: Google’s Antigravity puts coding productivity before AI hype – and the result is astonishing
Now, we’re back with Opus 4.5. Anthropic, the company behind Claude, claims, and I quote, “Our newest model, Claude Opus 4.5, is available today. It’s intelligent, efficient, and the best model in the world for coding, agents, and computer use.”
The best model in the world for coding? No, it’s not. At least not yet.
Those of you who’ve been following along know that I have a standard set of four fairly low-end coding tests I put the AI models through on a regular basis. They test a bunch of very simple skills and framework knowledge, but they can sometimes trip up the AIs.
Also: How I test an AI chatbot’s coding ability – and you can, too
I’ll give you the TL;DR right now. Opus 4.5 crashed and burned on the first test, turned in a mediocre, not-quite-good-enough answer on the second, and passed the remaining two. With a 50% score, we’re definitely not looking at “the best model in the world for coding.”
Let’s dig in, and then I’ll wrap up with some thoughts.
Test 1: Writing a WordPress plugin
Test 1 asks the AI to build a simple WordPress plugin that presents an interface in the admin dashboard and then randomizes a list of names. The only hard part is that if the same name appears more than once, the duplicate copies have to be kept apart in the randomized order, while every name still shows up in the list.
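To make that requirement concrete, here's a minimal sketch of the kind of duplicate-separating shuffle the test calls for. This is my own illustration, not the code Opus 4.5 produced, and the function name is mine.

```javascript
// My own sketch of the "randomize, but keep duplicates apart" logic the test
// asks for. This is an illustration, not the plugin code Opus 4.5 generated.
function randomizeLines(lines) {
  // Count how many times each name appears.
  const counts = new Map();
  for (const line of lines) {
    counts.set(line, (counts.get(line) || 0) + 1);
  }

  // Shuffle the distinct names (Fisher-Yates) so each run comes out differently.
  const distinct = [...counts.keys()];
  for (let i = distinct.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [distinct[i], distinct[j]] = [distinct[j], distinct[i]];
  }

  // Deal names out round-robin, most frequent first, filling even slots and
  // then odd ones, so copies of the same name land as far apart as the mix
  // allows, while every name still appears in the output.
  distinct.sort((a, b) => counts.get(b) - counts.get(a));
  const result = new Array(lines.length);
  let slot = 0;
  for (const name of distinct) {
    for (let k = 0; k < counts.get(name); k++) {
      result[slot] = name;
      slot += 2;
      if (slot >= lines.length) slot = 1; // wrap around into the odd slots
    }
  }
  return result;
}
```

That's roughly the logic you'd expect to find somewhere in the plugin's JavaScript; the real question is whether the AI wires it up so it actually runs.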
Also: The best free AI for coding in 2025 – only 3 make the cut now
Opus 4.5 went to town writing this plugin. I’ve seen builds that were done in a single, simple PHP file and worked just fine. But it is possible to use a mix of PHP for the back end, JavaScript for the interactive bits, and CSS for styling. That’s what Opus did.
Opus wrote a 312-line PHP file, a 178-line JavaScript file, and a 133-line CSS file. Or, at least it did the second time around.
For its first trick, Opus 4.5 combined all three files into one that it said I could download and simply install. Except I couldn’t download the file. I tried a few times, and Opus 4.5 kept responding with “Failed to download files.”
Then I tried getting at the files using the Files Workspace. I clicked on “View the Line Randomizer plugin folder” in the Opus 4.5 response window, only to get a large, empty screen with the phrase “No file content available.”
Okay, fine. After I pasted in my original test prompt, I watched Opus 4.5 display the code as it was being generated. Once it finished, the code was hidden. Presumably, Opus 4.5 just expected the download to work.
To get at the actual code, I had to ask Opus 4.5:
Give me each of the three files separately, so I can cut and paste them from here.
It did. The PHP code was in its own little window area, where I could cut it out and paste it into my text editor. So was the CSS code. But the JavaScript code included some documentation (not commented out) about the recommended file structure.
Had I not quickly taken a look at the whole file’s code to see what it was doing, I might have just tried running that. Without a doubt, that would have resulted in a fail.
Also: OpenAI’s Codex Max solves one of my biggest AI coding annoyances – and it’s a lot faster
There was, however, some good news. After all that fussing and removing the spurious documentation lines that would have killed it, I did manage to get the WordPress plugin to load and present a user interface.
Given that it was being styled by 133 lines of CSS, you would think it might look a little better, but hey, at least something worked. Well, not really.
Once I pasted in my test names, I clicked on Randomize Lines. Nothing happened. Clear All didn’t work either.
Also: How to vibe code your first iPhone app with AI – no experience necessary
Let’s recap just how many ways this failed. It wouldn’t download when it told me it was giving me a download link. Then I asked for the code separately so I could cut and paste it, and it mixed its chatbot commentary into the code. Then, when I pulled that out and ran the test, the plugin presented a UI, but the code behind it didn’t actually run.
As the Mythbusters used to say, “Failure is always an option.”
Test 2: Rewriting a string function
Test 2 asks the AI to fix a simple bit of JavaScript that incorrectly validates the entry of dollars and cents currency. What I feed the AI is code that won’t allow for any cents. It’s supposed to give back working code.
The idea of this function is that it checks for user input. It was originally in a donation plugin, so its job was to make sure the donor was actually typing in an amount that could be qualified as a donation amount, and wouldn’t break on someone entering letters or numbers incorrectly.
Also: How to use ChatGPT to write code – and my top trick for debugging what it generates
The code Opus 4.5 gave back rejected too many edge cases. It didn’t allow "12." (two digits followed by a decimal point), although that would clearly work as $12. It didn’t allow ".5", although that would clearly work for 50 cents. It didn’t like "000.5", although it did accept "0.5". And if someone typed "12.345", it didn’t chop off the last half a cent (or round it up). It just rejected the entry.
Oh, and if there was no value passed to it, or the string value it was asked to test was actually null (an empty value), the code would crash. Not just return an error, but crash.
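For reference, here is a rough sketch of the behavior I'm looking for from a fixed version. It's my own illustration, with a made-up function name and return convention; it isn't the code I feed the AI or the code Opus 4.5 returned.

```javascript
// A rough sketch of acceptable behavior for the currency check. The function
// name and return convention are my own, not from the test code or Opus 4.5.
function validateDonation(input) {
  // A missing or non-string value should be rejected cleanly, not crash.
  if (typeof input !== 'string') return null;

  let s = input.trim().replace(/^\$/, '');

  // Digits with an optional decimal part, or a bare decimal fraction.
  // Accepts "12.", ".5", "000.5", and plain "12".
  if (!/^(\d+\.?\d*|\.\d+)$/.test(s)) return null;

  // Chop anything past whole cents, so "12.345" becomes "12.34".
  const dot = s.indexOf('.');
  if (dot !== -1) s = s.slice(0, dot + 3);

  const amount = parseFloat(s);
  return amount > 0 ? amount : null; // a zero donation is rejected too
}

// validateDonation("12.")    -> 12
// validateDonation(".5")     -> 0.5
// validateDonation("000.5")  -> 0.5
// validateDonation("12.345") -> 12.34
// validateDonation(null)     -> null, no crash
```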
That gives “the best model in the world for coding” its second failure.
Tests 3 and 4
Test 3 asks the AI to identify what’s causing a bug in code, but it requires fairly good framework knowledge of how PHP and WordPress work. It’s a multi-step analysis, where what seems obvious isn’t the problem. The bug is baked deeper into how the framework works.
Opus 4.5 passed this test just fine.
Also: Why AI coding tools like Cursor and Replit are doomed – and what comes next
Test 4 asks the AI to work with three things: AppleScript, Chrome, and a macro utility called Keyboard Maestro. Basically, it asks for a Keyboard Maestro macro that uses AppleScript to find and activate a specific tab in Chrome.
Surprisingly, given that this test often trips up the AIs, Opus 4.5 aced this question. It understood Keyboard Maestro, and it didn’t make the usual case sensitivity errors other AIs have made in the past.
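For readers who haven't used these tools, here's a rough sketch of the tab-finding logic involved. I've written it in JavaScript for Automation (JXA) rather than the AppleScript the test actually asks for, and the tab title being searched for is a made-up example.

```javascript
// A rough JXA (JavaScript for Automation) sketch of finding and activating a
// Chrome tab by title. The real test uses AppleScript driven by a Keyboard
// Maestro macro; the "invoice" search string here is just a made-up example.
const chrome = Application('Google Chrome');
const target = 'invoice';

const windows = chrome.windows();
outer:
for (let w = 0; w < windows.length; w++) {
  const tabs = windows[w].tabs();
  for (let t = 0; t < tabs.length; t++) {
    // Compare case-insensitively; case sensitivity is where other models slip up.
    if (tabs[t].title().toLowerCase().includes(target)) {
      windows[w].activeTabIndex = t + 1; // Chrome's tab index is 1-based
      windows[w].index = 1;              // bring that window to the front
      chrome.activate();
      break outer;
    }
  }
}
```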
Bottom line for Opus 4.5
Opus 4.5 is supposed to be Anthropic’s grand work. In the agentic environment with Claude Code, and supervised by a professional programmer willing to ask Claude to rewrite its result until the code works, it might be pretty good.
I’ve been using Claude Code and Sonnet 4.5 in the agentic terminal interface with pretty impressive results. But the results are not always correct. I sometimes have to send Claude back to work three, four, five, six, even ten times to get a workable answer.
Here, for this article, I just tested Opus 4.5 in the chatbot. I did send it back once to give me code I could actually access. But overall, it failed 50% of the time. Plus, in my first test, it demonstrated that it just isn’t ready for use through a simple chatbot interface.
Also: GitHub’s new Agent HQ gives devs a command center for all their AI tools – why this is a huge deal
I’m sure Anthropic will improve this over time, but as of today, I certainly can’t report that Opus 4.5 is ready for prime time. I shot a note out to Anthropic asking for comment. If the company gets back to me, I’ll update this article with its response.
Stay tuned.
Have you tried Opus 4.5 or any of Anthropic’s other models for hands-on coding work? How do your results compare with what I found here? Have you run into similar issues with file handling or code reliability, or has your experience been smoother? And where do you think these “best model in the world for coding” claims land based on your own testing? Share your thoughts in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
