ZDNET’s key takeaways
- GPT-5 Pro delivers the sharpest, most actionable code analysis.
- A detail-focused prompt can push base GPT-5 toward Pro results.
- o3 remains a strong contender despite being a GPT-4-generation model.
With the big news that OpenAI has released GPT-5, the team here at ZDNET is working to learn about and communicate its strengths and weaknesses. In another article, I put its programming prowess to the test and came up with a less-than-impressive result.
Also: I tested GPT-5’s coding skills, and it was so bad that I’m sticking with GPT-4o
When Deep Research first appeared with the OpenAI o3 LLM, I was quite impressed with what it could understand from examining a code repository. I wanted to know how well it understood the project just from the available code.
In this article, I’m looking at how well the three GPT-5 variants handle that same code repository, and how they stack up against o3. We’ll dig in and compare them. The results are quite interesting. Here are the four models.
- o3: a GPT-4-generation model optimized for reasoning.
- GPT-5: OpenAI’s new main ChatGPT model, available to all tiers, including free.
- GPT-5 Thinking: A variant of GPT-5 that OpenAI says is optimized for “architectural reflection.” It is available in the $20/mo Plus and $200/mo Pro tiers.
- GPT-5 Pro: OpenAI’s current $200/mo top-tier model, with the highest reasoning and context capabilities.
I gave all four models the same assignment. I connected them to my private GitHub repository for my free, open-source WordPress security plugin and its freemium add-on modules, selected Deep Research, and gave them this prompt.
Examine the repository and learn its structure and architecture. Then report back what you’ve learned.
For models that asked me to choose which areas I wanted detail on, I followed up with this prompt.
Everything you can tell me, be as comprehensive as possible.
As you can see, I didn’t provide any context other than the source code repo itself. That code has a README file, as well as comments throughout the code, so there was some English-language context. But most of the context has to be derived from the folder structure, file names, and code itself.
Also: The best AI for coding in 2025 (and what not to use)
From that, I hoped the AIs would assess the project’s structure, quality, security posture, and extensibility, and possibly suggest improvements. This should be relevant to ZDNET readers because it’s the kind of high-judgment, detail-oriented work that AIs are being used for. It certainly can make coming up to speed on an existing coding project easier, or at least provide a foundation for initial understanding.
TL;DR summary
Other than the two prompts above, I didn’t give the LLMs any guidance about what to tell me. I wanted to see how they evaluated the repository and what sort of analysis they could provide.
As you can see from this table, overall coverage varied considerably in scope. More checks indicate greater depth of coverage.
To create this aggregate, topics like “Project Purpose & Architecture,” “System Architecture,” and “Plugin Design & Integration” were all normalized under Purpose/Architecture. Directory/File Structure covers any section mapping folders and files. Execution Flow combines anything about how the code runs. Recommendations/Issues combines all discussions of modernization suggestions, open issues, and minor red flags.
In terms of overall value, I’d rank the four LLMs as follows (from best to worst).
- GPT-5 Pro: Most precise, engineering-ready, and actionable.
- GPT-5: Widest scope, excellent mapping, and defensive-coding insight.
- o3: Concise, modernization-focused, but lighter on underlying architecture.
- GPT-5 Thinking: Best onboarding narrative, least evaluative depth.
Pro, of course, is only available in the $200/mo ChatGPT Pro tier. Later in this article, I’ll show one way to modify the above prompts to get GPT-5 (non-Pro) to provide a fairly close approximation of the overall depth of the Pro response.
GPT-5 Thinking, which is available in the $20/mo Plus plan, was the least helpful of the group. The GPT-4-generation o3 model still holds up, but you can see how its self-directed focus is a bit different from that of the GPT-5 models.
Also: Google’s Jules AI coding agent built a new feature I could actually ship – while I made coffee
My main conclusion is that I was a bit surprised by how close the models were to each other. GPT-5, as OpenAI promised, did seem to provide a jump in overall cognition and usefulness, but nothing I would consider game-changing.
With that, let’s dive into some specific examples that help illustrate my conclusions. Each of these sections is pulled from the various reports generated and shows you how each model provided similar information.
Security posture, according to the models
Below, I’ve provided exact snippets from the reports generated by all four models. GPT-5 Pro names exact mechanisms (like the file-top guard, nonces, and manage_options). GPT-5 affirms best practices but keeps it conceptual. o3 describes what happens (redirects/login flow) more than how it’s hardened. GPT-5 Thinking gives the clearest “what runs when” story for new developers coming up to speed on the repo.
GPT-5 Pro: Most concrete, code-level
“It guards against direct file access (if (!defined('ABSPATH')) exit; at the top of PHP files). It sanitizes input where appropriate … using WordPress nonces in AJAX handlers. It uses capability checks (e.g., adding menu pages only for users with manage_options capability … only admins see those tools). The code tries not to load unnecessary things … like only loading certain admin files on certain contexts.”
GPT-5: Correct, but higher-level
“There are checks for WordPress functions before use … so the plugin behaves gracefully even on very old WordPress setups. The plugins often guard against direct file access by checking … to prevent security issues from accessing .php files directly. Add-ons verify the presence of core before proceeding … and show an admin error if CMB2 isn’t loaded.”
o3: Runtime behavior, less on hardening specifics
“Purpose: My Private Site locks down an entire site so only logged-in users can view content … while protecting the rest. Overall architecture: [it] integrates deeply with WordPress’s hook system and login/logout events to manage redirects and track login state.”
GPT-5 Thinking: Clear execution flow, onboarding tone
“Admin vs Front-end: It checks is_admin() to determine context. If on the front-end (not admin), it retrieves the saved privacy setting and, when enabled, hooks at a point like template_redirect to redirect unauthorized visitors. Throughout this initialization, the plugin uses WordPress hooks (actions and filters) to integrate functionality.”
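To make the mechanisms those reports name a little more concrete, here’s a minimal sketch of the standard WordPress hardening patterns they describe: a file-top ABSPATH guard, a capability-gated admin page, a nonce-checked AJAX handler, and a template_redirect login gate. This is illustrative boilerplate, not code pulled from my plugin; everything other than the standard WordPress functions is a placeholder.

```php
<?php
// Illustrative only: generic WordPress hardening patterns of the kind the
// models describe. Not code from the My Private Site repo.

// File-top guard: bail out if the file is loaded outside of WordPress.
if ( ! defined( 'ABSPATH' ) ) {
	exit;
}

// Capability check: only register the settings page for users with manage_options.
add_action( 'admin_menu', function () {
	add_options_page(
		'Example Settings',   // page title (placeholder)
		'Example',            // menu title (placeholder)
		'manage_options',     // capability required to see the page
		'example-settings',   // menu slug (placeholder)
		function () {
			echo '<div class="wrap"><h1>Example Settings</h1></div>';
		}
	);
} );

// Nonce verification in an AJAX handler before touching any state.
add_action( 'wp_ajax_example_save', function () {
	check_ajax_referer( 'example_save_nonce', 'nonce' );
	if ( ! current_user_can( 'manage_options' ) ) {
		wp_send_json_error( 'Insufficient permissions', 403 );
	}
	$value = sanitize_text_field( wp_unslash( $_POST['value'] ?? '' ) );
	update_option( 'example_value', $value );
	wp_send_json_success();
} );

// Front-end enforcement: redirect anonymous visitors to the login page.
add_action( 'template_redirect', function () {
	if ( ! is_user_logged_in() && ! is_admin() ) {
		auth_redirect();
	}
} );
```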
Licensing and update mechanism, according to the models
GPT-5 Pro didn’t just describe the system; it walked through the process in sequential operational steps, almost like a short runbook you could hand to a developer or QA tester. GPT-5 confirms the architecture but abstracts the plumbing. GPT-5 Thinking adds a helpful “how add-ons plug into the Licenses tab” detail. o3 largely leaves licensing internals on the cutting room floor in favor of a fairly unhelpful modernization critique.
GPT-5 Pro: Explains it step-by-step
“The core plugin provides utility functions to get and store license keys in a centralized option (jr_ps_licenses) and to contact the EDD license server for validation. Each extension plugin defines its own updater using EDD_SL_Plugin_Updater, passing the current version, the license key from the centralized store, and the EDD store URL. The core plugin’s UI has a ‘Licenses’ tab, and extensions inject their own license fields via filters.”
GPT-5: Conceptual, but accurate
“License integration: The core plugin centralizes license management … and the add-ons piggyback on the core’s licensing mechanism, integrating their license fields into the core plugin’s interface.”
o3: Barely mentions this topic at all
The o3 report spends most of its time on modernization and architecture. It discusses configuration and update behavior but does not walk through option keys, updater classes, or the Licenses UI wiring with the same procedural detail as GPT-5 and GPT-5 Pro. So there’s nothing here to quote as a demonstration.
GPT-5 Thinking: Good UI and extensibility observation
“The add-ons heavily rely on hooks provided by core or WordPress: They use add_filter/add_action calls to insert their logic … and use WordPress action hooks to integrate their license fields into the Licenses tab that the core plugin triggers when building the Licenses tab.”
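For readers who haven’t wired up Easy Digital Downloads licensing before, here’s roughly what the pattern the reports describe looks like: an add-on pulls its key from the centralized jr_ps_licenses option, hands it to EDD_SL_Plugin_Updater, and registers its license field through a filter. Aside from those two names, which come straight from the reports, everything here (the filter name, store URL, field names) is a placeholder based on EDD’s published sample updater code, not my actual implementation.

```php
<?php
// Illustrative only: how an EDD-licensed add-on might wire its updater to a
// license key stored in the core plugin's centralized option. Names other than
// jr_ps_licenses and EDD_SL_Plugin_Updater are hypothetical placeholders.

if ( ! defined( 'ABSPATH' ) ) {
	exit;
}

add_action( 'admin_init', function () {
	if ( ! class_exists( 'EDD_SL_Plugin_Updater' ) ) {
		return; // the updater class ships with the add-on or EDD SL plugin
	}

	// Pull this add-on's key out of the centralized license store.
	$licenses = get_option( 'jr_ps_licenses', array() );
	$key      = isset( $licenses['example_addon'] ) ? $licenses['example_addon'] : '';

	// Argument names follow EDD's published sample updater code.
	new EDD_SL_Plugin_Updater(
		'https://example-store.com',   // EDD store URL (placeholder)
		__FILE__,                      // main plugin file of this add-on
		array(
			'version'   => '1.2.3',
			'license'   => $key,
			'item_name' => 'Example Add-On',
			'author'    => 'Example Author',
		)
	);
} );

// Inject this add-on's license field into the core plugin's Licenses tab.
// The filter name is a placeholder; the reports only say it is filter-based.
add_filter( 'jr_ps_license_fields', function ( $fields ) {
	$fields['example_addon'] = array( 'label' => 'Example Add-On License Key' );
	return $fields;
} );
```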
State management, according to the models
Both GPT-5 Pro and GPT-5 explicitly pointed out how my code uses “one option array + prune + no-op writes,” which is a WordPress best practice for code maintainability. Both o3 and GPT-5 Thinking describe the lifecycle and effects (what’s initialized, what loads when) rather than the exact option structure.
GPT-5 Pro: Looks at specific storage pattern
“Settings are stored in a single serialized option … initialization routines add default keys, prune deprecated ones, and only update the option in the database if there is an actual change, avoiding unnecessary writes.”
GPT-5: Also looks at storage pattern, but more generally
“State Management: Plugin settings are stored in WordPress options as a central settings array and the code ensures defaults are applied while removing deprecated ones on each load, but only writes to the database when changes occur.”
o3: Identifies intent and behavior, but doesn’t discuss internals
“The main plugin initializes defaults (installed version, first-run timestamp, etc.). On each run it ensures these options exist and, if the privacy feature is disabled, the enforcement hook is not added.”
GPT-5 Thinking: Discusses basic flow and modules
“Module includes: includes admin and common modules in the back-end; on the front-end it retrieves the saved privacy setting and, when enabled, loads enforcement logic (e.g., in template_redirect). It registers a deactivation hook to clean up on deactivation (e.g., deleting a flag option).”
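If you haven’t seen the “one option array + prune + no-op writes” pattern before, this is roughly what it looks like. Again, this is a generic sketch with placeholder option and key names, not my plugin’s actual internals.

```php
<?php
// Illustrative only: a single serialized settings option that gets defaults
// added, deprecated keys pruned, and is written back only when it changes.

if ( ! defined( 'ABSPATH' ) ) {
	exit;
}

function example_init_settings() {
	$defaults = array(
		'private_site'  => false,
		'landing_page'  => 'same',
		'installed_ver' => '1.0.0',
	);
	$deprecated = array( 'old_redirect_mode' ); // keys to prune

	$saved    = get_option( 'example_settings', array() );
	$settings = $saved;

	// Add any missing defaults without clobbering saved values.
	foreach ( $defaults as $key => $value ) {
		if ( ! array_key_exists( $key, $settings ) ) {
			$settings[ $key ] = $value;
		}
	}

	// Prune keys that no longer exist in the current version.
	foreach ( $deprecated as $key ) {
		unset( $settings[ $key ] );
	}

	// No-op write guard: only hit the database if something actually changed.
	if ( $settings !== $saved ) {
		update_option( 'example_settings', $settings );
	}

	return $settings;
}
add_action( 'init', 'example_init_settings' );
```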
What does this mean for GPT-5?
I was unimpressed with GPT-5 when it came to my coding tests. It failed half of them, an unprecedentedly poor result for a model line that had previously been the gold standard at passing my coding tests.
But GPT-5 was quite impressive in its analysis of the GitHub repository. It could be a powerful tool for onboarding new programmers, for developers inheriting an existing codebase, or simply for getting back up to speed on a project that’s been untouched for a while.
Also: How I test an AI chatbot’s coding ability – and you can, too
The GPT-4-generation o3 model is known to be a strong reasoning model, which is why it has been the basis for ChatGPT Deep Research. But GPT-5 was able to combine breadth and detail, which is where o3 and GPT-4o fell short in previous tests.
The older models did give accurate summaries and useful suggestions, but they missed interconnections. For example, the older models were never able to show how UI flows, licensing, and update mechanisms work together.
Even the base version of GPT-5 was able to identify cross-cutting concerns without additional prompting. Repository structure, backward compatibility, performance characteristics, and state management patterns all appeared in the first draft. Trying to get GPT-4 to span subjects is often an exercise in deep frustration.
I found GPT-5’s ability to understand and explain a complex interconnected system like my security product, all in one pass, to be a substantial improvement over the GPT-4 generation.
Is GPT-5 Pro worth $200/mo?
Maybe. If you’re in a real rush to get to know a project and want as much of a data dump as possible as quickly as possible, yes. If you’re operating on a big programming budget and $200/mo doesn’t matter to you, yes.
But I find that cost hard to bear, especially when I have to subscribe to a wide range of AI services to evaluate them. So, now that I’m nearing the end of my one-month test of Pro-level activities, I’m planning on downgrading back to the $20/mo Plus plan.
Also: How to use GPT-5 in VS Code with GitHub Copilot
Pro’s edge over GPT-5 wasn’t about knowing more facts; it was about delivering those facts in a form you can act on immediately. The Pro report didn’t just explain that security looked good; it cited the exact guards and checks in the code. It didn’t just say licensing was centralized; it mapped the exact functions and database options involved.
Again, if you’re on a time crunch, you might consider Pro. But I also think you can get base GPT-5 to produce responses with the kind of detail the Pro report delivered, simply by using better prompting.
That’s next…
How to get Pro-level results from base GPT-5
I fed both the GPT-5 and GPT-5 Pro reports into GPT-5 and asked it for a prompt that would push the base-level GPT-5 to give GPT-5 Pro comprehensiveness as a result. This is that prompt, which you should add to any query where you want more complete coding information:
**High-Specificity Technical Mode:** In your answer, combine complete high-level coverage with exhaustive implementation-level detail.
- Always name exact constants, functions, classes, hooks, option names, database tables, file paths, and build tools where possible, quoting them exactly from the code or material provided.
- For every claim, explain why it’s true and how you can tell (include reasoning tied to the evidence).
- For each improvement you suggest, make it actionable and reference where in the codebase it applies.
- Do not generalize when specifics are available.
- Structure the output so a developer could use it directly to verify findings or implement recommendations.
This worked fantastically well. It took GPT-5 12 minutes to produce a 15,477-word document, complete with analysis and code blocks. For example, it describes how value initialization is done, and then shows the code that accomplishes it.
I think you could fine-tune this prompt and get Pro-level results without having to pay the $200/mo fee. I’m certainly going to tinker with this idea, possibly using GPT-5 to refine the specifications in the prompt for different areas I want to delve deeply into. I’ll let you know how it goes.
See for yourself
I had some difficulty setting up sharing for each of these long reports, so I just copied the results into Google Docs and shared them. Here are the links if you want to look at any of these reports.
You are welcome to dig into these documents and learn how my project is structured. While you may or may not care about my project, it’s instructive to see how the various models perform. While you can read the reports, my actual repo is restricted since it’s my private development repository.
What about you? Have you tried using GPT-5 or GPT-5 Pro to analyze your own code? How did its insights compare to earlier models like GPT-4 or o3? Do you think the $200/month Pro tier is worth it for the extra precision, or could you get by with better prompts in the base version? Have you found AI code analysis useful for onboarding, refactoring, or improving security? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.