ZDNET’s key takeaways
- A new conformance program ensures users can migrate AI workloads between Kubernetes distributions.
- Kubernetes will finally support rollbacks for returning to a working cluster if something goes wrong.
- Several other improvements will make Kubernetes even friendlier for AI workloads.
Over a decade ago, there were many alternatives to Kubernetes for container orchestration. Today, unless you’ve been in cloud-native computing for a long, long time, you’d be hard-pressed to name any of them. That’s because Kubernetes was clearly the best choice.
Back then, containers, thanks to Docker, were the hot new technology. Fast-forward a decade, and the technology that has everyone worked up is AI. To that end, the Cloud Native Computing Foundation (CNCF) launched the Certified Kubernetes AI Conformance Program (CKACP) at KubeCon North America 2025 in Atlanta as a standardized way of deploying AI workloads on Kubernetes clusters.
A safe, universal platform for AI workloads
CKACP’s goal is to create community-defined, open standards for consistently and reliably running AI workloads across different Kubernetes environments.
CNCF CTO Chris Aniszczyk said, “This conformance program will create shared criteria to ensure AI workloads behave predictably across environments. It builds on the same successful community-driven process we’ve used with Kubernetes to help bring consistency across 100-plus Kubernetes systems as AI adoption scales.”
Specifically, the initiative is designed to:
- Ensure portability and interoperability for AI and machine learning (ML) workloads across public clouds, private infrastructure, and hybrid environments, enabling organizations to avoid vendor lock-in when moving AI workloads wherever needed.
- Reduce fragmentation by setting a shared baseline of capabilities and configurations that platforms must support, making it easier for enterprises to adopt and scale AI on Kubernetes with confidence.
- Give vendors and open-source contributors a clear target for compliance to ensure their technologies work together and support production-ready AI deployments.
- Enable end users to rapidly innovate, with the reassurance that certified platforms have implemented best practices for resource management, GPU integration, and key AI infrastructure needs, tested and validated by the CNCF.
- Foster a trusted, open ecosystem for AI development, where standards make it possible to efficiently scale, optimize, and manage AI workloads as usage increases across industries.
In short, the initiative is focused on providing both enterprises and vendors with a common, tested framework to ensure AI runs reliably, securely, and efficiently on any certified Kubernetes platform.
If this approach sounds familiar, well, it should, because it’s based on the CNCF’s successful Certified Kubernetes Conformance Program. It’s due to that 2017 plan and agreement that, if you’re not happy with, say, Red Hat OpenShift, you can pick up your containerized workloads and cart them over to Mirantis Kubernetes Engine or Amazon Elastic Kubernetes Service without worrying about incompatibilities. This portability, in turn, is why Kubernetes is the foundation for many hybrid clouds.
With 58% of organizations already running AI workloads on Kubernetes, CNCF’s new program is expected to significantly streamline how teams deploy, manage, and innovate in AI. By offering common test criteria, reference architectures, and validated integrations for GPU and accelerator support, the program aims to make AI infrastructure more robust and secure across multi-vendor, multi-cloud environments.
As Jago Macleod, Kubernetes & GKE engineering director at Google Cloud, said at KubeCon, “At Google Cloud, we’ve certified for Kubernetes AI Conformance because we believe consistency and portability are essential for scaling AI. By aligning with this standard early, we’re making it easier for developers and enterprises to build AI applications that are production-ready, portable, and efficient, without reinventing infrastructure for every deployment.”
Understanding Kubernetes improvements
That was far from the only thing Macleod had to say about Kubernetes’s future. Google and the CNCF have other plans for the market-leading container orchestrator. Key improvements coming include rollback support, the ability to skip updates, and new low-level controls for GPUs and other AI-specific hardware.
In his keynote speech, Macleod explained that, for the first time, Kubernetes users have a reliable minor-version rollback feature, meaning clusters can be safely reverted to a known-good state after an upgrade. That ends the long-standing “one-way street” problem of Kubernetes control-plane upgrades, and rollbacks will sharply reduce the risk of adopting critical new features or urgent security patches.
Alongside this improvement, Kubernetes users can now skip specific updates. This approach gives administrators more flexibility and control when planning version migrations or responding to production incidents.
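To make that concrete, here’s a minimal Python sketch of the upgrade-then-rollback workflow these two changes enable. It is illustrative only: the upgrade and rollback helpers are hypothetical stand-ins for distribution-specific tooling, and the only real command it runs is the API server’s /readyz health probe via kubectl.

```python
# Hypothetical sketch of upgrade-with-rollback; not a real Kubernetes API.
import subprocess

def control_plane_healthy() -> bool:
    # Real probe: ask the API server's /readyz endpoint via kubectl.
    try:
        result = subprocess.run(
            ["kubectl", "get", "--raw", "/readyz"],
            capture_output=True, text=True,
        )
    except FileNotFoundError:  # kubectl not on PATH
        return False
    return result.returncode == 0 and "ok" in result.stdout

def upgrade_control_plane(target: str) -> None:
    # Hypothetical stand-in for a distribution-specific upgrade step,
    # e.g. `kubeadm upgrade apply <target>` on kubeadm clusters.
    print(f"upgrading control plane to {target}")

def rollback_control_plane(known_good: str) -> None:
    # Hypothetical stand-in for the new minor-version rollback.
    print(f"rolling back control plane to {known_good}")

def upgrade_with_rollback(known_good: str, target: str) -> None:
    upgrade_control_plane(target)
    if not control_plane_healthy():
        # Previously a one-way street; now the cluster can return
        # to its last known-good version.
        rollback_control_plane(known_good)

# Skipping an intermediate release just means choosing a target more
# than one minor version ahead, e.g. 1.32 -> 1.34.
upgrade_with_rollback(known_good="1.32", target="1.34")
```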
Besides the CKACP, Kubernetes is being rearchitected to support AI workload demands natively. This support means Kubernetes will give users granular control over hardware like GPUs, TPUs, and custom accelerators. This capability also addresses the enormous diversity and scale requirements of modern AI hardware.
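Today, the common baseline for that control is the extended resources exposed by device plugins. The hedged sketch below uses the official Kubernetes Python client (pip install kubernetes) to request one GPU through the NVIDIA device plugin’s nvidia.com/gpu resource name; it assumes a reachable cluster in your kubeconfig and a GPU-equipped node, and the finer-grained controls the rearchitecting promises go well beyond this baseline.

```python
# Minimal sketch: schedule a pod onto a GPU node via the extended
# resource published by the NVIDIA device plugin.
from kubernetes import client, config

config.load_kube_config()  # assumes a cluster in your kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi"],
                # TPUs and custom accelerators publish their own
                # resource names in the same way.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```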
Additionally, new APIs and open-source features, including Agent Sandbox and Multi-Tier Checkpointing, were announced at the event. These features will further accelerate inference, training, and agentic AI operations within clusters. Innovations like node-level resource allocation, dynamic GPU provisioning, and scheduler optimizations for AI hardware are becoming foundational for both researchers and enterprises running multi-tenant clusters.
Agent Sandbox is an open-source framework and controller that enables the management of isolated, secure environments, also known as sandboxes, designed for running stateful, singleton workloads, such as autonomous AI agents, code interpreters, and development tools. The main features of Agent Sandbox are:
- Isolation and security: Each sandbox is strongly isolated at both the kernel and network levels using technologies such as gVisor or Kata Containers, so it’s safe to run untrusted code (e.g., generated by large language models) without compromising the integrity of the host system or cluster.
- Declarative APIs: Users can declare sandbox environments and templates using Kubernetes-native resources (Sandbox, SandboxTemplate, SandboxClaim), enabling rapid, repeatable creation and management of isolated instances (see the sketch after this list).
- Scale and performance: Agent Sandbox supports thousands of concurrent, stateful sandboxes with fast, on-demand provisioning. This capability will be great for AI agent workloads, code execution, or persistent developer environments.
- Snapshot and recovery: On Google Kubernetes Engine (GKE), the Agent Sandbox can utilize Pod Snapshots for rapid checkpointing, hibernation, and instant resumption, dramatically reducing startup latency and optimizing resource usage for AI workloads.
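For a flavor of those declarative APIs, here’s a minimal sketch that creates a Sandbox object with the official Python client. The Sandbox kind is named in the announcement, but the API group and version (agents.x-k8s.io/v1alpha1) and the spec fields below are assumptions; check the Agent Sandbox project’s CRDs for the actual schema.

```python
# Hedged sketch: declare a Sandbox as a Kubernetes custom resource.
from kubernetes import client, config

config.load_kube_config()

sandbox = {
    "apiVersion": "agents.x-k8s.io/v1alpha1",  # assumed group/version
    "kind": "Sandbox",
    "metadata": {"name": "code-interpreter-1"},
    "spec": {
        # Illustrative spec: a pod template for the untrusted workload,
        # isolated via a gVisor runtime class.
        "podTemplate": {
            "spec": {
                "runtimeClassName": "gvisor",
                "containers": [
                    {"name": "interpreter", "image": "python:3.12-slim"}
                ],
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="agents.x-k8s.io",   # assumed
    version="v1alpha1",        # assumed
    namespace="default",
    plural="sandboxes",
    body=sandbox,
)
```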
Today, Multi-Tier Checkpointing is primarily available on GKE, with broader Kubernetes support to come. The mechanism enables the reliable storage and management of checkpoints during the training of large-scale ML models.
Here’s a quick look at how Multi-Tier Checkpointing works, with a code sketch after the list:
- Multiple storage tiers: Checkpoints are first stored in fast, local storage (such as in-memory volumes or local disk on a node) for quick access and fast recovery.
- Replication across nodes: The checkpoint data is replicated to peer nodes in the cluster to protect against node failures.
- Persistent cloud storage backup: Periodically, checkpoints are backed up to durable cloud storage to provide a reliable fallback in case of cluster-wide failures or cases when local copies are unavailable.
- Orchestrated management: The system automates checkpoint saving, replication, backup, and restoration, minimizing manual intervention during training.
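Here’s a conceptual Python sketch of that tiering logic. It is not GKE’s implementation; every path, interval, and helper in it is illustrative.

```python
# Conceptual sketch of multi-tier checkpointing: write fast and local
# first, fan out to a peer, and only periodically pay for durable
# cloud storage. All paths and helpers are illustrative.
import shutil
from pathlib import Path

LOCAL_TIER = Path("/tmp/ckpt/local")   # tier 1: node-local disk or RAM
PEER_TIER = Path("/tmp/ckpt/peer")     # tier 2: replica on a peer node
CLOUD_BACKUP_EVERY = 10                # tier 3: every Nth checkpoint

def upload_to_cloud(src: Path) -> None:
    # Placeholder for an object-storage upload.
    print(f"backing up {src.name} to durable cloud storage")

def save_checkpoint(step: int, payload: bytes) -> None:
    LOCAL_TIER.mkdir(parents=True, exist_ok=True)
    PEER_TIER.mkdir(parents=True, exist_ok=True)

    # Tier 1: fast local write, so restarts on the same node are cheap.
    local = LOCAL_TIER / f"step-{step}.ckpt"
    local.write_bytes(payload)

    # Tier 2: replicate to a peer so one node failure loses nothing.
    shutil.copy(local, PEER_TIER / local.name)

    # Tier 3: periodic durable backup for cluster-wide failures.
    if step % CLOUD_BACKUP_EVERY == 0:
        upload_to_cloud(local)

for step in range(1, 21):
    save_checkpoint(step, payload=b"model-state")
```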
The benefit for AI and ML workloads is that Multi-Tier Checkpointing enables quick resumption of training from the last checkpoint without losing significant progress. The mechanism also provides fault tolerance: by ensuring that checkpoints are safely stored and replicated, it protects training jobs from frequent interruptions.
On top of all that, Multi-Tier Checkpointing provides scalability, supporting large distributed training jobs running on thousands of nodes. Finally, the feature, of course, works with all major AI frameworks, such as JAX and PyTorch, and integrates with their checkpointing mechanisms.
With rollbacks, selective update skipping, and production-grade AI hardware management, Kubernetes is poised to power the world’s most demanding AI and enterprise platforms. The CNCF’s launch of the Kubernetes AI Conformance program is further cementing the ecosystem’s role in setting standards for interoperability, reliability, and performance for the near future of cloud-native AI.
Also: 6 essential rules for unleashing AI on your software development process – and the No. 1 risk
Kubernetes’s first decade was all about moving IT from bare metal and virtual machines (VMs) to containers. Its next decade will be defined by its ability to manage AI at a planetary scale by providing safety, speed, and flexibility for a new class of workloads.