
Kubernetes, cloud-native computing’s engine, is getting turbocharged for AI




ZDNET’s key takeaways

  • This program ensures users can migrate AI workloads between Kubernetes distributions.
  • Kubernetes will finally support rollbacks for returning to a working cluster if something goes wrong.
  • Several other improvements will make Kubernetes even friendlier for AI workloads.

Over a decade ago, there were many alternatives to Kubernetes for container orchestration. Today, unless you’ve been in cloud-native computing for a long, long time, you’d be hard-pressed to name any of them. That’s because Kubernetes was clearly the best choice. 

Back then, containers, thanks to Docker, were the hot new technology. Fast-forward a decade, and the technology that has everyone worked up is AI. To that end, the Cloud Native Computing Foundation (CNCF) launched the Certified Kubernetes AI Conformance Program (CKACP) at KubeCon North America 2025 in Atlanta as a standardized way of deploying AI workloads on Kubernetes clusters. 

A safe, universal platform for AI workloads

CKACP’s goal is to create community-defined, open standards for consistently and reliably running AI workloads across different Kubernetes environments. 

Also: Why even a US tech giant is launching ‘sovereign support’ for Europe now

CNCF CTO Chris Aniszczyk said, “This conformance program will create shared criteria to ensure AI workloads behave predictably across environments. It builds on the same successful community-driven process we’ve used with Kubernetes to help bring consistency across 100-plus Kubernetes systems as AI adoption scales.”

Specifically, the initiative is designed to:

  • Ensure portability and interoperability for AI and machine learning (ML) workloads across public clouds, private infrastructure, and hybrid environments, enabling organizations to avoid vendor lock-in when moving AI workloads wherever needed.
  • Reduce fragmentation by setting a shared baseline of capabilities and configurations that platforms must support, making it easier for enterprises to adopt and scale AI on Kubernetes with confidence.
  • Give vendors and open-source contributors a clear target for compliance to ensure their technologies work together and support production-ready AI deployments.
  • Enable end users to rapidly innovate, with the reassurance that certified platforms have implemented best practices for resource management, GPU integration, and key AI infrastructure needs, tested and validated by the CNCF.
  • Foster a trusted, open ecosystem for AI development, where standards make it possible to efficiently scale, optimize, and manage AI workloads as usage increases across industries.

In short, the initiative is focused on providing both enterprises and vendors with a common, tested framework to ensure AI runs reliably, securely, and efficiently on any certified Kubernetes platform.

If this approach sounds familiar, well, it should, because it’s based on the CNCF’s successful Certified Kubernetes Conformance Program. It’s due to that 2017 plan and agreement that, if you’re not happy with, say, Red Hat OpenShift, you can pick up your containerized workloads and cart them over to Mirantis Kubernetes Engine or Amazon Elastic Kubernetes Service without worrying about incompatibilities. This portability, in turn, is why Kubernetes is the foundation for many hybrid clouds.

Also: Coding with AI? My top 5 tips for vetting its output – and staying out of trouble

With 58% of organizations already running AI workloads on Kubernetes, CNCF’s new program is expected to significantly streamline how teams deploy, manage, and innovate in AI. By offering common test criteria, reference architectures, and validated integrations for GPU and accelerator support, the program aims to make AI infrastructure more robust and secure across multi-vendor, multi-cloud environments.
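The program’s test criteria weren’t spelled out in detail at launch, so purely as a flavor of what one automated capability check might look like, here’s a minimal, hypothetical probe written in Go with client-go. It assumes a reachable cluster and the NVIDIA device plugin’s “nvidia.com/gpu” extended-resource name; a real conformance suite would run many such checks.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (assumption: ~/.kube/config points at the
	// cluster under test).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Pass if at least one node advertises schedulable GPUs.
	for _, node := range nodes.Items {
		if gpus, ok := node.Status.Allocatable["nvidia.com/gpu"]; ok && !gpus.IsZero() {
			fmt.Printf("node %s exposes %s allocatable GPU(s)\n", node.Name, gpus.String())
			return
		}
	}
	fmt.Println("no node advertises nvidia.com/gpu; this check would fail")
}
```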

As Jago Macleod, Kubernetes & GKE engineering director at Google Cloud, said at KubeCon, “At Google Cloud, we’ve certified for Kubernetes AI Conformance because we believe consistency and portability are essential for scaling AI. By aligning with this standard early, we’re making it easier for developers and enterprises to build AI applications that are production-ready, portable, and efficient, without reinventing infrastructure for every deployment.”

Understanding Kubernetes improvements

That was far from the only thing Macleod had to say about Kubernetes’s future. Google and the CNCF have other plans for the market-leading container orchestrator. Key improvements coming include rollback support, the ability to skip updates, and new low-level controls for GPUs and other AI-specific hardware.

In his keynote speech, Macleod explained that, for the first time, Kubernetes users have a reliable minor-version rollback feature, which means clusters can be safely reverted to a known-good state after an upgrade. This capability ends the long-standing “one-way street” problem of Kubernetes control-plane upgrades, and it will sharply reduce the risk of adopting critical new features or urgent security patches.

Alongside this improvement, Kubernetes users can now skip specific updates. This approach gives administrators more flexibility and control when planning version migrations or responding to production incidents.
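Neither capability maps to a single kubectl command; control-plane rollback is mediated by the distribution or managed service. Purely as a conceptual sketch of the upgrade lifecycle these features enable (all names hypothetical, and real clusters must still respect version-skew rules), consider:

```go
package main

import "fmt"

// Cluster models only the two ideas at play: a known-good snapshot taken
// before each upgrade, and a skip list for problematic releases.
type Cluster struct {
	Version   string
	KnownGood string
}

// Upgrade moves to the target version unless the migration plan skips it,
// remembering the current version as the rollback point.
func (c *Cluster) Upgrade(target string, skip map[string]bool) {
	if skip[target] {
		fmt.Println("skipping", target, "per migration plan")
		return
	}
	c.KnownGood = c.Version
	c.Version = target
	fmt.Println("upgraded to", target)
}

// Rollback reverts the control plane to the last known-good version.
func (c *Cluster) Rollback() {
	c.Version = c.KnownGood
	fmt.Println("rolled back to", c.Version)
}

func main() {
	c := &Cluster{Version: "1.33", KnownGood: "1.33"}
	skip := map[string]bool{"1.34": true} // admins opt out of one release
	c.Upgrade("1.34", skip)               // skipped
	c.Upgrade("1.35", skip)               // applied
	c.Rollback()                          // back to 1.33 if 1.35 misbehaves
}
```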

Besides the CKACP, Kubernetes is being rearchitected to support AI workload demands natively. This support means Kubernetes will give users granular control over hardware like GPUs, TPUs, and custom accelerators. This capability also addresses the enormous diversity and scale requirements of modern AI hardware. 
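Today, the simplest form of that control is requesting accelerators through extended resources; the newer dynamic resource allocation (DRA) APIs go further, letting workloads select specific device types. As a minimal sketch of the established pattern (the container image is hypothetical, and the “nvidia.com/gpu” resource name assumes the NVIDIA device plugin is installed):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-training-job"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "example.com/trainer:latest", // hypothetical image
				Resources: corev1.ResourceRequirements{
					// The device plugin advertises GPUs as an extended
					// resource; the scheduler places the pod accordingly.
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}

	created, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created pod:", created.Name)
}
```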

Also: SUSE Enterprise Linux 16 is here, and its killer feature is digital sovereignty

Additionally, new APIs and open-source features, including Agent Sandbox and Multi-Tier Checkpointing, were announced at the event. These features will further accelerate inference, training, and agentic AI operations within clusters. Innovations like node-level resource allocation, dynamic GPU provisioning, and scheduler optimizations for AI hardware are becoming foundational for both researchers and enterprises running multi-tenant clusters.

Agent Sandbox is an open-source framework and controller that enables the management of isolated, secure environments, also known as sandboxes, designed for running stateful, singleton workloads, such as autonomous AI agents, code interpreters, and development tools. A minimal sketch of how a sandbox might be declared appears below.
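Because Agent Sandbox is described as a controller, sandboxes are presumably declared as custom resources and created like any other Kubernetes object. The group, version, and spec fields below are assumptions for illustration only; consult the project for the actual schema.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Hypothetical group/version/resource for the Sandbox CRD.
	gvr := schema.GroupVersionResource{
		Group:    "agents.example.io",
		Version:  "v1alpha1",
		Resource: "sandboxes",
	}

	sandbox := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "agents.example.io/v1alpha1",
		"kind":       "Sandbox",
		"metadata":   map[string]interface{}{"name": "code-interpreter"},
		"spec": map[string]interface{}{
			// Hypothetical field: one isolated, stateful instance.
			"image": "example.com/interpreter:latest",
		},
	}}

	created, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), sandbox, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created sandbox:", created.GetName())
}
```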

Today, Multi-Tier Checkpointing in Kubernetes is primarily available on GKE. In the future, this mechanism will enable the reliable storage and management of checkpoints during the training of large-scale ML models.

Also: Enterprises are not prepared for a world of malicious AI agents

Here’s a quick sketch of how Multi-Tier Checkpointing works: during training, checkpoints are written first to fast node-local storage, replicated to peer nodes, and periodically backed up to durable object storage. After a failure, a job restores from the fastest tier that still holds a valid checkpoint.
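The code below is a conceptual Go sketch of that tiering logic, not the GKE implementation: every checkpoint fans out to each tier, and restore walks the tiers in order of speed. Only a local-disk tier is implemented here; peer-replica and object-storage tiers would satisfy the same interface.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Tier is one checkpoint storage layer: local disk, a peer replica, or
// durable object storage.
type Tier interface {
	Save(step int, data []byte) error
	Load(step int) ([]byte, error)
}

// LocalTier writes checkpoints to node-local disk (fastest, least durable).
type LocalTier struct{ Dir string }

func (t LocalTier) path(step int) string {
	return filepath.Join(t.Dir, fmt.Sprintf("ckpt-%06d", step))
}

func (t LocalTier) Save(step int, data []byte) error {
	return os.WriteFile(t.path(step), data, 0o644)
}

func (t LocalTier) Load(step int) ([]byte, error) {
	return os.ReadFile(t.path(step))
}

// Checkpointer fans each checkpoint out to every tier and restores from the
// first (fastest) tier that still holds it.
type Checkpointer struct{ Tiers []Tier }

func (c Checkpointer) Save(step int, data []byte) {
	for _, t := range c.Tiers {
		if err := t.Save(step, data); err != nil {
			fmt.Println("tier save failed (tolerated):", err) // other tiers still protect the job
		}
	}
}

func (c Checkpointer) Restore(step int) ([]byte, error) {
	for _, t := range c.Tiers {
		if data, err := t.Load(step); err == nil {
			return data, nil // nearest surviving copy wins: fastest recovery
		}
	}
	return nil, fmt.Errorf("no tier holds checkpoint %d", step)
}

func main() {
	c := Checkpointer{Tiers: []Tier{LocalTier{Dir: os.TempDir()}}}
	c.Save(100, []byte("model-state"))
	state, err := c.Restore(100)
	if err != nil {
		panic(err)
	}
	fmt.Println("restored:", string(state))
}
```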

The benefit for AI and ML workloads is that Multi-Tier Checkpointing enables quick resumption of training from the last checkpoint without losing significant progress. The mechanism also provides fault tolerance, ensuring that checkpoints are safely stored and replicated so training jobs can survive frequent interruptions.

On top of all that, Multi-Tier Checkpointing provides scalability, supporting large distributed training jobs running on thousands of nodes. Finally, the feature works with major AI frameworks, such as JAX and PyTorch, and integrates with their checkpointing mechanisms.

With rollbacks, selective update skipping, and production-grade AI hardware management, Kubernetes is poised to power the world’s most demanding AI and enterprise platforms. The CNCF’s launch of the Kubernetes AI Conformance program is further cementing the ecosystem’s role in setting standards for interoperability, reliability, and performance for the near future of cloud-native AI.

Also: 6 essential rules for unleashing AI on your software development process – and the No. 1 risk

Kubernetes’s first decade was all about moving IT from bare metal and virtual machines (VMs) to containers. Its next decade will be defined by its ability to manage AI at a planetary scale by providing safety, speed, and flexibility for a new class of workloads.


