Kubernetes v1.34: Pods Report DRA Resource Health (kubernetes.io)

0 points | 1 day ago

🤖 AI Summary

Kubernetes v1.34 introduces an alpha feature that surfaces the health of specialized devices (GPUs, TPUs, FPGAs) directly in a Pod's .status, making hardware-related failures easier to diagnose for AI/ML and other high-performance workloads. Controlled by the ResourceHealthStatus feature gate, the change extends KEP-4680 to Dynamic Resource Allocation (DRA) drivers. Through the allocatedResourcesStatus field in v1.ContainerStatus, operators and automation can tell whether a misbehaving device, rather than application code, is the root cause of a Pod failure.

Technically, DRA drivers can implement a new gRPC service, DRAResourceHealth (in dra-health/v1alpha1), that streams device health updates to the Kubelet over the server-streaming NodeWatchResources RPC. Each device is reported as Healthy, Unhealthy, or Unknown. The Kubelet's DRAPluginManager opens long-lived streams, the DRA manager persists updates in a healthInfoCache (which survives restarts), and the updates propagate to the status of the affected Pods.

To try it, enable the ResourceHealthStatus feature gate and use a DRA driver that implements the v1alpha1 health service. Enhancements planned before Beta include human-readable health messages, configurable timeouts, and better post-mortem capture. These are meant to further reduce downtime and enable automated reactions to unhealthy devices, such as de-scheduling.
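
As a rough illustration of how automation might consume the allocatedResourcesStatus field described above, here is a minimal Python sketch that scans a Pod's JSON for devices not reported as Healthy. The Pod name, claim name, and device IDs are hypothetical. The nested field names (name, resources, resourceID, health) are assumed to mirror the upstream v1.ContainerStatus types, so check them against your cluster's API before relying on them.

```python
import json

# Hypothetical Pod JSON, trimmed to the fields this sketch reads.
# All names and IDs are made up; the field layout is an assumption
# about how allocatedResourcesStatus is serialized.
POD_JSON = """
{
  "metadata": {"name": "trainer-0"},
  "status": {
    "containerStatuses": [
      {
        "name": "trainer",
        "allocatedResourcesStatus": [
          {
            "name": "claim:gpu-claim",
            "resources": [
              {"resourceID": "example.com/gpu-0", "health": "Healthy"},
              {"resourceID": "example.com/gpu-1", "health": "Unhealthy"}
            ]
          }
        ]
      },
      {"name": "sidecar"}
    ]
  }
}
"""


def unhealthy_devices(pod):
    """Return (container, resource, device ID, health) for each device not reported Healthy."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # Containers without allocated devices simply lack the field.
        for res in cs.get("allocatedResourcesStatus", []):
            for dev in res.get("resources", []):
                # Treat a missing health value as Unknown rather than Healthy.
                health = dev.get("health", "Unknown")
                if health != "Healthy":
                    findings.append(
                        (cs.get("name"), res.get("name"), dev.get("resourceID"), health)
                    )
    return findings


if __name__ == "__main__":
    for container, resource, device, health in unhealthy_devices(json.loads(POD_JSON)):
        print(f"{container}: {resource} device {device} is {health}")
```

In practice the input would come from something like `kubectl get pod <name> -o json` rather than an inline string. Reporting Unknown alongside Unhealthy is a design choice of this sketch, not something the feature prescribes.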
