Job Scope:
Build and evolve Kubernetes as a core AI infrastructure platform.
- Extending Kubernetes, not just operating it
- Designing GPU-aware scheduling, isolation, and lifecycle management
- Building reliable, multi-tenant AI clusters that do not break under extreme load
Total/Relevant Experience:
6+ years of experience
Key Responsibilities:
1. Kubernetes Platform Architecture
- Design and evolve Kubernetes clusters optimized for:
  - GPU-heavy workloads
  - multi-node, gang-scheduled training jobs
  - long-running and high-throughput inference
- Own control-plane architecture:
  - etcd sizing and tuning
  - API server scalability
  - scheduler performance under high churn
- Define reference cluster architectures for:
  - dedicated training clusters
  - shared multi-tenant clusters
2. GPU-Aware Scheduling & Workload Semantics
- Build or extend scheduling mechanisms for:
  - GPU topology awareness
  - NUMA and locality sensitivity
  - anti-affinity for noisy neighbors
- Integrate and deeply understand:
  - NVIDIA GPU Operator
  - device plugins
  - MIG / vGPU strategies (where applicable)
- Ensure Kubernetes scheduling decisions align with real ML workload behavior, not just resource requests.
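The gang-scheduling semantics referenced above reduce to an all-or-nothing placement decision: a training job either gets every replica scheduled or none of them. A minimal sketch in Go, assuming each replica must fit on a single node; the `node` type and `canGangSchedule` function are illustrative names, not part of any real scheduler API.

```go
package main

import "fmt"

// node describes free GPU capacity on a worker (hypothetical type,
// not a real Kubernetes API object).
type node struct {
	name     string
	freeGPUs int
}

// canGangSchedule reports whether a job of `replicas` pods, each needing
// `gpusPerPod` GPUs, can be placed all at once. Gang semantics: place
// every replica or none. A replica cannot span nodes.
func canGangSchedule(nodes []node, replicas, gpusPerPod int) bool {
	placed := 0
	for _, n := range nodes {
		placed += n.freeGPUs / gpusPerPod // replicas this node can host
	}
	return placed >= replicas
}

func main() {
	cluster := []node{
		{"gpu-node-a", 8},
		{"gpu-node-b", 6},
		{"gpu-node-c", 3},
	}
	// 4 replicas x 4 GPUs: nodes host 2 + 1 + 0 = 3 replicas -> reject.
	fmt.Println(canGangSchedule(cluster, 4, 4)) // false
	// 4 replicas x 3 GPUs: nodes host 2 + 2 + 1 = 5 replicas -> admit.
	fmt.Println(canGangSchedule(cluster, 4, 3)) // true
}
```

This is why gang scheduling must look past raw resource requests: 17 free GPUs exist in both cases above, but fragmentation across nodes decides whether the gang fits.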
3. Platform Extensions & Controllers
- Develop custom controllers/operators to:
  - manage cluster lifecycle
  - enforce policy and quotas
  - automate remediation (node drain, GPU quarantine, rescheduling)
- Design internal APIs that abstract:
  - complex GPU and networking configurations
  - cluster upgrades and maintenance workflows
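The automated remediation described above (GPU quarantine, node drain) typically boils down to a small decision function the controller evaluates on each reconcile. A hedged Go sketch: `gpuHealth`, the thresholds, and the action names are invented for illustration and do not correspond to DCGM or any real operator API.

```go
package main

import "fmt"

// gpuHealth is a simplified per-GPU telemetry view; field names and
// thresholds are illustrative, not a real DCGM/device-plugin schema.
type gpuHealth struct {
	xidErrors int // fatal driver errors seen in the window
	eccDouble int // uncorrectable ECC errors
}

type action string

const (
	actionNone       action = "none"
	actionQuarantine action = "quarantine" // taint the device out of scheduling
	actionDrainNode  action = "drain-node" // evict pods and replace the node
)

// remediateGPU: uncorrectable ECC or repeated XID errors quarantine a device.
func remediateGPU(g gpuHealth) action {
	if g.eccDouble > 0 || g.xidErrors >= 3 {
		return actionQuarantine
	}
	return actionNone
}

// remediateNode escalates: if more than half of a node's GPUs are
// quarantined, drain the whole node instead of limping along.
func remediateNode(gpus []gpuHealth) action {
	bad := 0
	for _, g := range gpus {
		if remediateGPU(g) == actionQuarantine {
			bad++
		}
	}
	switch {
	case bad*2 > len(gpus):
		return actionDrainNode
	case bad > 0:
		return actionQuarantine
	default:
		return actionNone
	}
}

func main() {
	node := []gpuHealth{{xidErrors: 5}, {eccDouble: 1}, {}}
	fmt.Println(remediateNode(node)) // drain-node: 2 of 3 GPUs are bad
}
```

In a real controller this decision would run inside a reconcile loop and emit taints and drain events rather than return values.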
4. Multi-Tenancy, Isolation & Security
- Design strong tenant isolation using:
  - namespaces, RBAC, and admission controllers
  - network policies (CNI-level enforcement)
  - GPU- and node-level isolation strategies
- Work with security engineers to:
  - enforce least privilege
  - support enterprise compliance requirements
  - ensure auditability of platform actions
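Quota and policy enforcement for tenant isolation usually lands in an admission webhook, where the decision itself is simple bookkeeping. A Go sketch under assumed semantics (per-namespace GPU quotas, deny-by-default for unknown tenants); `tenantQuota` and `admit` are illustrative names, not the Kubernetes ResourceQuota API.

```go
package main

import "fmt"

// tenantQuota tracks GPU quota and usage per namespace; illustrative
// type, not the Kubernetes ResourceQuota object.
type tenantQuota struct {
	limitGPUs int
	usedGPUs  int
}

// admit mirrors the decision a quota admission webhook would make:
// allow the pod only if the tenant stays within its GPU quota.
// Unknown tenants are denied by default (least privilege).
func admit(quotas map[string]*tenantQuota, namespace string, requestGPUs int) bool {
	q, ok := quotas[namespace]
	if !ok {
		return false
	}
	if q.usedGPUs+requestGPUs > q.limitGPUs {
		return false
	}
	q.usedGPUs += requestGPUs
	return true
}

func main() {
	quotas := map[string]*tenantQuota{
		"team-a": {limitGPUs: 8},
	}
	fmt.Println(admit(quotas, "team-a", 4)) // true  (4 of 8 GPUs used)
	fmt.Println(admit(quotas, "team-a", 8)) // false (would exceed the limit)
	fmt.Println(admit(quotas, "team-b", 1)) // false (no quota defined)
}
```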
5. Observability, Reliability & Debuggability
- Define observability standards for:
  - control-plane health
  - scheduling latency
  - GPU and node lifecycle events
- Expose clear signals to SRE and operations teams.
- Ensure every platform action is traceable, debuggable, and auditable.
Must-Have Skills:
- Deep Kubernetes internals (scheduler, etcd, control plane)
- Go-based controller development
- GPU operators and device plugins
- Distributed systems fundamentals
Good-to-Have Skills:
- Experience with multi-node GPU environments
- Hands-on experience with distributed training frameworks
- Working knowledge of the NVIDIA ecosystem (TensorRT, Triton, NeMo)
- Experience deploying and operating AI models at scale on Kubernetes clusters
- Familiarity with Slurm or other workload schedulers
Qualifications Criteria:
- B.E./B.Tech or any relevant degree