Job Scope:
Build and evolve Kubernetes as a core AI infrastructure platform.
- Extending Kubernetes, not just operating it
- Designing GPU-aware scheduling, isolation, and lifecycle management
- Building reliable, multi-tenant AI clusters that do not break under extreme load
Total/Relevant Experience:
6+ years of experience
Key Responsibilities:
1. Kubernetes Platform Architecture
- Design and evolve Kubernetes clusters optimized for:
  - GPU-heavy workloads
  - multi-node, gang-scheduled training jobs
  - long-running and high-throughput inference
- Own control-plane architecture:
  - etcd sizing and tuning
  - API server scalability
  - scheduler performance under high churn
- Define reference cluster architectures for:
  - dedicated training clusters
  - shared multi-tenant clusters
2. GPU-Aware Scheduling & Workload Semantics
- Build or extend scheduling mechanisms for:
  - GPU topology awareness
  - NUMA and locality sensitivity
  - anti-affinity for noisy neighbors
- Integrate and deeply understand:
  - NVIDIA GPU Operator
  - device plugins
  - MIG / vGPU strategies (where applicable)
- Ensure Kubernetes scheduling decisions align with real ML workload behavior, not just resource requests.
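The gang-scheduling semantics referenced above reduce to an all-or-nothing placement decision: a training job either gets every replica scheduled or none of them. A minimal sketch in Go, assuming each replica must fit on a single node; the `node` type and `canGangSchedule` function are illustrative names, not part of any real scheduler API.

```go
package main

import "fmt"

// node describes free GPU capacity on a worker (hypothetical type,
// not a real Kubernetes API object).
type node struct {
	name     string
	freeGPUs int
}

// canGangSchedule reports whether a job of `replicas` pods, each needing
// `gpusPerPod` GPUs, can be placed all at once. Gang semantics: place
// every replica or none. A replica cannot span nodes.
func canGangSchedule(nodes []node, replicas, gpusPerPod int) bool {
	placed := 0
	for _, n := range nodes {
		placed += n.freeGPUs / gpusPerPod // replicas this node can host
	}
	return placed >= replicas
}

func main() {
	cluster := []node{
		{"gpu-node-a", 8},
		{"gpu-node-b", 6},
		{"gpu-node-c", 3},
	}
	// 4 replicas x 4 GPUs: nodes host 2 + 1 + 0 = 3 replicas -> reject.
	fmt.Println(canGangSchedule(cluster, 4, 4)) // false
	// 4 replicas x 3 GPUs: nodes host 2 + 2 + 1 = 5 replicas -> admit.
	fmt.Println(canGangSchedule(cluster, 4, 3)) // true
}
```

This is why gang scheduling must look past raw resource requests: 17 free GPUs exist in both cases above, but fragmentation across nodes decides whether the gang fits.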
3. Platform Extensions & Controllers
- Develop custom controllers/operators to:
  - manage cluster lifecycle
  - enforce policy and quotas
  - automate remediation (node drain, GPU quarantine, rescheduling)
- Design internal APIs that abstract:
  - complex GPU and networking configurations
  - cluster upgrades and maintenance workflows
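The automated remediation described above (GPU quarantine, node drain) typically boils down to a small decision function the controller evaluates on each reconcile. A hedged Go sketch: `gpuHealth`, the thresholds, and the action names are invented for illustration and do not correspond to DCGM or any real operator API.

```go
package main

import "fmt"

// gpuHealth is a simplified per-GPU telemetry view; field names and
// thresholds are illustrative, not a real DCGM/device-plugin schema.
type gpuHealth struct {
	xidErrors int // fatal driver errors seen in the window
	eccDouble int // uncorrectable ECC errors
}

type action string

const (
	actionNone       action = "none"
	actionQuarantine action = "quarantine" // taint the device out of scheduling
	actionDrainNode  action = "drain-node" // evict pods and replace the node
)

// remediateGPU: uncorrectable ECC or repeated XID errors quarantine a device.
func remediateGPU(g gpuHealth) action {
	if g.eccDouble > 0 || g.xidErrors >= 3 {
		return actionQuarantine
	}
	return actionNone
}

// remediateNode escalates: if more than half of a node's GPUs are
// quarantined, drain the whole node instead of limping along.
func remediateNode(gpus []gpuHealth) action {
	bad := 0
	for _, g := range gpus {
		if remediateGPU(g) == actionQuarantine {
			bad++
		}
	}
	switch {
	case bad*2 > len(gpus):
		return actionDrainNode
	case bad > 0:
		return actionQuarantine
	default:
		return actionNone
	}
}

func main() {
	node := []gpuHealth{{xidErrors: 5}, {eccDouble: 1}, {}}
	fmt.Println(remediateNode(node)) // drain-node: 2 of 3 GPUs are bad
}
```

In a real controller this decision would run inside a reconcile loop and emit taints and drain events rather than return values.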
4. Multi-Tenancy, Isolation & Security
- Design strong tenant isolation using:
  - namespaces, RBAC, and admission controllers
  - network policies (CNI-level enforcement)
  - GPU- and node-level isolation strategies
- Work with security engineers to:
  - enforce least privilege
  - support enterprise compliance requirements
  - ensure auditability of platform actions
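Quota and policy enforcement for tenant isolation usually lands in an admission webhook, where the decision itself is simple bookkeeping. A Go sketch under assumed semantics (per-namespace GPU quotas, deny-by-default for unknown tenants); `tenantQuota` and `admit` are illustrative names, not the Kubernetes ResourceQuota API.

```go
package main

import "fmt"

// tenantQuota tracks GPU quota and usage per namespace; illustrative
// type, not the Kubernetes ResourceQuota object.
type tenantQuota struct {
	limitGPUs int
	usedGPUs  int
}

// admit mirrors the decision a quota admission webhook would make:
// allow the pod only if the tenant stays within its GPU quota.
// Unknown tenants are denied by default (least privilege).
func admit(quotas map[string]*tenantQuota, namespace string, requestGPUs int) bool {
	q, ok := quotas[namespace]
	if !ok {
		return false
	}
	if q.usedGPUs+requestGPUs > q.limitGPUs {
		return false
	}
	q.usedGPUs += requestGPUs
	return true
}

func main() {
	quotas := map[string]*tenantQuota{
		"team-a": {limitGPUs: 8},
	}
	fmt.Println(admit(quotas, "team-a", 4)) // true  (4 of 8 GPUs used)
	fmt.Println(admit(quotas, "team-a", 8)) // false (would exceed the limit)
	fmt.Println(admit(quotas, "team-b", 1)) // false (no quota defined)
}
```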
5. Observability, Reliability & Debuggability
- Define observability standards for:
  - control-plane health
  - scheduling latency
  - GPU and node lifecycle events
- Expose clear signals to SRE and operations teams.
- Ensure every platform action is traceable, debuggable, and auditable.
Must-Have Skills:
- Deep Kubernetes internals (scheduler, etcd, control plane)
- Go-based controller development
- GPU operators and device plugins
- Distributed systems fundamentals
Good-to-Have Skills:
- Experience with multi-node GPU environments
- Hands-on experience with distributed training frameworks
- Working knowledge of the NVIDIA ecosystem (TensorRT, Triton, NeMo)
- Experience deploying and operating AI models at scale on Kubernetes clusters
- Familiarity with Slurm or other workload schedulers
Qualifications Criteria:
- B.E./B.Tech or any relevant degree