Responsibilities & Accountabilities:
• Design, build, and operate CI/CD pipelines for AI, data, and platform services.
• Build and manage Kubernetes-based platforms for scalable agent and model workloads.
• Automate infrastructure provisioning using IaC (Terraform, Helm, etc.).
• Implement observability (logging, metrics, tracing) for AI agents, data pipelines, and platform services.
• Ensure high availability, resilience, and performance of production systems.
• Drive security, isolation, and governance for multi-tenant AI workloads.
• Work closely with data, AI, and platform engineering teams to productionize systems.
• Support release management, incident response, and root cause analysis.
Required Skills & Experience:
• 5–8 years of experience in DevOps, SRE, or Platform Engineering roles.
• Strong hands-on experience with Kubernetes and container orchestration.
• Proven experience building and operating CI/CD systems at scale.
• Experience with Infrastructure as Code (Terraform, CloudFormation, Pulumi).
• Solid understanding of:
  o Linux systems and networking fundamentals
  o Distributed systems and cloud-native architectures
• Experience supporting high-scale, production-grade platforms.
• Exposure to end-to-end SDLC and production operations.
Good to Have:
• Experience operating AI/ML or data platforms in production.
• Exposure to LLM serving, GPU workloads, or AI runtimes.
• Experience with public cloud platforms (AWS, GCP, Azure).
• Knowledge of service meshes, ingress, and networking at scale.
• Familiarity with security, secrets management, and compliance.
• Open-source contributions or experience running open-source platforms.
What We Offer:
• Ownership of core platform reliability and automation for Agentic AI.
• Opportunity to operate a sovereign, hyperscale AI and data platform.
• Strong focus on automation, reliability, and engineering excellence.
• Work alongside deeply technical data, AI, and platform teams.
• Clear growth path into Staff / Principal Platform or SRE roles.
If you enjoy building platforms that power AI systems at scale and care deeply about reliability and automation, this role is for you.