About the Role
Step into a pivotal leadership role where you will architect, scale, and own the cloud-native infrastructure that powers next-generation AI and machine learning platforms.
This role sits at the intersection of Kubernetes platform engineering, MLOps, and enterprise AI workloads. You will enable data science, AI engineering, and application teams to reliably deploy, scale, and manage distributed services. The ideal candidate is a hands-on technical authority who has owned critical production environments and driven enterprise architectural decisions, and who excels as a Single Point of Contact (SPOC) bridging cross-functional and global delivery teams.
Core Responsibilities
Kubernetes Platform Engineering
- Architecture & Operations: Architect, deploy, and operate production-grade Kubernetes clusters (AKS/EKS/GKE), managing the full lifecycle, including upgrades, node pools, autoscaling, PodDisruptionBudgets (PDBs), resource quotas, network policies, ingress controllers, and service mesh.
- Advanced Troubleshooting: Lead workload and network debugging, resolving complex issues related to DNS/CNI, ingress traffic, service routing, pod health, resource bottlenecks, and rollout failures.
- Resilience: Define and execute comprehensive Disaster Recovery (DR) plans, operational runbooks, and mock failover scenarios.
ML Platform & MLOps
- Platform Building: Build and manage scalable ML training and serving platforms (Kubeflow preferred; experience with MLflow, Vertex AI, or SageMaker also considered).
- Model Lifecycle: Enable reproducible model development, lineage tracking, artifact management, and centralized model registries.
- AI Integration: Implement model serving at scale (KServe, formerly KFServing, or Seldon) and integrate AI services (OpenAI, Azure OpenAI, Vertex AI) into cloud-native platforms and CI/CD workflows. Support data science teams with workflow orchestration and GPU-based workloads.
Platform Automation & Infrastructure as Code
- GitOps & CI/CD: Implement GitOps-based workflows using Argo CD or Flux, alongside Helm templates and policy-as-code guardrails. Build automated CI/CD pipelines across all environments with integrated testing, security scans, and artifact governance.
- Provisioning: Provision and manage cloud infrastructure using Terraform (or ARM/Bicep/CloudFormation).
Observability, Reliability & Security
- Operational Standards: Establish SLIs/SLOs, reliability objectives, and platform hardening strategies. Lead incident response and root-cause analysis.
- Deep Observability: Implement metrics, logging, and distributed tracing using Prometheus/Grafana, ELK/EFK, and OpenTelemetry.
- Governance: Implement RBAC, IAM, secrets management (KMS, Key Vault), container image hardening, and network isolation. Support cost optimization and standards enforcement.
Leadership & Cross-Functional Collaboration
- Technical SPOC: Act as the primary technical point of contact for platform engineering, guiding deliverables and coordinating across teams.
- Mentorship: Coach and mentor engineers while partnering with AI/ML, data engineering, networking, and security teams to align platform capabilities with business objectives.
Required Qualifications
- Experience: 8+ years in cloud, DevOps, platform engineering, or SRE roles at a senior, lead, or principal level.
- Kubernetes Mastery: Deep, hands-on production experience with AKS, EKS, or GKE, including core Kubernetes fundamentals, cluster lifecycle management, and workload debugging.
- MLOps Proficiency: Hands-on experience building and supporting ML platforms (Kubeflow preferred) and integrating AI services.
- Automation: Strong proficiency with Terraform, GitOps, Helm, and CI/CD automation.
- Track Record: Proven ability to own, operate, and scale critical production platforms.
Preferred Qualifications
- Experience managing GPU-based ML workloads.
- Expertise in model monitoring, drift detection, and AI model governance.
- Deep cloud expertise in Azure (preferred), GCP, or AWS.
- Experience defining platform KPIs and enabling self-service infrastructure capabilities.
Candidate Attributes
- Hands-on, ownership-driven mindset with strong operational discipline.
- Exceptional debugging instincts for complex open-source ecosystems.
- Clear, concise communicator capable of simplifying complex architectural topics for diverse stakeholders.