Role Overview
We are seeking a DevOps Engineering Manager to lead Cloud and Platform engineering
for AI-first teams, operating at the intersection of large-scale containerized production
systems and next-generation Agentic AI and LLM deployments.
This role is responsible for building and operating highly reliable, secure, and scalable
platforms that support mature microservices-based workloads while enabling rapid
experimentation and production rollout of Agentic AI systems. You will work closely with
AI/ML, platform, and product teams across India and Europe to operationalize AI
solutions at scale.
Key Responsibilities
• Define and own the cloud and platform architecture for large-scale containerized
microservices and Agentic AI / LLM workloads, ensuring scalability, reliability, and cost
efficiency.
• Lead CI/CD platform engineering, enabling automated build, test, security scanning, and
deployment for backend services, React-based web applications, and mobile app backends.
• Enable production-grade AI platforms, supporting agent frameworks, vector databases,
prompt pipelines, and inference.
• Define Infrastructure as Code standards, cloud account structures, networking, and
environment provisioning across AWS and secondary clouds.
• Implement and enforce SRE practices: define SLIs/SLOs, error budgets, capacity and reliability
targets, and lead incident response and post-incident reviews.
• Ensure end-to-end observability across services and AI workloads, including logs, metrics,
traces, model performance, and cost visibility.
• Embed security, compliance, and governance by design, including IAM, secrets management,
network security, vulnerability management, and AI-specific controls.
• Make informed build vs. buy decisions, evaluate emerging cloud and AI infrastructure
technologies, and drive continuous platform modernization.
Must Have
• 10+ years of experience in DevOps / Cloud / Platform Engineering, including
people management and technical leadership
• Deep hands-on expertise with AWS, with working exposure to GCP and Azure in
multi-cloud or hybrid environments
• Proven experience operating large-scale, production-grade containerized
workloads in global teams, with a strong understanding of high availability, fault tolerance,
and capacity planning
• Practical experience supporting AI/ML or LLM workloads in production environments
• Strong expertise in Kubernetes and Docker, including cluster operations, workload isolation,
ingress, service meshes, and deployment strategies
• Advanced experience with Infrastructure as Code for cloud provisioning, networking, security
controls, and environment standardization across multiple stages
• Solid understanding of observability and reliability engineering, including metrics, logging,
tracing, alerting, and defining SLIs/SLOs for distributed systems and AI services
• Hands-on experience with cloud security and compliance practices, including IAM design,
secrets management, vulnerability scanning, and secure deployment patterns, especially for
AI platforms
• Knowledge of cloud cost optimization (FinOps), especially for AI workloads
• Strong background in product-based organizations solving real customer-facing problems
Leadership and Mindset
• Strong AI-first mindset with curiosity and adaptability to turn rapid AI innovation into stable
production systems
• Strategic thinker with hands-on technical depth
• Excellent communication and collaboration skills in global, distributed teams
• Ownership-driven leader who builds accountable teams and fosters a culture of reliability,
automation, and continuous improvement