Role: Lead Site Reliability Engineer
Function: Site Reliability Engineering / Infrastructure
Location: Bangalore or Mumbai, India
Type: Full-time
Industry: Artificial Intelligence, Speech & Language Technology, Cloud Infrastructure
About the Company
The company is the dedicated research and AI innovation arm of a flagship national AI program. It builds foundational technologies in Speech-to-Text, Text-to-Speech, Real-Time Conversational AI, and Multimodal Intelligence.
The mission: make human-machine communication as natural as speaking to another person, in any Indian language. The company operates next-generation data centers purpose-built for training and serving large AI models.
It partners with global AI leaders including OpenAI, Anthropic, Google, and Meta. Engineers work on problems of genuine national scale, with the resources and ambition to match.
Position Overview
As the founding Lead SRE, you will build the reliability engineering function from zero — establishing practices, tooling, and culture before the first hire joins. You will own production reliability of AI inference and training infrastructure serving millions of users across India, and directly shape how the organization thinks about availability, latency, and operational maturity for speech, language, and multimodal AI systems.
Role & Responsibilities
- Establish the SRE function from scratch: define SLOs, SLIs, error budgets, on-call rotations, runbooks, and incident management processes across all production AI workloads
- Architect the observability stack (metrics, logs, distributed tracing) for GPU-backed inference clusters and high-throughput STT/TTS and conversational AI services
- Design and implement CI/CD pipelines and deployment automation for ML model rollouts, canary releases, and rollback mechanisms
- Own capacity planning and auto-scaling strategies for model serving infrastructure handling latency-sensitive, variable AI workloads
- Lead incident response and blameless post-mortems; drive systemic reliability improvements with measurable SLO impact
- Partner with AI research and platform engineering teams to define production readiness criteria and harden new model deployments before launch
- Hire and mentor the initial SRE team; set engineering standards, toolchain decisions, and reliability culture for the organization
Must Have Criteria
- 10+ years in SRE or production infrastructure engineering, with at least 2 years in a lead or staff-level role
- Demonstrated experience building an SRE or platform engineering function from the ground up — tooling selection, process definition, and first hires
- Hands-on experience managing Kubernetes clusters at scale (500+ nodes) in production, including GPU node pools
- Proficiency in Python or Go for automation and tooling, plus hands-on infrastructure-as-code experience (Terraform or Pulumi)
- Proven track record of defining and operating SLO/SLI frameworks and error budget policies for high-traffic services (10K+ RPS)
- Deep expertise with observability tooling — Prometheus, Grafana, OpenTelemetry, and distributed tracing (Jaeger or similar)
- Experience with cloud infrastructure on AWS or GCP, including compute, networking, storage, and managed Kubernetes services
Nice to Have
- Experience running ML inference infrastructure — NVIDIA Triton, TorchServe, or similar model serving frameworks
- Familiarity with low-latency streaming systems (Kafka, gRPC streaming) relevant to real-time voice AI pipelines
- Prior work at an AI-first company, research lab, or large-scale consumer platform (100M+ users)
- CKA/CKS certification or equivalent Kubernetes operational depth
- Exposure to multi-region, multi-datacenter deployments with data residency constraints
What We Offer
- Full ownership to define practices, select tooling, and build the team from scratch
- Collaboration with frontier AI researchers and global partners including OpenAI, Anthropic, Google, and Meta
- Competitive compensation with performance-linked incentives, benchmarked to top-tier AI research organizations
- Direct impact on AI systems that will serve hundreds of millions of users across India's linguistic diversity