Senior Site Reliability Engineer

Recrew AI • Full-time • Mumbai, IN • 13h ago

Role: Lead Site Reliability Engineer

Function: Site Reliability Engineering / Infrastructure

Location: Bangalore or Mumbai, India

Type: Full-time

Industry: Artificial Intelligence, Speech & Language Technology, Cloud Infrastructure

About Company

The company is the dedicated research and AI innovation arm of a flagship national AI program. It builds foundational technologies in Speech-to-Text, Text-to-Speech, Real-Time Conversational AI, and Multimodal Intelligence.

The mission: make human-machine communication as natural as speaking to another person, in any Indian language. The company operates next-generation data centers purpose-built for training and serving large AI models.

It partners with global AI leaders including OpenAI, Anthropic, Google, and Meta. Engineers work on problems of genuine national scale, with the resources and ambition to match.

Position Overview

As the founding Lead SRE, you will build the reliability engineering function from zero — establishing practices, tooling, and culture before the first hire joins. You will own production reliability of AI inference and training infrastructure serving millions of users across India, and directly shape how the organization thinks about availability, latency, and operational maturity for speech, language, and multimodal AI systems.

Role & Responsibilities

Establish the SRE function from scratch: define SLOs, SLIs, error budgets, on-call rotations, runbooks, and incident management processes across all production AI workloads
Architect the observability stack (metrics, logs, distributed tracing) for GPU-backed inference clusters and high-throughput STT/TTS and conversational AI services
Design and implement CI/CD pipelines and deployment automation for ML model rollouts, canary releases, and rollback mechanisms
Own capacity planning and auto-scaling strategies for model serving infrastructure handling latency-sensitive, variable AI workloads
Lead incident response and blameless post-mortems; drive systemic reliability improvements with measurable SLO impact
Partner with AI research and platform engineering teams to define production readiness criteria and harden new model deployments before launch
Hire and mentor the initial SRE team; set engineering standards, toolchain decisions, and reliability culture for the organization

Must Have Criteria

10+ years in SRE or production infrastructure engineering, with at least 2 years in a lead or staff-level role
Demonstrated experience building an SRE or platform engineering function from the ground up — tooling selection, process definition, and first hires
Hands-on experience managing Kubernetes clusters at scale (500+ nodes) in production, including GPU node pools
Proficiency in Python or Go for automation and tooling, plus hands-on infrastructure-as-code experience (Terraform or Pulumi)
Proven track record of defining and operating SLO/SLI frameworks and error budget policies for high-traffic services (10K+ RPS)
Deep expertise with observability tooling — Prometheus, Grafana, OpenTelemetry, and distributed tracing (Jaeger or similar)
Experience with cloud infrastructure on AWS or GCP, including compute, networking, storage, and managed Kubernetes services

Nice to Have

Experience running ML inference infrastructure — NVIDIA Triton, TorchServe, or similar model serving frameworks
Familiarity with low-latency streaming systems (Kafka, gRPC streaming) relevant to real-time voice AI pipelines
Prior work at an AI-first company, research lab, or large-scale consumer platform (100M+ users)
CKA/CKS certification or equivalent Kubernetes operational depth
Exposure to multi-region, multi-datacenter deployments with data residency constraints

What We Offer

Full ownership to define practices, select tooling, and build the team from scratch
Collaboration with frontier AI researchers and global partners including OpenAI, Anthropic, Google, and Meta
Competitive compensation with performance-linked incentives, benchmarked against top-tier AI research organizations
Direct impact on AI systems that will serve hundreds of millions of users across India's many languages