Role: DevOps/Infrastructure Technical Lead
Function: DevOps/Platform Engineering
Location: Mumbai/Bangalore
Type: Full-time
Industry: AI/ML Infrastructure, Technology
About Company
The company is building the AI layer for Bharat at India-scale. Backed by partnerships with global tech leaders like Meta and Google, the team is creating AI that serves the entire Indian user base—across languages, contexts, and daily needs. This is AI designed for real adoption, not experiments.
They bring a rare combination of deep India-first AI capability and unmatched India-scale distribution. The focus is a platform-and-product stack that makes AI useful, reliable, and safe for everyday consumers. It is engineered from day one for massive scale: 100M+ users early on, with latency, cost, reliability, and safety constraints designed to be 1B-ready.
If you want to be part of a fast-moving, high-ambition team building technology with real-world reach, this is that opportunity. The culture emphasises engineering excellence, strong collaboration, and tangible impact across sectors that matter to India—while building toward a category-defining consumer AI experience.
Position Overview
Lead the development of platform infrastructure for Reliance Intelligence's AI services, including GPU clusters, CI/CD pipelines, observability systems, and model deployment frameworks. This hands-on technical leadership role focuses on building internal automation tools and frameworks that accelerate engineering velocity across our AI platform.
Responsibilities
- Design and implement scalable GPU cluster infrastructure using GKE for large-scale AI model training and inference
- Build and maintain CI/CD pipelines using GCP Cloud Build, Cloud Run, and Kubernetes for ML model deployment automation
- Develop comprehensive observability solutions using Prometheus, Grafana, and the ELK stack for monitoring distributed AI systems
- Create internal frameworks and scaffolds leveraging Vertex AI and BigQuery ML that improve developer productivity
- Architect VectorDB infrastructure using Cloud SQL and Firestore to support high-performance similarity search systems
- Implement LLMOps workflows in Azure for model lifecycle management and deployment orchestration
- Lead troubleshooting efforts for complex distributed systems and performance optimization initiatives
Must-Have Requirements
- 10-20 years of experience in DevOps, SRE, or platform engineering roles
- Expert-level proficiency in Kubernetes and Google Kubernetes Engine (GKE), including managing production clusters with 1000+ nodes
- Deep hands-on experience with Docker containerization and Helm charts for application deployment
- Advanced experience with service mesh technologies like Istio for microservices communication
- Deep hands-on experience with core GCP services including Cloud Run, Vertex AI, and BigQuery ML
- Advanced experience with Terraform for infrastructure as code on GCP
- Experience with observability tools including Prometheus, Grafana, the ELK stack, and Datadog for distributed systems monitoring
Nice to Have
- Experience with the Ray framework for distributed ML workloads and parallel processing
- Background in AI/ML infrastructure at hyperscale with GPU cluster management
- Experience with vector databases and similarity search systems at enterprise scale
- Previous technical leadership experience mentoring DevOps/platform engineering teams
What We Offer
- Opportunity to build AI infrastructure at unprecedented scale with cutting-edge technology
- Work with global tech leaders and contribute to India's AI ecosystem development
- Competitive compensation package with equity participation in Reliance's AI venture
- Access to world-class resources and partnerships with Meta, Google, and other tech giants
- Impact millions of users through AI solutions across education, healthcare, and agriculture