DevOps / SRE Engineer

Proximity Works • Full-time • Navi Mumbai, IN

We are looking for a DevOps / Site Reliability Engineer (L5) to own and scale the production reliability of a large-scale, AI-first platform. You will be responsible for running mission-critical workloads on cloud infrastructure, hardening Kubernetes-based systems, and ensuring high availability, performance, and cost efficiency across platform and AI services.

This role is deeply hands-on and ownership-driven. You will be trusted to run day-2 production systems end-to-end, lead incident response, and continuously raise the reliability bar for AI and data-intensive workloads.

At Proximity, you won't just keep systems running — you'll shape how reliability, observability, and operational excellence are built into the platform from the ground up.

Responsibilities

Own day-2 production operations of a large-scale, AI-first platform running on cloud infrastructure
Run, scale, and harden Kubernetes-based workloads integrated with a broad set of managed cloud services across data, messaging, AI, networking, and security
Define, implement, and operate SLIs, SLOs, and error budgets across core platform and AI services
Build and own observability end-to-end, including:
- APM
- Infrastructure monitoring
- Logs, alerts, and operational dashboards
Improve and maintain CI/CD pipelines and Terraform-driven infrastructure automation
Operate and integrate AI platform services for LLM deployments and model lifecycle management
Lead incident response, conduct blameless postmortems, and drive systemic reliability improvements
Optimize cost, performance, and autoscaling for AI, ML, and data-intensive workloads
Partner closely with backend, data, and ML engineers to ensure production readiness and operational best practices

What Matters (Non-Negotiable Alignment)

Infra owners, not operators.

This role is for engineers who design, build, and own infrastructure, not those limited to ticket-based operations.

Track record of building and operating production-grade cloud infrastructure end-to-end
Strong Kubernetes experience in real, high-traffic production environments
AWS experience is mandatory, with GCP as a strong plus
Experience operating AI/ML workloads in production
- Including GPU-based systems
Strong ownership of CI/CD systems and Infrastructure as Code
End-to-end observability ownership
- Monitoring, logging, alerting, dashboards
Comfortable making infrastructure decisions under ambiguity
Proven ability to collaborate deeply with ML and backend teams to take systems from design → production → scale

Requirements

6+ years of hands-on experience in DevOps, SRE, or Platform Engineering roles
Strong, production-grade experience with cloud platforms
- AWS required
- GCP strongly preferred, especially its Kubernetes and managed services
Proven expertise running Kubernetes at scale in live production environments
Deep hands-on experience with New Relic in complex, distributed systems
Experience operating AI/ML or LLM-driven platforms in production environments
Solid background in Terraform, CI/CD systems, cloud networking, and security fundamentals
Strong understanding of reliability engineering principles, including capacity planning, failure modes, and resilience patterns
Comfortable owning production systems end-to-end with minimal supervision
Strong communication skills and the ability to operate calmly and effectively during incidents
Experience building internal platform tooling for developer productivity

Desired Skills

Experience managing multi-cloud environments or cross-cloud integrations
Familiarity with cost optimization strategies for large-scale Kubernetes and AI workloads
Exposure to service meshes, advanced traffic management, or zero-trust security models

Benefits

Best-in-class compensation: We hire only the best, and we pay accordingly
Proximity Talks: Learn from senior engineers, platform leaders, and industry experts
Work on real-world AI systems: Operate and scale production AI platforms used at meaningful scale
Continuous learning: Grow alongside a high-caliber team that values operational excellence and engineering rigor

About Us

We are Proximity — a global team of coders, designers, product managers, geeks, and experts. We solve complex problems and build cutting-edge technology at scale.

Our team of Proxonauts is growing quickly, which means your impact on the company's success will be significant. You'll work with experienced leaders who have built and led high-performing tech and platform teams.

Here's a quick guide to getting to know us better: