Industry & Sector: Operating in the Cloud Infrastructure and Enterprise SaaS sector, our high-availability engineering team builds and runs resilient, containerised production platforms that support mission-critical customer applications. We deliver scalable, observable, and secure cloud-native services for global users.
Role: Site Reliability Engineer (SRE) — On-site (India)
Role & Responsibilities
- Design, deploy, and maintain production-grade Kubernetes-based platforms to ensure high availability, scalability, and security.
- Author and maintain Infrastructure-as-Code to provision and manage cloud resources, enabling repeatable, auditable deployments.
- Build and operate CI/CD pipelines and automated release processes to accelerate safe delivery of features and fixes.
- Implement observability: metrics, logging, tracing, and alerting; define SLOs/SLIs and automate incident detection and response.
- Lead incident management and post-incident reviews to drive reliability improvements and reduce MTTR.
- Collaborate with development and product teams to optimize performance, reduce costs, and harden the platform for production traffic.
Skills & Qualifications
Must-Have
- Kubernetes
- Docker
- Terraform
- AWS
- Linux
- Prometheus
- Grafana
- Jenkins
Preferred
Qualifications
- Proven experience operating production cloud infrastructure and container platforms (demonstrable projects or on-call history preferred).
- Strong troubleshooting skills across distributed systems, networking, and storage.
- Willingness to work on-site in India and participate in an on-call rotation.
Benefits & Culture Highlights
- Hands-on exposure to large-scale cloud-native systems and the opportunity to drive reliability best practices.
- Collaborative engineering culture with a focus on learning, ownership, and measurable impact.
- Competitive compensation and benefits aligned to on-site roles in India.
We are looking for proactive SREs who enjoy end-to-end ownership of platform reliability, automation-first engineering, and close collaboration with developers to deliver reliable services at scale. Apply if you thrive on solving complex operational challenges and driving continuous improvement.
Skills: AWS, Prometheus, Kubernetes, SRE, Jenkins, Grafana, Terraform, Linux, Docker