Senior Site Reliability Engineer (SRE) / DevOps Engineer

Umanist NA • Full-time • Pune City, IN • ₹ 2,000,000 - ₹ 2,500,000 / year

Additional Important Note for Applicants

Currently, only immediate joiners (who have already completed their notice period) or candidates serving a notice period of up to 30 days will be considered for this opportunity.
Candidates with longer notice periods may not be considered at this stage due to urgent project requirements.

Important Note for Applicants

Please read the job description carefully before applying. Apply only if your experience, technical skills, and notice period match the mandatory requirements in this posting. Profiles that do not meet the core criteria may be rejected during screening, which costs unnecessary time and effort on both sides. We appreciate your understanding and cooperation.

Location: Pune (Work From Office)

Experience: 10+ Years

Shift Timing: 3:00 PM – 12:00 AM (Monday–Friday)

On-Call Requirement: 24/7 Production Support Rotation

Role Overview

We are looking for a highly experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer to ensure the reliability, scalability, security, observability, and performance of mission-critical production systems. This role requires strong expertise in cloud infrastructure, Kubernetes, observability, incident management, and modern SRE practices.

The ideal candidate will balance operational excellence with engineering-driven improvements, focusing on automation, reliability, performance optimization, and reducing operational toil.

Key Responsibilities

Incident Management & Reliability

Participate in 24/7 on-call rotation and production support.
Diagnose, troubleshoot, and resolve critical production incidents.
Lead Root Cause Analysis (RCA) and post-incident reviews.
Improve MTTR and overall operational efficiency.
Define and manage SLIs, SLOs, SLAs, and Error Budgets.
Drive reliability improvements, capacity planning, and disaster recovery readiness.
Reduce operational toil through automation and engineering solutions.

Cloud & Infrastructure

Design, implement, and manage cloud infrastructure on Microsoft Azure.
Manage Kubernetes clusters and containerized applications.
Implement Infrastructure as Code using Terraform.
Manage Helm deployments and Git-based CI/CD workflows.
Support highly available, scalable, and secure production environments.

Observability & Monitoring

Build and maintain monitoring and observability platforms.
Implement distributed tracing using OpenTelemetry.
Establish monitoring based on Golden Signals:
- Latency
- Traffic
- Errors
- Saturation
Design symptom-based alerting and proactive monitoring strategies.
Improve logging, tracing, metrics collection, and performance visibility.

Security & Compliance

Implement cloud security best practices.
Manage IAM, secrets management, and network security.
Support vulnerability remediation and compliance initiatives.

Must-Have Skills

Experience

10+ years in DevOps, Infrastructure Engineering, Cloud Operations, or Site Reliability Engineering.
6–7+ years of hands-on experience with DevOps tools and cloud-native infrastructure.
Experience supporting highly available production environments.

Cloud & Infrastructure

Microsoft Azure (Mandatory)
Kubernetes
Terraform
Helm
GitHub / GitLab / Azure Repos

Monitoring & Observability

OpenTelemetry
Prometheus
Grafana
Datadog
Azure Monitor
Distributed Tracing
Metrics, Logs, and Observability Best Practices
Golden Signals Monitoring

SRE Practices

Incident Response & Production Support
On-Call Operations
Root Cause Analysis (RCA)
SLI / SLO / SLA Management
Error Budgets
Capacity Planning
Reliability Engineering
Toil Reduction

System & Programming Skills

Python
Bash Scripting
Linux Administration
Networking Fundamentals (DNS, TCP/IP, Load Balancing, SSL/TLS)

Good-to-Have Skills

Cloud Platforms

AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)
Google Cloud Platform (GCP)

Programming

Go (Golang)

AI & Cloud-Native Workloads

Azure AI Services
AI Foundry
RAG (Retrieval-Augmented Generation) Infrastructure
AI/ML Production Workloads

Additional Technologies

OpenSearch
ELK Stack
Distributed Systems Architecture

Advanced Observability

Building observability frameworks using OpenTelemetry
Performance Engineering and System Optimization

Preferred Candidate Profile

Strong ownership mindset and accountability.
Excellent troubleshooting and debugging skills.
Experience handling critical production incidents calmly and effectively.
Deep understanding of SRE principles and operational excellence.
Strong collaboration and communication skills.
Passion for automation, scalability, and continuous improvement.

Skills: OpenTelemetry, Azure Monitor, GitLab, Datadog, Helm, Grafana, Networking Fundamentals, DevOps, DevOps Tools, Linux Administration, Kubernetes, Site Reliability Engineering, Cloud-Native Infrastructure, Infrastructure Engineering, Cloud Operations, GitHub, Production Environments, Microsoft Azure, Reliability Engineering, Distributed Tracing, SRE Practices, Cloud & Infrastructure, On-Call Operations, Prometheus, Metrics, Terraform, Python