Additional Important Note For Applicants
- Currently, only immediate joiners (who have already completed their notice period) or candidates serving a notice period of up to 30 days will be considered for this opportunity.
- Candidates with longer notice periods may not be considered at this stage due to urgent project requirements.
Important Note for Applicants
Kindly read the job description carefully before applying. Please apply only if your experience, technical skills, and notice period align with the mandatory requirements in this posting. Profiles that do not meet the core criteria may be rejected during screening, which costs unnecessary time and effort on both sides. We appreciate your understanding and cooperation.
Senior Site Reliability Engineer (SRE) / DevOps Engineer
Location: Pune (Work From Office)
Experience: 10+ Years
Shift Timing: 3:00 PM – 12:00 AM (Monday–Friday)
On-Call Requirement: 24/7 Production Support Rotation
Role Overview
We are looking for a highly experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer to ensure the reliability, scalability, security, observability, and performance of mission-critical production systems. This role requires strong expertise in cloud infrastructure, Kubernetes, observability, incident management, and modern SRE practices.
The ideal candidate will balance operational excellence with engineering-driven improvements, focusing on automation, reliability, performance optimization, and reducing operational toil.
Key Responsibilities
Incident Management & Reliability
- Participate in 24/7 on-call rotation and production support.
- Diagnose, troubleshoot, and resolve critical production incidents.
- Lead Root Cause Analysis (RCA) and post-incident reviews.
- Improve MTTR and overall operational efficiency.
- Define and manage SLIs, SLOs, SLAs, and Error Budgets.
- Drive reliability improvements, capacity planning, and disaster recovery readiness.
- Reduce operational toil through automation and engineering solutions.
Cloud & Infrastructure
- Design, implement, and manage cloud infrastructure on Microsoft Azure.
- Manage Kubernetes clusters and containerized applications.
- Implement Infrastructure as Code using Terraform.
- Manage Helm deployments and Git-based CI/CD workflows.
- Support highly available, scalable, and secure production environments.
Observability & Monitoring
- Build and maintain monitoring and observability platforms.
- Implement distributed tracing using OpenTelemetry.
- Establish monitoring based on Golden Signals:
  - Latency
  - Traffic
  - Errors
  - Saturation
- Design symptom-based alerting and proactive monitoring strategies.
- Improve logging, tracing, metrics collection, and performance visibility.
Security & Compliance
- Implement cloud security best practices.
- Manage IAM, secrets management, and network security.
- Support vulnerability remediation and compliance initiatives.
Must-Have Skills
Experience
- 10+ years in DevOps, Infrastructure Engineering, Cloud Operations, or Site Reliability Engineering.
- 6–7+ years of hands-on experience with DevOps tools and cloud-native infrastructure.
- Experience supporting highly available production environments.
Cloud & Infrastructure
- Microsoft Azure (Mandatory)
- Kubernetes
- Terraform
- Helm
- GitHub / GitLab / Azure Repos
Monitoring & Observability
- OpenTelemetry
- Prometheus
- Grafana
- Datadog
- Azure Monitor
- Distributed Tracing
- Metrics, Logs, and Observability Best Practices
- Golden Signals Monitoring
SRE Practices
- Incident Response & Production Support
- On-Call Operations
- Root Cause Analysis (RCA)
- SLI / SLO / SLA Management
- Error Budgets
- Capacity Planning
- Reliability Engineering
- Toil Reduction
System & Programming Skills
- Python
- Bash Scripting
- Linux Administration
- Networking Fundamentals (DNS, TCP/IP, Load Balancing, SSL/TLS)
Good-to-Have Skills
Cloud Platforms
- AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)
- Google Cloud Platform (GCP)
Programming
AI & Cloud-Native Workloads
- Azure AI Services
- AI Foundry
- RAG (Retrieval-Augmented Generation) Infrastructure
- AI/ML Production Workloads
Additional Technologies
- OpenSearch
- ELK Stack
- Distributed Systems Architecture
Advanced Observability
- Building observability frameworks using OpenTelemetry
- Performance Engineering and System Optimization
Preferred Candidate Profile
- Strong ownership mindset and accountability.
- Excellent troubleshooting and debugging skills.
- Experience handling critical production incidents calmly and effectively.
- Deep understanding of SRE principles and operational excellence.
- Strong collaboration and communication skills.
- Passion for automation, scalability, and continuous improvement.
Skills: opentelemetry, azure monitor, gitlab, datadog, helm, grafana, networking fundamentals, devops, devops tools, linux administration, kubernetes, site reliability engineering, cloud-native infrastructure, infrastructure engineering, cloud operations, github, production environments, microsoft azure, reliability engineering, distributed tracing, sre practices, cloud & infrastructure, on-call operations, prometheus, metrics, terraform, python