SRE/DevOps - Lead I - DevOps Engineering

UST • Full-time • Pune Division, IN

Role Description

Site Reliability Engineer (SRE)

Overview

Site Reliability Engineering (SRE) is a discipline that combines software engineering and systems engineering to build and operate large‑scale, distributed, fault‑tolerant systems. SRE teams ensure that internal and external services consistently meet or exceed reliability, availability, and performance expectations while adhering to strong engineering principles.

SREs apply an engineering approach to operational challenges by designing automated, scalable solutions for production systems. The role emphasizes operational excellence, proactive incident prevention, blameless postmortems, and continuous improvement to reduce toil and improve system resilience. A culture of diversity, intellectual curiosity, problem‑solving, and openness is central to success.

What You Will Do

Own and manage system uptime across cloud‑native (AWS, GCP) and hybrid architectures
Design and implement Infrastructure as Code (IaC) using tools such as Terraform, cloud CLIs, and SDKs
Build and maintain CI/CD pipelines for application and infrastructure deployments using Jenkins and cloud‑native toolchains
Develop automated tooling to safely deploy production changes
Create and maintain detailed runbooks for detection, remediation, and service restoration
Troubleshoot and triage complex issues across distributed systems and service dependencies
Participate in on‑call rotations for high‑severity incidents and drive improvements to reduce MTTR
Lead blameless postmortems and own corrective actions to prevent recurrence

Required Experience

Bachelor’s degree in Computer Science or a related technical field (or equivalent practical experience)
7–10 years of experience in software engineering, systems administration, database administration, or networking
4+ years of hands‑on experience with public cloud platforms
Strong experience monitoring infrastructure and application availability to meet performance objectives
Hands‑on expertise with GCP infrastructure services and automated provisioning using Terraform
Experience rebuilding GCP VM instances using Terraform and Jenkins pipelines
Experience provisioning GCP resources such as GCE, GKE, storage, and networking components via automation
Experience configuring monitoring and dashboards for microservices using Cloud Monitoring (formerly Stackdriver), Datadog, and AppDynamics
Experience developing and enhancing automation using Terraform, Shell, and Python
Experience implementing and troubleshooting IAM policies across GCP and AWS, including custom roles
Experience implementing blue/green deployment strategies in GCP environments
Hands‑on experience setting up CI/CD pipelines using Git and Jenkins
Strong experience creating and maintaining Helm charts for GKE resources
Broad understanding of systems, storage, networking, security, and databases

What Could Set You Apart

DevSecOps Excellence

Leads DevSecOps practices to improve system resilience and reliability
Designs, develops, tests, documents, and maintains complex automation and services
Explores and introduces new tools, methods, and engineering best practices
Continuously improves processes and tooling to deliver well‑engineered solutions
Supports team growth through reviews, mentoring, and collaboration

Operational Excellence

Drives execution of moderately complex work initiatives
Defines and monitors key availability and performance metrics
Identifies and implements improvements to streamline and optimize operations

Systems Thinking

Applies best practices to improve system integration and reliability
Assesses technology trends and recommends improvements to availability and performance standards

Technical Communication