Role Description
Site Reliability Engineer (SRE)
Overview
Site Reliability Engineering (SRE) is a discipline that combines software engineering and systems engineering to build and operate large‑scale, distributed, fault‑tolerant systems. SRE teams ensure that internal and external services consistently meet or exceed reliability, availability, and performance expectations while adhering to strong engineering principles.
SREs apply an engineering approach to operational challenges by designing automated, scalable solutions for production systems. The role emphasizes operational excellence, proactive incident prevention, blameless postmortems, and continuous improvement to reduce toil and improve system resilience. A culture of diversity, intellectual curiosity, problem‑solving, and openness is central to success.
What You Will Do
- Own and manage system uptime across cloud‑native (AWS, GCP) and hybrid architectures
- Design and implement Infrastructure as Code (IaC) using tools such as Terraform, cloud CLIs, and SDKs
- Build and maintain CI/CD pipelines for application and infrastructure deployments using Jenkins and cloud‑native toolchains
- Develop automated tooling to safely deploy production changes
- Create and maintain detailed runbooks for detection, remediation, and service restoration
- Troubleshoot and triage complex issues across distributed systems and service dependencies
- Participate in on‑call rotations for high‑severity incidents and drive improvements to reduce mean time to recovery (MTTR)
- Lead blameless postmortems and own corrective actions to prevent recurrence
Required Experience
- Bachelor’s degree in Computer Science or a related technical field (or equivalent practical experience)
- 7–10 years of experience in software engineering, systems administration, database administration, or networking
- 4+ years of hands‑on experience with public cloud platforms
- Strong experience monitoring infrastructure and application availability to meet performance objectives
- Hands‑on expertise with GCP infrastructure services and automated provisioning using Terraform
- Experience rebuilding GCP VM instances using Terraform and Jenkins pipelines
- Experience provisioning GCP resources such as GCE, GKE, storage, and networking components via automation
- Experience configuring monitoring and dashboards for microservices using Cloud Monitoring (formerly Stackdriver), Datadog, and AppDynamics
- Experience developing and enhancing automation using Terraform, shell scripting, and Python
- Experience implementing and troubleshooting IAM policies across GCP and AWS, including custom roles
- Experience implementing blue/green deployment strategies in GCP environments
- Hands‑on experience setting up CI/CD pipelines using Git and Jenkins
- Strong experience creating and maintaining Helm charts for GKE resources
- Broad understanding of systems, storage, networking, security, and databases
What Could Set You Apart
DevSecOps Excellence
- Leads DevSecOps practices to improve system resilience and reliability
- Designs, develops, tests, documents, and maintains complex automation and services
- Explores and introduces new tools, methods, and engineering best practices
- Continuously improves processes and tooling to deliver well‑engineered solutions
- Supports team growth through reviews, mentoring, and collaboration
Operational Excellence
- Drives moderately complex initiatives from planning through completion
- Defines and monitors key availability and performance metrics
- Identifies and implements improvements to streamline and optimize operations
Systems Thinking
- Applies best practices to improve system integration and reliability
- Assesses technology trends and recommends improvements to availability and performance standards
Technical Communication
- Clearly communicates complex technical concepts to diverse stakeholders
- Demonstrates strong written and verbal communication skills
- Collaborates effectively across teams and proactively resolves conflicts
Troubleshooting & Problem Management
- Applies structured approaches to diagnosing and resolving system issues
- Coordinates investigation, remediation, and implementation of fixes
- Analyzes patterns and trends to recommend improvements in system reliability
Skills
site reliability engineering, terraform, cloud sdk, cicd, devsecops