Key Responsibilities
Lead and manage the Site Reliability Engineering (SRE) team to ensure platform stability, scalability, availability, and performance.
Design, implement, and optimize cloud infrastructure solutions on AWS.
Drive DevOps transformation initiatives and establish Infrastructure as Code (IaC) best practices.
Build and maintain automated deployment pipelines using CI/CD tools.
Manage and optimize containerized environments using Docker and Kubernetes.
Implement infrastructure and application automation to improve operational efficiency.
Establish monitoring, logging, alerting, and observability frameworks for proactive issue detection and resolution.
Define and monitor SLIs, SLOs, and SLAs to maintain service reliability.
Lead incident management, root cause analysis (RCA), and continuous improvement initiatives.
Collaborate with Development, Security, Architecture, and Operations teams to improve platform resilience.
Drive capacity planning, performance tuning, disaster recovery, and cost optimization strategies.
Mentor and develop engineering teams while promoting a reliability engineering culture.
Required Skills
Strong experience in AWS Cloud Services (EC2, EKS, ECS, Lambda, VPC, IAM, CloudWatch, RDS, S3).
Hands-on expertise in DevOps and Cloud Infrastructure Management.
Experience with Infrastructure as Code (Terraform / CloudFormation / Ansible).
Strong knowledge of CI/CD tools (Jenkins, GitHub Actions, GitLab CI, Azure DevOps).
Expertise in Docker and Kubernetes (deployment, orchestration, scaling, troubleshooting).
Experience with Observability tools (Prometheus, Grafana, ELK, Splunk, Datadog, OpenTelemetry).
Proficiency in scripting and automation using Python, Shell, or Go.
Experience in incident response, production support, and reliability engineering.
Strong understanding of networking, security, and cloud architecture principles.
Preferred Qualifications
Bachelor’s/Master’s degree in Computer Science, Engineering, or a related field.
AWS certifications (Solutions Architect / DevOps Engineer).
Experience managing distributed teams and enterprise-scale environments.
Exposure to SRE practices, platform engineering, and cloud-native architecture.