Design, deploy, and manage highly available and scalable infrastructure on AWS using services such as Lambda, API Gateway, EC2, RDS, S3, IAM, and ElastiCache (Redis).
Manage and optimize VM-based and cloud-native workloads.
Apply security best practices and drive cost optimization and performance tuning across cloud environments.
Containers & Orchestration
Build, deploy, and operate containerized applications using Docker and Kubernetes.
Manage Kubernetes clusters, including Helm charts, Ingress/ALB configurations, and cluster-level troubleshooting.
Implement and maintain observability, logging, and monitoring for Kubernetes workloads.
DevOps & Infrastructure as Code
Develop and maintain Terraform modules and execution plans for infrastructure provisioning.
Design and support CI/CD pipelines using Jenkins, enabling automated build, test, and deployment workflows.
Manage configuration, secrets, and system state using Chef cookbooks and secure secrets management practices.
Collaborate with development teams to improve deployment velocity and reliability.
Observability & Reliability
Implement and maintain end-to-end observability using OpenTelemetry.
Operate and scale monitoring, logging, and tracing stacks, including Prometheus, Thanos, Loki, Tempo, Grafana, and InfluxDB.
Build custom metrics scrapers and centralized alerting systems to proactively detect and resolve issues.
Lead incident response, root cause analysis, and postmortems.
Backend & Data Platforms
Support backend services written in Python, exposing and consuming REST APIs.
Work with relational databases (RDBMS) and integrate infrastructure with big data platforms.
Ensure infrastructure supports high-throughput, low-latency data workloads.
Required Qualifications
8+ years of experience in DevOps, Cloud Infrastructure, or Site Reliability Engineering.
Strong hands-on experience with AWS cloud services.
Deep expertise in Docker and Kubernetes in production environments.
Proven experience with Terraform, Jenkins, and configuration management tools.
Strong understanding of observability, monitoring, and alerting systems.
Solid programming and scripting experience, preferably in Python.
Experience working with APIs, databases, and data-intensive systems.