We build and maintain scalable infrastructure where reliability and performance are just as important as automation. We don't just follow best practices; we adapt them to real-world challenges. That's why we are looking for a DevOps Engineer who not only works with tools but also understands the infrastructure and can improve it. Our ideal candidate has deep expertise in Terraform, Kubernetes (EKS), ECS, and AWS, can both write new Terraform and Python code and understand existing code, and has a solid understanding of production environments.
Responsibilities
- Develop and implement Infrastructure as Code (Terraform).
- Manage and optimize AWS Kubernetes (EKS) and ECS clusters.
- Automate CI/CD processes and maintain existing pipelines (GitLab CI, GitHub Actions, ArgoCD, Jenkins).
- Develop monitoring and alerting systems (Prometheus, Grafana, OpenSearch/ELK).
- Ensure high availability and reliability of infrastructure under traffic spikes, failures, and attacks.
- Optimize cloud resource costs.
- Work on security aspects, including IAM, MFA, access management, and DevSecOps.
- Automate application configuration and deployment.
- Collaborate with developers, architects, and security teams to ensure reliable infrastructure.
- Develop SLA, SLI, SLO, and KPI metrics for infrastructure monitoring.
- Investigate incidents and implement strategies to prevent them.
- Participate in on-call rotations for incident response and resolution.
Requirements
- At least 5 years of experience as a DevOps Engineer.
- Deep understanding of production infrastructure: how to prevent failures, manage changes, and ensure seamless operation of services.
- Terraform (critical skill) - strong experience writing new code and understanding and reviewing existing code.
- Experience managing Kubernetes (EKS) and ECS clusters in AWS.
- Expertise in AWS services (EC2, S3, IAM, RDS, Lambda, VPC).
- Experience with CI/CD tools (GitLab CI, GitHub Actions, ArgoCD, Jenkins).
- Strong knowledge of monitoring and logging tools (Prometheus, Grafana, OpenSearch/ELK).
- Solid understanding of Linux and networking concepts.
- Ability to write scripts and read existing code in Python and Bash.
- Experience building highly available and scalable systems.
Nice-to-have
- Experience optimizing cloud resource costs (FinOps approach).
- Experience with serverless architectures (AWS Lambda, API Gateway).
- Experience working with databases (MySQL, PostgreSQL, Redis, MongoDB, Aerospike).
- Knowledge of log analysis and monitoring tools in distributed systems.
- Experience working in an Agile/Scrum environment.
This job was posted by Shajy Theyyamveettil from Affle.