About The Role
The role focuses on building and maintaining the foundational infrastructure that powers high-scale production systems. This position involves designing, scaling, and securing automated cloud environments to ensure maximum reliability, uptime, and performance.
The engineer will collaborate closely with product and backend engineering teams to streamline deployment pipelines, automate infrastructure provisioning, and establish robust monitoring systems to proactively address production anomalies.
Key Responsibilities
- Design and maintain scalable, secure infrastructure on AWS using Terraform for Infrastructure as Code (IaC)
- Manage and optimize Kubernetes clusters, ensuring proper resource allocation, autoscaling, and network security
- Build and improve continuous integration and continuous deployment (CI/CD) pipelines using GitHub Actions or GitLab CI
- Implement comprehensive monitoring, logging, and alerting systems using Prometheus, Grafana, and Datadog
- Participate in an on-call rotation to troubleshoot and resolve production incidents, and conduct thorough post-mortem analyses
- Collaborate with security teams to implement IAM policies, secret management, and vulnerability scanning throughout the lifecycle
What We Are Looking For
- 3–6 years of experience in DevOps, Site Reliability Engineering, or systems engineering, including managing high-traffic production environments
- Strong proficiency with AWS services (EC2, EKS, RDS, S3, IAM) and Terraform
- Deep, hands-on experience orchestrating containers with Kubernetes and Docker
- Solid scripting skills in Python, Go, or Bash for automation tasks
- Strong understanding of Linux systems, networking fundamentals (TCP/IP, DNS, VPCs), and web protocols (HTTP/HTTPS)
- Bonus: Experience with service meshes (Istio, Linkerd), compliance frameworks (SOC 2, ISO 27001), or managing multi-region databases