About the Role
This role owns the reliability, scalability, and performance of critical cloud infrastructure. It sits at the intersection of software engineering and systems engineering, ensuring that platform services remain highly available and responsive as traffic scales.
The team focuses on building self-service platforms, automated CI/CD pipelines, and robust monitoring frameworks. The person in this role will collaborate closely with product engineering teams to architect resilient systems and automate operational workflows.
Key Responsibilities
- Design, provision, and maintain multi-region cloud infrastructure using Terraform for Infrastructure as Code (IaC)
- Manage and optimize Kubernetes clusters at scale, focusing on container orchestration, service mesh configuration, and resource allocation
- Build and maintain automated CI/CD pipelines using GitHub Actions, GitLab CI, or Jenkins to support continuous deployment with zero downtime
- Implement comprehensive observability frameworks using Prometheus, Grafana, Datadog, or the ELK stack to proactively detect and debug system anomalies
- Participate in a shared on-call rotation, conduct blameless post-mortems, and build automation that permanently eliminates recurring operational issues
- Collaborate with security teams to enforce IAM roles, network security policies, and vulnerability scanning across the entire infrastructure stack
What We Are Looking For
- 3–6 years of experience in DevOps, Site Reliability Engineering (SRE), or Infrastructure Engineering supporting high-traffic production environments
- Strong hands-on experience with at least one major cloud provider, preferably AWS or GCP, and solid proficiency with Terraform
- Deep understanding of containerization and orchestration, specifically Docker and production-grade Kubernetes
- Proficiency in at least one scripting or programming language, such as Python, Go, or Bash, for writing automation tools and custom operators
- Solid understanding of networking fundamentals, including TCP/IP, DNS, load balancing, VPCs, and CDN configurations
- Bonus: Experience with service meshes like Istio, GitOps workflows using ArgoCD, or managing distributed databases at scale