Role description
We are seeking a skilled Site Reliability Engineer to support the administration of Azure Kubernetes Service (AKS) clusters running critical, always-on middleware that processes thousands of transactions per second (TPS). The ideal candidate will operate with a mindset aligned to achieving 99.999% (five-nines) availability.
Key Responsibilities:
- Own and manage AKS cluster deployments, cutovers, base image updates, and daily operational tasks.
- Test and implement Infrastructure as Code (IaC) changes using best practices.
- Apply software engineering principles to IT operations for maintaining scalable and reliable production environments.
- Write and maintain IaC as well as automation code for:
- Monitoring and ing
- Log analysis
- Disaster recovery testing
- Incident response
- Documentation-as-code
Mandatory Skills:
- Strong experience with Terraform
- In-depth knowledge of Azure Cloud
- Proficiency in Kubernetes cluster creation and lifecycle management (deployment-only experience is not sufficient)
- Hands-on experience with CI/CD tools (GitHub Actions preferred)
- Bash and Python scripting skills
Desirable Skills:
- Exposure to Azure Databricks and Azure Data Factory
- Experience with secret management using HashiCorp Vault
- Familiarity with monitoring tools (any)
Skills
Azure, Kubernetes, Terraform, DevOps