We are looking for a highly experienced DevOps / Site Reliability Engineer (SRE) to support and operate mission-critical production systems across hybrid environments. The ideal candidate will have strong expertise in incident management, CI/CD, Kubernetes operations, and cloud infrastructure (AWS/Azure).
You will play a key role in ensuring system reliability, deployment stability, and rapid incident resolution, working closely with engineering and support teams.
Key Responsibilities
Production Operations & Incident Response (Primary)
- Support production services and integrations that run 24x7
- Participate in the on-call rotation (primarily weekdays)
- Troubleshoot incidents across:
  - CI/CD pipelines
  - Kubernetes clusters
  - API Gateway
  - Networking and applications
- Perform incident triage, mitigation, and recovery
- Ensure safe deployments with rollback mechanisms
Technical Skills (Mandatory)
- Kubernetes Operations (deployment, troubleshooting, scaling)
- CI/CD Tools: GitHub Actions, Azure DevOps, Octopus Deploy (or equivalent)
- Cloud Platforms: AWS and/or Azure
- Infrastructure as Code: Terraform (working with existing codebases)
- Observability Tools: Prometheus, Grafana, logging systems
Scripting & Automation
- Strong scripting skills in:
  - Bash
  - Python
  - PowerShell
Good to Have
- Experience with API Gateway management
- Knowledge of Cloudflare and/or Azure API Management (APIM)
- Exposure to message queues (RabbitMQ) and caching tools (Redis)
- Experience supporting legacy/Windows-based systems