About the Role
This role owns the release engineering backbone for large‑scale AI platforms running on thousands of GPUs. You will design CI/CD systems, deployment safety standards, and rollback mechanisms that allow teams to ship frequently while maintaining reliability, security, and cost discipline.
You will act as the bridge between engineering and operations, setting org‑wide standards for how code reaches production.
Job Details
- Design and operate team‑wide CI/CD pipelines with automated testing gates, artifact management, and GitOps‑based deployments
- Implement release engineering best practices: repeatable releases, automated rollback, and change management workflows
- Build and manage test infrastructure for distributed systems (multi‑node jobs, long‑running workloads, environment provisioning)
- Establish engineering standards: repo structure, PR workflows, code quality gates, dependency and security scanning
- Mentor engineers on deployment safety, incident response, and blameless postmortems
Job Requirements
- 5–7 years of DevOps, CI/CD, or release engineering experience in production environments
- Deep hands‑on experience with CI/CD systems (GitHub Actions, GitLab CI, Argo CD, Jenkins) and multi‑stage pipelines
- Strong Kubernetes expertise: rollout strategies, failure modes, probes, resource limits, and debugging under load
- Proven experience with safe deployment patterns (rolling, canary, blue/green), rollback, and zero/low‑downtime releases
- Strong automation skills (Python, Go, or Bash) and sound practices for configuration management, secrets handling, and access hygiene