About The Role
In this role, you will build and maintain the cloud infrastructure and deployment pipelines that power a SaaS platform.
You will work on real infrastructure problems at meaningful scale, with the autonomy to improve things and the support to do it right.
Key Responsibilities
- Manage and improve Kubernetes clusters across AWS EKS, GCP GKE, and Azure AKS; handle scaling, node management, and upgrade lifecycle
- Write and maintain Terraform modules for provisioning cloud infrastructure across all three major cloud providers
- Build and maintain CI/CD pipelines using GitHub Actions, ArgoCD, and Helm for continuous deployment to production
- Implement and manage the observability stack: Prometheus, Grafana, Loki, and Datadog for metrics, logs, and tracing
- Own cloud cost governance tooling: tagging policies, FinOps dashboards, and automated resource right-sizing workflows
- Implement security best practices: secrets management (Vault/AWS Secrets Manager), network policies, RBAC, and vulnerability scanning
- Write runbooks, incident response playbooks, and contribute to SRE practices including blameless post-mortems
What We Are Looking For
- 3–8 years of DevOps, platform engineering, or SRE experience in a cloud-native environment
- Deep Kubernetes expertise: cluster operations, workload management, networking (CNI, ingress), and storage
- Terraform or Pulumi experience for IaC at scale; familiarity with Helm chart authoring
- CI/CD pipeline experience: GitHub Actions, Jenkins, ArgoCD, or equivalent
- Real production infrastructure experience on at least two of the three major clouds (AWS, GCP, Azure)
- Monitoring and observability: Prometheus/Grafana, Datadog, or CloudWatch/Azure Monitor
- Bonus: Istio service mesh, Crossplane, cost management tooling, or experience with FinOps practices
Location
- New York City (Hybrid)
- San Francisco Bay Area
- Seattle
- Dallas, TX
- Chicago, IL
- Remote strongly considered