We are looking for a DevOps Engineer with 3–5 years of experience to help design, build, and scale reliable cloud infrastructure and deployment pipelines.
This role works closely with engineering teams to improve delivery velocity, system reliability, observability, and security across production environments. You will own critical infrastructure and operational workflows that power our platform at scale.
Key Responsibilities
- Cloud Infrastructure Ownership: Own the design, evolution and day-to-day operation of cloud infrastructure on GCP. Manage multi-environment infrastructure (dev/staging/prod/preview/playground) with clear isolation and promotion paths. Build and maintain Infrastructure-as-Code, with reusable modules and best practices. Ensure high availability, scalability, backups, disaster recovery and secure access controls. Act as a point of escalation for infrastructure and platform-related issues.
- Backend & Data Ops: Operate and support production microservices running on Kubernetes (GKE). Support the reliability and scalability of data systems and pipelines from an infra/ops standpoint. Own operational aspects of databases, background jobs, queues and workflows. Partner with backend and data engineers to improve performance, failure isolation, observability and ease of debugging in production.
- Release Engineering: Design, maintain and evolve CI/CD pipelines. Enable one-click deployments, safe release strategies (blue-green, canary, rollback) and environment promotion workflows. Build and maintain ephemeral preview environments per PR. Improve developer experience through standardized service templates, opinionated base images and self-service infrastructure tooling. Continuously reduce deployment friction and operational toil.
- Observability: Own end-to-end observability across services using OpenTelemetry. Design and maintain dashboards, alerts and on-call hygiene. Define and track SLIs, SLOs, SLAs and error budgets. Lead incident response, postmortems and reliability improvements. Ensure all systems follow 12-factor app principles and are debuggable by default.
- Security & Compliance: Apply security-by-default and least-privilege principles across cloud, Kubernetes and CI/CD. Own and manage IAM, secrets management and access controls for production systems. Ensure encryption in transit and at rest for services, databases and data pipelines. Secure Kubernetes workloads using RBAC, namespace isolation, network policies and container best practices. Integrate vulnerability scanning and security checks into CI/CD pipelines. Ensure PHI/PII is handled safely across applications, data platforms and AI systems. Maintain audit trails and traceability for production changes and access. Support infrastructure practices aligned with HIPAA and SOC 2 security requirements.
- MLOps & LLMOps: Enable production deployment of fine-tuned LLMs. Support GPU-backed inference workloads via GPU node pools, autoscaling and cost-aware scheduling. Help standardize patterns for model serving, model versioning, rollout strategies and canary or shadow deployments for models. Ensure AI services have proper observability for latency, throughput, error rates and cost per request. Drive FinOps practices including compute and GPU cost optimization, right-sizing, and storage and network efficiency.
Qualifications
- 4+ years of hands-on experience in DevOps/Platform/SRE roles.
- Strong, production-level experience with GCP (Compute, IAM, VPC, Cloud Run, etc.).
- Deep experience with Docker and Kubernetes (GKE).
- Proven ownership of CI/CD pipelines (GitLab CI + ArgoCD) and release workflows.
- Experience with Infrastructure-as-Code (Terraform) and service mesh (Istio).
- Experience with OpenTelemetry and tools like SigNoz, Prometheus, Grafana, ELK, Loki, etc.
- Strong Linux fundamentals and scripting skills (Bash, Python).
- Experience operating production systems with uptime and latency requirements.
- Strong debugging, incident management and root cause analysis skills.
- Ability to work independently and make sound technical decisions.
Bonus
- Exposure to ML infrastructure, MLOps or LLMOps.
- Experience running GPU workloads or inference services.
- Familiarity with data platforms or workflow engines.
- Experience in regulated domains (healthcare, fintech, etc.).
Skills: devops, iaas, grafana, sre, gcp, docker, kubernetes, elk