Senior DevOps Engineer Lead

Amura Health • Full-time • Chennai, IN • ₹ 3,000,000 - ₹ 5,500,000 / year • 1d ago

Amura’s Vision

We believe that the most under-appreciated route to releasing untapped human potential is to build a healthier body and, through it, a better brain. This allows us to do more of everything that is important to each one of us.

Billions of healthier brains, sitting in healthier bodies, can take up more complex problems that defy solutions today, including many existential threats, and solve them in just a few decades.

Billions of healthier brains will make the world richer beyond what we can imagine today. The surplus wealth, combined with better human capabilities, will lead us to a new renaissance, giving us a richer and more beautiful culture.

These healthier brains will be equipped with deeper intellect, be less acrimonious and more magnanimous, and have a kinder outlook on the world, resulting in a world that is better than at any previous time.

We find this vision of the future exhilarating. Our hopes and dreams are to create this future as quickly as possible and ensure that it is widely distributed and optimized to maximize all forms of human excellence.

Role Overview

We are looking for a highly skilled Senior DevOps Engineer (AI-Native Infrastructure & Platform Engineering) with deep expertise in AWS cloud infrastructure, automation, AI infrastructure operations, and modern DevOps/SRE practices.

This role goes beyond traditional DevOps and requires a seasoned specialist capable of building and operating AI-ready infrastructure platforms that support high-throughput APIs, LLM/AI workloads, GPU-based compute, data-intensive systems, real-time inference pipelines, and scalable ML platforms.

You will be responsible for architecting, automating, securing, and optimizing highly scalable and cost-efficient cloud environments that enable high-velocity engineering and AI teams. This is an ideal position for someone who combines technical ownership, an automation-first mindset, and a passion for developer productivity and platform reliability.

Key Responsibilities

Cloud Infrastructure & Platform Engineering (AWS)

Architect, deploy, and manage highly scalable and secure infrastructure on AWS. Design cloud platforms supporting AI/ML workloads, data pipelines, real-time APIs, and high-concurrency backend systems.
Work hands-on with key AWS services including EC2, ECS/EKS, Lambda, RDS, DynamoDB, S3, VPC, CloudFront, IAM, CloudWatch, and GPU-enabled instances.
Build and maintain Infrastructure-as-Code (IaC) using Terraform, CloudFormation, or AWS CDK.
Design multi-AZ and multi-region architectures for high availability and disaster recovery (HA/DR).
Build reusable platform templates and shared infrastructure modules.

AI/ML Infrastructure & MLOps

Build and maintain infrastructure for LLM applications, AI inference workloads, model serving platforms, vector databases, and feature stores.
Support GPU-based workloads and optimize compute/storage usage.
Enable scalable deployment patterns for AI applications using Kubernetes/EKS.
Collaborate with Data Science and ML Engineering teams on model deployment, model training and tuning, CI/CD for ML systems, experiment environments, and reproducibility.
Support orchestration and deployment of AI workflows and inference services while implementing observability and reliability for AI pipelines.

CI/CD, Automation & Developer Productivity

Build and maintain CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, or AWS CodePipeline.
Automate deployments, environment provisioning, and release workflows.
Build self-service developer platforms, preview environments, and reusable deployment workflows to improve developer productivity.
Implement automated patching, scaling, backups, cleanup workflows, and drift detection.

Containers, Kubernetes & Platform Reliability

Manage Docker-based environments and containerized applications, and optimize workloads using Kubernetes (EKS) or ECS/Fargate.
Manage autoscaling, cluster health, node pools, ingress, service mesh, and workload isolation.
Optimize infrastructure for performance, resilience, and cost-efficiency.
Implement progressive deployment strategies including blue/green, canary, and rolling deployments.

Observability, Incident Response & SRE Practices

Implement observability stacks using CloudWatch, Prometheus, Grafana, ELK, Datadog, OpenTelemetry, or New Relic.
Build actionable dashboards and intelligent alerting systems while defining and tracking SLIs, SLOs, and SLAs.
Lead incident response, root cause analysis, and blameless postmortems to reduce operational toil and improve MTTR.

FinOps, Cost Governance & Security

Continuously monitor and optimize cloud costs (compute utilization, storage lifecycle, GPU usage, and data transfer) using AWS Cost Explorer, Budgets, Trusted Advisor, CloudHealth, or Kubecost.
Implement AWS security best practices for IAM, VPCs, security groups, NACLs, and encryption, and manage secrets using KMS, SSM Parameter Store, or Vault.
Build secure CI/CD pipelines with automated security checks, least-privilege access, and audit logging, and ensure compliance readiness for ISO 27001, SOC 2, and GDPR.

Collaboration, Leadership & Platform Culture

Work closely with engineering, AI/ML, QA, product, and operations teams to drive a DevOps, SRE, GitOps, and automation-first culture.
Mentor junior DevOps and Platform Engineers while creating and maintaining detailed runbooks, architecture diagrams, and platform documentation.

Skills & Qualifications

Must-Have:

7+ years of experience in DevOps, SRE, Platform Engineering, or Cloud Infrastructure Engineering.
Strong expertise in AWS cloud architecture and services, and a deep understanding of Kubernetes (EKS), containers, and cloud-native systems.
Strong Infrastructure-as-Code expertise using Terraform, CloudFormation, or CDK.
Strong Linux administration, networking, DNS, routing, and load balancing knowledge.
Strong scripting/programming experience in Python, Bash, or Go (preferred).
Experience with CI/CD automation, GitOps workflows, and observability platforms supporting scalable production systems.

Preferred / Nice-to-Have:

Experience with AI/ML infrastructure, MLOps, model serving, vector databases, GPU orchestration, and inference optimization.
Familiarity with Kafka, Redis, SQS, and event-driven systems.
Exposure to platform engineering, internal developer platforms, and tools like ArgoCD, Flux, Helm, and OpenTelemetry.
AWS Certifications: Solutions Architect, DevOps Engineer, or SysOps Administrator.
Knowledge of distributed systems and large-scale platform operations.

Here are answers to some questions you may have

Where is your office?

Chennai (Velachery)

Work Model

Work from Office – because great stories are built in person!

Do you have an online presence?

https://amura.ai (we are @AmuraHealth on all social media)

Skills: DevOps, Platform as a Service (PaaS), Kubernetes, Machine Learning (ML), Artificial Intelligence (AI), Graphics Processing Unit (GPU), MLOps, vLLM, Linux administration, and Python