Job description
Location: On-site, Chandigarh
Employment Type: Full-Time
Experience: 5+ Years
Department: Engineering
About SimplifyAI
SimplifyAI is a fast-growing AI-first startup building intelligent solutions across Cloud, Data, and Generative AI. We help enterprises automate workflows, unlock data potential, and accelerate digital transformation using cutting-edge AI. With a lean, high-ownership engineering culture and offices in Chandigarh, India and Jakarta, Indonesia, we move fast, think big, and build things that matter.
About the Role
We are seeking a highly skilled and self-driven DevOps Lead with a minimum of 5 years of hands-on experience to strengthen our Engineering team. In this role, you will own the full lifecycle of our cloud infrastructure — from provisioning and automation to monitoring and incident response. You will be a critical bridge between development and operations, ensuring our systems are resilient, secure, and ready to scale.
This is an on-site role requiring strong collaboration with cross-functional teams including backend engineers, QA, and security.
Key Responsibilities
Infrastructure & Cloud
Architect, provision, and manage production-grade infrastructure on AWS, GCP, or Azure
Design highly available and fault-tolerant systems using cloud-native services
Manage networking components: VPCs, subnets, route tables, security groups, NAT gateways, VPNs
Oversee DNS management, SSL/TLS certificates, load balancers, and CDN configurations
Drive cloud cost optimization initiatives and enforce resource governance policies
Automation & CI/CD
Build, maintain, and continuously improve CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or CircleCI)
Automate infrastructure provisioning and configuration using Terraform, Ansible, or Pulumi
Implement GitOps workflows and environment promotion strategies (dev → staging → production)
Automate repetitive operational tasks through scripting (Bash, Python, or Go)
Containers & Orchestration
Manage containerized workloads using Docker and Docker Compose
Administer Kubernetes clusters — deployments, services, ingress controllers, HPA, resource quotas, and RBAC
Manage Helm charts for standardized application packaging and deployment
Monitoring & Observability
Set up and maintain end-to-end observability using Prometheus, Grafana, Datadog, or equivalent
Implement structured log aggregation using ELK Stack or Loki + Grafana
Configure distributed tracing with OpenTelemetry, Jaeger, or Zipkin
Define SLOs/SLAs, alerting thresholds, and on-call escalation runbooks
Security & Compliance
Champion DevSecOps practices across the SDLC
Manage secrets using HashiCorp Vault, AWS Secrets Manager, or Doppler
Enforce network policies, pod security standards, and least-privilege IAM roles
Conduct regular vulnerability scanning (Trivy, Snyk) and coordinate remediation
Keep infrastructure aligned with security standards (working awareness of SOC 2 and ISO 27001)
Incident Management
Lead production incident response, perform thorough Root Cause Analysis (RCA), and drive post-mortems
Define and improve disaster recovery (DR) and business continuity plans
Establish and test backup and restore procedures for critical databases and services
Collaboration & Documentation
Work closely with developers to implement deployment strategies: blue/green, canary, and rolling updates
Maintain up-to-date runbooks, architecture diagrams, and infrastructure documentation
Mentor junior engineers on DevOps practices and cloud fundamentals
Required Skills & Qualifications
Cloud & Infrastructure
Proficiency in IaC tools: Terraform (mandatory), optionally alongside Ansible or Pulumi
Strong understanding of VPC design, multi-region architectures, and cloud networking
Familiarity with serverless (AWS Lambda / Cloud Functions) and managed services (RDS, ElastiCache, S3)
Monitoring & Observability
Hands-on experience with Prometheus + Grafana dashboards and alerting
Log management using ELK Stack or Loki
Ability to define meaningful SLIs, SLOs, and error budgets
Databases & Messaging
Production experience with PostgreSQL (replication, backup, query optimization)
Hands-on experience with Redis (clustering, persistence, eviction policies)
Familiarity with message brokers and task queues: RabbitMQ, Kafka, or Celery with Redis
Education
Bachelor's degree in Computer Science, Information Technology, or a related field
Relevant certifications preferred: AWS Solutions Architect, CKA (Certified Kubernetes Administrator), HashiCorp Terraform Associate
Soft Skills
Strong ownership mindset — treats production systems as a personal responsibility
Excellent analytical and root-cause-oriented problem-solving skills
Clear and concise communicator, both written and verbal
Comfortable with ambiguity and able to prioritize independently in a fast-paced environment
Team-first attitude with the ability to mentor and uplift peers
Disciplined about documentation and knowledge sharing
What We Offer
Competitive salary benchmarked to market standards
On-site work culture with a collaborative, high-performance engineering team
Access to the latest tooling, cloud credits, and hardware
Dedicated learning & development budget (certifications, courses, conferences)