Requirements
Qualifications
- 4–9 years in SRE/DevOps/Systems Engineering as a Senior or Principal Engineer.
- Strong hands-on experience with Kubernetes, container orchestration, and API management.
- Working knowledge of WAFs, network security, and database technologies (SQL/NoSQL).
- Proficient in automation and scripting (Python, Go, Ansible, Terraform, etc.).
- Strong observability/monitoring experience.
- Experience with CI/CD pipelines, GitOps, and infrastructure as code.
- Solid problem-solving and collaboration skills.
Job responsibilities
- Resolve escalated incidents across Kubernetes, API proxy, WAF, database, and infrastructure platforms.
- Design and improve runbooks, automating manual steps wherever possible.
- Lead and contribute to building self-healing systems and self-service tooling for users.
- Analyze incident trends and propose improvements in monitoring, capacity, and reliability.
- Collaborate with engineering teams on deployment, upgrades, and performance optimization.
- Conduct postmortems, document root cause analyses (RCAs), and ensure lessons learned are captured.
- Mentor and coach L1 engineers.
Skills
Mandatory Skills (Must-Have)
1. Advanced Incident Troubleshooting & Resolution
Expectation: Diagnose and resolve escalated incidents that L1 cannot handle, often across multiple layers (infrastructure, application, network).
Example: For an API outage, identify whether the root cause is in Kubernetes pod networking, an API gateway misconfiguration, or backend DB latency, then apply the fix.
2. Kubernetes & Container Orchestration Expertise
Expectation: Comfortable with deployments, scaling, networking, and debugging cluster-level issues.
Example: Troubleshoot why pods are pending by checking node capacity, taints/tolerations, and cluster autoscaler logs.
3. Automation & Scripting (Python, Go, Bash, Ansible, Terraform)
Expectation: Write scripts and automation to reduce manual toil, enhance monitoring, and improve incident resolution speed.
Example: Develop a Python script to automatically collect pod and system logs when a service crashes.
4. Observability & Monitoring Tooling
Expectation: Deep understanding of monitoring, alerting, tracing, and logging systems.
Example: Build Prometheus alert rules to detect DB query spikes; configure Grafana dashboards for API latency.
5. CI/CD & Infrastructure as Code (IaC)
Expectation: Familiarity with GitOps workflows, CI/CD pipelines, and infrastructure provisioning.
Example: Enhance a Jenkins pipeline to add automated smoke tests before promoting Kubernetes deployments.
6. Database Troubleshooting (SQL & NoSQL)
Expectation: Identify performance bottlenecks, connection issues, and basic tuning opportunities.
Example: Run queries to detect slow-running SQL statements causing latency in an application.
7. Incident Management & RCA
Expectation: Act as incident commander for escalated issues, lead bridge calls, and produce Root Cause Analyses.
Example: After a WAF misconfiguration causes downtime, lead the investigation, document the timeline, and propose preventive actions.
8. Mentorship & Runbook Improvement
Expectation: Coach L1 engineers, refine runbooks, and introduce new automated workflows.
Example: Update a runbook to add automated Kubernetes log collection instead of manual steps.
Preferred Skills (Nice-to-Have)
1. Cloud Platform Engineering (AWS, Azure, GCP)
Expectation: Hands-on skills in provisioning, scaling, and securing cloud workloads.
Example: Diagnose why an AWS ALB is misrouting traffic after a deployment.
2. Security & WAF Management
Expectation: Understand WAF rules, common attacks (SQL injection, XSS), and how to apply fixes.
Example: Investigate false positives in WAF logs and adjust rule sets with security teams.
3. Capacity & Performance Engineering
Expectation: Anticipate scaling needs, tune resource utilization, and propose optimizations.
Example: Identify that a Kubernetes deployment is CPU-throttled and adjust HPA (Horizontal Pod Autoscaler) configs.
4. Automation Platform Integration (AIOps, ChatOps)
Expectation: Integrate AI/ML-powered tools for anomaly detection and auto-remediation.
Example: Implement a ChatOps bot that runs predefined Kubernetes troubleshooting commands in Slack.
5. Cross-Platform Expertise (Hybrid Infra)
Expectation: Experience supporting both on-prem and cloud environments seamlessly.
Example: Compare latency patterns between on-prem DBs and cloud-hosted APIs to identify bottlenecks.