Service Reliability Operations Manager

Kore.ai • Full-time • Hyderabad, IN

Kore.ai is a pioneering force in enterprise AI transformation, empowering organisations through our comprehensive agentic AI platform. With innovative offerings across "AI for Service," "AI for Work," and "AI for Process," we're enabling more than 400 Global 2000 companies to fundamentally reimagine their operations, customer experiences and employee productivity.

Our end-to-end platform enables enterprises to build, deploy, manage, monitor, and continuously improve agentic applications at scale. We automate over 1 billion interactions every year with voice and digital AI in customer service, and have transformed the work experience of tens of thousands of employees through productivity and AI-driven workflow automation.

Recognised as a leader by Gartner, Forrester, IDC, ISG, and Everest, Kore.ai has secured Series D funding of $150M, including strategic investment from NVIDIA to drive Enterprise AI innovation. Founded in 2014 and headquartered in Florida, we maintain a global presence with offices in India, the UK, Germany, Korea, and Japan.

Position / Title

Service Reliability Operations Manager

Location: Hyderabad

Experience: 8–15 years

Position Summary

We are looking for a Service Reliability Operations Manager who will own the reliability, availability, and performance of our production systems while driving operational excellence across the engineering organisation. This role sits at the intersection of SRE, DevOps, and release engineering: you will be hands-on with CI/CD pipelines, lead incident response, champion observability, and mentor a growing team of engineers. The ideal candidate combines deep technical expertise in cloud-native infrastructure with strong leadership instincts and a bias for execution. You will partner closely with development, security and platform teams to build resilient systems and a culture of continuous improvement.

Responsibilities

Release & Deployment Operations

Own and optimise end-to-end release pipelines across CI/CD toolchains (Harness, ArgoCD, Jenkins, GitHub Actions, or similar).
Define and enforce deployment standards, including blue-green, canary, and rolling deployment strategies for Kubernetes-based workloads.
Collaborate with development teams to reduce deployment cycle time while maintaining quality gates (SAST, DAST, image scanning).
Manage release coordination across environments (dev, staging, production) with clear runbooks and rollback procedures.

Site Reliability Engineering

Establish and track SLOs, SLIs, and error budgets for critical services; drive reliability improvements based on data.
Lead incident management and post-incident review processes; build a blameless retrospective culture.
Design and implement chaos engineering practices to proactively identify failure modes.
Automate toil reduction through infrastructure-as-code (Terraform, Crossplane) and self-healing system patterns.

Observability & Monitoring

Architect and maintain the observability stack covering metrics, logs, traces, and profiling across distributed systems.
Define alerting strategies that minimise noise and maximise actionable signals; manage PagerDuty escalation policies and on-call rotations.
Evaluate and integrate modern observability tooling, including eBPF-based solutions (e.g., GroundCover, Coralogix, Coroot) for deep kernel-level visibility.

Cloud Infrastructure

Operate and optimise workloads across a multi-cloud environment spanning AWS (EKS), Azure (AKS), and GCP (GKE).
Drive cloud cost optimisation initiatives and contribute to infrastructure budgeting and forecasting.
Ensure infrastructure security posture through policy-as-code (OPA Gatekeeper), network policies, and secrets management.

Team Leadership & Mentorship

Mentor and coach junior and mid-level SRE/DevOps engineers; conduct regular one-on-ones and provide career growth guidance.
Establish best practices, standards, and documentation for operational workflows.
Foster a collaborative, learning-oriented team culture with knowledge sharing sessions and technical brown bags.

Required Skills & Experience

Minimum of 8 years of experience in SRE, DevOps, or Production/Platform Engineering roles, with progressive responsibility.
Strong hands-on experience with CI/CD platforms — Harness, ArgoCD, Jenkins, GitLab CI or GitHub Actions.
Deep understanding of SRE principles: SLOs/SLIs, error budgets, incident management, capacity planning and reliability patterns.
Production Kubernetes experience (EKS, AKS, GKE), including Helm, service mesh (Istio), and autoscaling (Karpenter/KEDA).
Proficiency across AWS, Azure, and GCP — networking, IAM, compute, and managed services.
Strong scripting and automation skills (Python, Bash, Go preferred).
Proven experience with infrastructure-as-code tools (Terraform, Crossplane, Pulumi).
Excellent understanding of observability stacks — Prometheus, Grafana, OpenTelemetry, ELK/Loki, Datadog, or equivalent.
Demonstrated ability to mentor engineers and lead operational teams.
Experience managing on-call rotations and incident escalation using PagerDuty or similar tools.

Nice To Have

Hands-on experience with eBPF-based observability or networking tools (GroundCover, Coralogix, Lightstep, Coroot).
Familiarity with developer portal platforms (Port, Backstage) and Internal Developer Platform (IDP) concepts.
Experience with chaos engineering frameworks (Litmus, Gremlin, Chaos Monkey).
Knowledge of security scanning pipelines — Trivy, Semgrep, Snyk or similar.
Background in container supply chain security (distroless images, SBOM generation, image signing).
Cloud cost management experience with FinOps practices and tooling.
Relevant certifications: CKA/CKS, AWS Solutions Architect, Azure DevOps Engineer, GCP Professional Cloud Architect.

Education Qualification

Graduate in Engineering or Master's in Computer Applications.