We are looking for an SRE-focused Engineer to join our DevOps team. This role is 80% Site Reliability Engineering and 20% DevOps enablement, with observability, resilience, and incident management at its core. You will lead on-call operations, build world-class observability systems, and drive reliability engineering practices across the organization. You will also collaborate on automation and CI/CD improvements to ensure services are built and operated for scale.
We are an engineering-focused team that continuously invests in tools, tests, processes, and technology. We consider our people our biggest asset and strive to build a culture of continuous learning and growth.
Responsibilities
- Lead SRE practices for reliability, scaling, and performance of production systems.
- Lead on-call operations and incident response, ensuring fast resolution and minimising customer impact.
- Perform deep debugging of production issues across infra, services, and databases.
- Design and automate self-healing, scalable infrastructure.
- Architect, implement, and mature advanced observability (metrics, logs, distributed tracing, SLIs/SLOs, APM) to detect, debug, and prevent outages.
- Support CI/CD and infra automation (Terraform, Kubernetes, pipelines) as the DevOps portion of the role (20%).
- Mentor junior engineers in incident management and DevOps best practices.
- Partner with engineering teams on resilient architecture reviews.
- Research and propose new tools and industry best practices to improve infrastructure reliability.
- Conduct blameless postmortems, improve incident playbooks, and drive a culture of prevention.
Requirements
- 5-8 years of experience in SRE / Production Engineering (with some DevOps exposure).
- Proven expertise in incident management, debugging distributed systems, and on-call operations.
- Strong background in observability platforms (Prometheus, Grafana, Datadog, OpenTelemetry, or similar).
- Deep knowledge of cloud infra (AWS/GCP), including networking, scaling, HA/DR.
- Hands-on experience with Kubernetes, Terraform, and CI/CD pipelines.
- Experience with incident frameworks, blameless postmortems, and chaos/resiliency testing.
- Ability to balance short-term firefighting with long-term reliability engineering.
- Strong scripting skills (Shell, Python, or Go preferred).
Must-Have Cultural Traits
- Commitment to fostering a culture of reliability through teamwork, blameless postmortems, continuous learning, and proactive risk management.
- Relentless focus on delivering 99.99999% uptime without compromising merchant trust or production stability.
- Passion for building and scaling high-impact infrastructure that supports GoKwik's global marketplace at unprecedented scale.
- Proactive risk identification and mitigation rather than reactive firefighting.
This job was posted by Nirvesh Mehrotra from GoKwik.