Senior Site Reliability Engineer / Database Reliability Engineer (NoSQL Database Team)
About the Team
The NoSQL Database Team is responsible for building, operating, and continuously improving highly available, mission-critical distributed NoSQL database services. The team focuses on reliability, scalability, performance, operational excellence, and observability while supporting enterprise-scale production workloads.
Job Summary
We are seeking an experienced Senior Site Reliability Engineer / Database Reliability Engineer to join our NoSQL Database team. This role is responsible for ensuring the reliability, availability, and performance of large-scale distributed database systems.
The ideal candidate has deep expertise in distributed systems, networking, operating systems, production troubleshooting, and root cause analysis, and a passion for driving operational excellence through automation, monitoring, and observability.
Key Responsibilities
- Operate, maintain, and improve mission-critical, large-scale distributed NoSQL database services.
- Troubleshoot complex production issues involving distributed systems, networking, operating systems, and database infrastructure.
- Perform detailed root cause analysis (RCA) for production incidents and drive preventive improvements.
- Design, implement, and continuously enhance monitoring, alerting, dashboards, and operational metrics to improve service health and availability.
- Collaborate with software engineering, infrastructure, and platform teams to improve reliability, scalability, and performance.
- Identify recurring operational issues and implement long-term reliability improvements through automation and engineering best practices.
- Participate in production support, incident response, and on-call rotations as required.
- Contribute to operational documentation, runbooks, and post-incident reviews to improve operational readiness.
Required Qualifications
- At least 8 years of experience supporting mission-critical, large-scale production systems.
- Hands-on experience working with complex distributed systems.
- Strong expertise in troubleshooting networking and Linux/Unix operating system issues.
- Proven experience conducting root cause analysis (RCA) for complex production incidents.
- Demonstrated ability to identify gaps in monitoring, alerting, and operational metrics, and to implement and improve them.
- Experience supporting or operating distributed NoSQL database platforms or similar large-scale data infrastructure.
- Strong analytical, troubleshooting, and problem-solving skills.
- Excellent communication and cross-functional collaboration skills.
Preferred Qualifications
- Experience with distributed NoSQL databases such as Cassandra, MongoDB, Couchbase, ScyllaDB, HBase, or DynamoDB.
- Experience with cloud platforms such as Oracle Cloud Infrastructure (OCI), AWS, Azure, or Google Cloud Platform (GCP).
- Experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, Splunk, or OpenTelemetry.
- Experience with Kubernetes, containers, and cloud-native infrastructure.
- Proficiency in scripting or automation using Python, Bash, or similar languages.
- Familiarity with Site Reliability Engineering (SRE) principles, automation, and incident management practices.