DevOps Engineer

Oracle • Full-time • Bengaluru, IN • 3w ago

Senior Site Reliability Engineer / Database Reliability Engineer (NoSQL Database Team)

About the Team

The NoSQL Database Team is responsible for building, operating, and continuously improving highly available, mission-critical distributed NoSQL database services. The team focuses on reliability, scalability, performance, operational excellence, and observability while supporting enterprise-scale production workloads.

Job Summary

We are seeking an experienced Senior Site Reliability Engineer / Database Reliability Engineer to join our NoSQL Database team. In this role, you will be responsible for ensuring the reliability, availability, and performance of large-scale distributed database systems.

The ideal candidate has deep expertise in distributed systems, networking, operating systems, production troubleshooting, and root cause analysis, with a passion for improving operational excellence through automation, monitoring, and observability.

Key Responsibilities

Operate, maintain, and improve mission-critical, large-scale distributed NoSQL database services.
Troubleshoot complex production issues involving distributed systems, networking, operating systems, and database infrastructure.
Perform detailed root cause analysis (RCA) for production incidents and drive preventive improvements.
Design, implement, and continuously enhance monitoring, alerting, dashboards, and operational metrics to improve service health and availability.
Collaborate with software engineering, infrastructure, and platform teams to improve reliability, scalability, and performance.
Identify recurring operational issues and implement long-term reliability improvements through automation and engineering best practices.
Participate in production support, incident response, and on-call rotations as required.
Contribute to operational documentation, runbooks, and post-incident reviews to improve operational readiness.

Required Qualifications

A minimum of 8 years of experience supporting mission-critical, large-scale production systems.
Hands-on experience working with complex distributed systems.
Strong expertise in troubleshooting networking and Linux/Unix operating system issues.
Proven experience conducting root cause analysis (RCA) for complex production incidents.
Demonstrated ability to design, implement, and improve monitoring, alerting, and operational metrics.
Experience supporting or operating distributed NoSQL database platforms or similar large-scale data infrastructure.
Strong analytical, troubleshooting, and problem-solving skills.
Excellent communication and cross-functional collaboration skills.

Preferred Qualifications

Experience with distributed NoSQL databases such as Cassandra, MongoDB, Couchbase, ScyllaDB, HBase, or DynamoDB.
Experience with cloud platforms such as Oracle Cloud Infrastructure (OCI), AWS, Azure, or Google Cloud Platform (GCP).
Experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, Splunk, or OpenTelemetry.
Experience with Kubernetes, containers, and cloud-native infrastructure.
Proficiency in scripting or automation using Python, Bash, or similar languages.
Familiarity with Site Reliability Engineering (SRE) principles, automation, and incident management practices.