Site Reliability Engineer

MakeMyTrip • Full-time • Gurugram, IN • 3w ago

The Site Reliability Team is responsible for monitoring all aspects of MakeMyTrip, including production servers and services. You will act as the first line of defense against any service unavailability or performance degradation in our production services, 24 x 7 x 365.

You will frequently interact with various groups within the organization, such as Engineering, Sales, and Product, so a good all-around understanding of components, systems, and networks is a must.

We don't expect you to have all the required knowledge when you join us, as many of these skills can be picked up through experience on the job; however, those who want to gain new skills and grow must be prepared to spend time on research and learning. You must be an eager and quick learner with decent communication skills, and you must be able to use your initiative to tackle a broad range of problems.

Responsibilities:

Understand the application architectures and gain domain knowledge of how requests flow within the ecosystem.
Configure alerts and ensure metric coverage at the business, application, and system levels.
Keep false alerts in check by tuning thresholds and setting up dependencies amongst applications.
React to alerts by correlating them, perform first-level debugging to identify the root-cause area of the incident, then escalate problems to the appropriate team until resolution.
Actively participate in incident post-mortems and triage incidents within the team.
Troubleshoot application problems such as unhealthy application containers, high load/CPU, and non-200 response codes using log analysis.
Adhere to defined processes and be ready for ad hoc and surprise incidents.
Help your coworkers by creating documentation and sharing detailed knowledge for continuous improvement.
Demonstrate strong communication skills and clarity in reporting.

Requirements:

2+ years of relevant experience in a 24x7 Linux production environment.
Experience in monitoring, troubleshooting application problems, and incident management.
Proficiency in Linux commands that help slice and dice data, like grep, awk, top, and scp, is a must-have.
Experience in an AWS-based Dockerized environment is a huge plus.
Knowledge of SQL queries, such as SELECT, INSERT, WHERE, GROUP BY, ORDER BY, and basic JOINs, is required.
Hands-on experience in debugging, like finding errors/exceptions in logs and taking heap/thread dumps, is a plus.
Bring ideas to improve the overall efficiency of the NOC team.