Job Description
Qualifications
- The ideal candidate will have a strong background in production monitoring, a deep understanding of development and operations, and a proven track record in managing and scaling distributed systems in a public, private, or hybrid cloud environment for e-commerce / retail platforms.
- Understanding of SRE & DevOps principles, including monitoring, alerting, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements.
- Expertise in infrastructure as code (IaC), build automation, source control, and CI/CD tools (e.g., Terraform, CloudFormation, GitHub, Artifactory, Jenkins).
- Deep understanding of containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Splunk, New Relic, Sumo Logic) and incident response processes.
- Proficiency in modern Java, React, Node.js, and scripting languages such as Python and Bash.
- High-level understanding of the different layers of the tech stack and how they come together to provide a service (e.g., network, compute, storage, OS (Linux, Windows), supporting services, application layer).
- Knowledge of CDN technology.
Responsibilities
- Key measures of success will include platform stability, effective integration and delivery, instrumentation, release quality, technical debt (toil) reduction, development of automation, risk/security compliance, and sustained advancement of the SRE & DevOps practice.
- Design & implement scalable, automated, monitored, and well-documented systems to accelerate the development of the services running in the AWS cloud.
- Configure, tune, and fix multi-tiered systems to achieve optimal application performance, stability, and availability.
- Be part of an on-call rotation providing hands-on technical expertise during service-impacting events.
- Apply troubleshooting skills and debugging tools, and examine logs, telemetry, and other data sources to verify assumptions and assess customer impact. Lead blameless postmortems to identify root causes and improve production resiliency.