Site Reliability Engineering Manager
About Fynd:
Fynd is India’s largest omnichannel platform and a multi-platform tech company with expertise in retail tech and products in AI, ML, big data ops, gaming + crypto, image editing, and the learning space. It was founded in 2012 by three IIT Bombay alumni: Farooq Adam, Harsh Shah, and Sreeraman MG. We are headquartered in Mumbai, have 1000+ brands under management and more than 10k stores, and service 23k+ pin codes.
Role Overview:
As a Site Reliability Engineering Manager at Fynd, you will lead a team of Site Reliability Engineers to ensure the reliability, scalability, and performance of production systems. You will be responsible for establishing and evolving SRE practices, incident management, and automation, and for improving system efficiency. The ideal candidate will also lead efforts to improve system health and collaborate across teams to implement SRE best practices.
What will you do at Fynd?
- Lead, mentor, and manage a team of 10-30 Site Reliability Engineers, ensuring operational efficiency, system reliability, and scalability.
- Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure system performance and availability.
- Drive incident management processes by quickly mitigating production issues, leading post-incident reviews, and implementing long-term solutions.
- Automate repetitive tasks to reduce manual interventions, using scripting languages like Python or Go, and infrastructure automation tools like Terraform, Ansible, or CloudFormation.
- Ensure observability through best-in-class monitoring and alerting using tools such as Prometheus, Grafana, and New Relic, making sure all systems are adequately monitored.
- Collaborate closely with engineering, product, and platform teams to ensure reliable feature releases and system updates.
- Architect, implement, and manage scalable, highly available systems using technologies like Kubernetes, Docker, Kafka, and serverless computing (e.g., AWS Lambda).
- Conduct capacity planning, ensuring the infrastructure can scale effectively while optimizing for cost and performance.
- Prepare and lead Game Days and other reliability training sessions to ensure the team can handle real-world incident scenarios effectively.
- Drive continuous improvement in reliability processes, adopting the latest industry practices to align with agile methodologies.
Some Specific Requirements
- 7+ years of experience in Site Reliability Engineering, DevOps, or software engineering roles, with 3+ years in a leadership or Tech Lead role.
- Strong experience with Kubernetes, Docker, and serverless technologies (e.g., AWS Lambda) for managing large-scale, cloud-based infrastructure.
- Expertise in infrastructure as code (IaC) tools like Terraform, Ansible, or CloudFormation to automate deployments and manage cloud infrastructure.
- Proficiency in coding/scripting languages like Python or Go for building automation tools and scripts.
- Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, New Relic, or similar.
- Familiarity with real-time distributed systems like Kafka and gRPC.
- Strong understanding of cloud platforms such as AWS, Google Cloud Platform (GCP), or hybrid cloud environments.
- Basic knowledge of Linux environments (Red Hat, CentOS) and experience with performance tuning and troubleshooting in production.
- Solid understanding of SLI, SLO, and error budgeting practices for system reliability.
- Previous experience with incident management and driving root cause analysis, remediation, and prevention strategies.
What do we offer?
Growth
Growth knows no bounds, as we foster an environment that encourages creativity, embraces challenges, and cultivates a culture of continuous expansion. We are looking at new product lines, international markets and brilliant people to grow even further. We teach, groom and nurture our people to become leaders. You get to grow with a company that is growing exponentially.
Flex University: We help you upskill by organising in-house courses on important subjects.
Learning Wallet: You can also take an external course to upskill and grow, and we reimburse the cost.
Culture
Community and Team building activities
We host weekly, quarterly, and annual events and parties.
Wellness
Mediclaim policy for you + parents + spouse + kids
Access to an experienced therapist to support better mental health, productivity, and work-life balance
We work 5 days a week from the office, and we make sure people have everything they need:
Free meals
Snacks, goodies, and a fun culture
Please reach out to me at mangeshgaikwad@gofynd.com to share your profile for consideration.