As Manager/Sr Manager/Leader of Site Reliability Engineering (SRE), you would be responsible for leading a team of SREs in designing, building, and maintaining reliable, modern, large-scale cloud-based infrastructure. This role involves optimizing system performance, ensuring high availability, and enhancing the security of the cloud environments. As a leader, you would work closely with development and operations teams to drive improvements in operational efficiency and establish best practices for cloud infrastructure management. The ideal candidate has extensive experience in building and operationalizing large-scale infrastructure (Kubernetes, Kafka, data systems, cloud, etc.), strong leadership skills, and a deep understanding of modern SRE principles.
On the SRE team, you'll have the opportunity to manage the complex challenges of scale and fast growth, which are unique to Traceable, while using your expertise in coding, algorithms, problem-solving, and SRE practices. We keep Traceable applications up and running, ensuring our customers have the best and most reliable experience possible.
Responsibilities
- Ensure the reliability of cloud-based distributed systems infrastructure and services built to scale seamlessly to tens of billions of events per day.
- Own the availability, performance, monitoring, emergency response, and capacity planning of the Traceable cloud services and infrastructure.
- Build and maintain modern infrastructure for CI/CD and DevOps.
- Debug and resolve production issues and escalations, working with the rest of the engineering team.
- Collaborate with product engineering teams across time zones on the design and operations of systems and services.
- Lead, mentor, and manage a team of Site Reliability Engineers to ensure optimal performance and career growth.
- Establish team goals and objectives aligned with the company's strategic vision. Foster a culture of continuous improvement, collaboration, and innovation within the SRE team.
Requirements
- Bachelor's or Master's degree in computer science.
- 8+ years of work experience in SRE and DevOps with a modern cloud-native tech stack and distributed systems at massive scale.
- Strong experience with cloud-native technologies (AWS/GCP, microservices, containers, Kubernetes, etc.) at scale.
- Strong experience in streaming systems like Kafka Streams or Flink.
- Hands-on experience in setting up, automating, and continuously improving the deployment pipelines and CI/CD infrastructure.
- Strong experience with Linux systems.
- Strong experience in operationalizing and scaling modern data systems like MongoDB, Apache Pinot, Trino, Spark, Apache Iceberg, and Kafka Streams.
- Strong experience in infrastructure deployment/provisioning as code using modern tools (Terraform, Helm, Ansible, etc.).
- Good expertise in Java and scripting.
- Strong troubleshooting and debugging skills for production issues and escalations.
- Experience working in a distributed team across different time zones.
- A self-starter with the ability to work effectively in teams and in a fast-paced startup environment.
- Excellent spoken and written communication skills.
This job was posted by Bablu Kumar Mahato from Traceable AI.