Dear Applicants,
Greetings from Insightek Global!
We are looking for an SRE Architect to lead Site Reliability Engineering initiatives within our organization. Please find the detailed job description below.
Job Title: SRE Architect (Site Reliability Engineering)
Locations: Bangalore, Hyderabad, Pune, Chennai
Experience: 20+ Years of Overall IT Experience
Job Type: Full-Time
Position Overview:
We are seeking a highly experienced and technically adept SRE Architect to lead and drive Site Reliability Engineering initiatives within our organization. With 20+ years of overall IT experience, including hands-on experience in Software Development or SRE, the ideal candidate will possess strong leadership skills and expertise in automating and optimizing platform and product resiliency at the enterprise level. You will be responsible for ensuring the highest levels of availability, reliability, and performance across our systems, and will play a critical role in implementing SRE practices, metrics, and automation across our platforms.
Key Responsibilities:
- SRE Implementation & Automation:
- Lead the end-to-end implementation of Site Reliability Engineering practices, including the definition and application of SLI (Service Level Indicators), SLO (Service Level Objectives), SLA (Service Level Agreements), EB (Error Budgets), and MTTX (Mean Time to X).
- Focus on optimizing platform/product resiliency, availability, and operational efficiency.
- Lead MLOps, AIOps, and Chaos Engineering initiatives to enhance system performance and fault tolerance.
- Drive the automation mindset and apply best practices at the enterprise level.
- Continuously improve SRE maturity by incorporating industry best practices, tools, and standards into client ecosystems.
- Cloud and Infrastructure Management:
- Oversee the implementation and management of cloud technologies (AWS, Azure, GCP) and physical servers, ensuring their scalability, security, and reliability.
- Manage large-scale infrastructure and data center operations, supporting systems that handle more than 20K TPS (transactions per second).
- Maintain and manage applications and infrastructure using any cloud technology, ensuring high availability and fault tolerance.
- Observability & Monitoring:
- Configure and manage observability platforms such as Prometheus, Splunk, Grafana, Datadog, Alertmanager/PagerDuty, and the ELK stack.
- Build and configure customized metric exporters, dashboards, and monitoring systems for various infrastructure and application layers.
- Implement and manage observability configurations for WebLogic, Tomcat, JBoss, API Gateway Platform/Kong, and other critical systems.
- SRE Transformation & Toil Automation:
- Lead SRE transformation initiatives, focusing on reducing toil, automating operations, and improving overall system efficiency.
- Bring a proven track record of implementing SRE practices within large organizations, including legacy infrastructure transformations at banks and other financial institutions.
- Strategic Leadership & Communication:
- Take end-to-end ownership of projects, ensuring the delivery of high-quality services that adhere to SRE best practices.
- Define and track key performance metrics for team performance and project delivery.
- Present complex technical concepts to both technical and non-technical stakeholders in a clear and effective manner.
- Drive strategic goals and initiatives while managing relationships with key stakeholders and clients.
- Team Leadership & Collaboration:
- Lead a team of SRE professionals with diverse technical backgrounds, enabling and empowering them to succeed in their roles.
- Foster a collaborative and innovative environment that drives success and continuous improvement.
- Make tough decisions, motivate the team, and lead through challenges.
Skills & Competencies:
- Strong background in Software Development (Java, .NET, or Python) with deep automation experience using Python.
- Proven experience in leading SRE transformations and initiatives, especially in large-scale and enterprise environments.
- Expertise in cloud technologies (AWS, Azure, GCP) and managing infrastructure at scale.
- Hands-on experience with observability platforms such as Prometheus, Grafana, Splunk, Datadog, and the ELK stack.
- Experience with high-scale data center operations and physical server management.
- Strong understanding of digital engineering, DevOps practices, and cloud infrastructure.
- Ability to create compelling presentations and communicate complex ideas effectively.
- Strong problem-solving skills with the ability to think creatively and arrive at practical solutions.
Leadership/Soft Skills:
- Proven thought leader with the ability to lead cross-functional teams and deliver high-impact projects.
- Ability to define, articulate, and drive clear strategic goals and purpose for the team.
- Ability to motivate, inspire, and manage teams to achieve goals and exceed expectations.
- Exceptional verbal and written communication skills to handle global stakeholders effectively.
- A proactive approach to problem-solving, driving projects forward with excellent organizational and time management skills.
- Open to change and agile in adapting to new challenges and situations.
Good to Have Skills:
- Experience with MLOps, AIOps, or traditional IT support transformation.
- Certifications such as RHCE, AWS, Azure, GCP, Agile, or SRE.
- Familiarity with microservices principles and with monitoring tools such as Nagios, Zabbix, or New Relic.
- Understanding of service discovery, load balancing, and communication patterns within microservices.
- Flexibility to work in shifts (no night shifts), with strong communication skills for managing global stakeholders.