Job Title: Azure Site Reliability Engineer (SRE)
Location: Chennai / Hyderabad
Experience Level: 8+ Years
Job Summary:
We are seeking an experienced Azure Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of our Azure-based platforms and services. The ideal candidate will be responsible for designing and managing scalable infrastructure, driving automation, and keeping data analytics and AI solutions performing at a high level. This role involves working closely with cross-functional teams, implementing continuous improvement initiatives, and optimizing costs while maintaining secure and compliant environments.
Key Responsibilities:
Reliability & Performance: Ensure high availability, reliability, and performance of Azure platforms through proactive monitoring, alerting, and incident response strategies.
Infrastructure Management: Design, deploy, and manage fault-tolerant and scalable Azure infrastructure using Infrastructure as Code (IaC) tools like Terraform and Ansible for automated provisioning and configuration.
Data Analytics: Maintain high performance and availability of data pipelines and analytics platforms on Azure so that data processing and insight generation run smoothly.
Machine Learning & Generative AI: Leverage AIOps to maintain scalable, secure, and high-performing machine learning and generative AI systems on the Azure platform.
AKS Management: Architect, deploy, and manage Azure Kubernetes Service (AKS) clusters. Ensure optimal performance, scalability, and cost-efficiency while adhering to container orchestration best practices.
Automation & CI/CD: Develop automation workflows using Terraform and Ansible. Implement and maintain CI/CD pipelines with Azure DevOps to streamline deployment processes.
Monitoring & Observability: Implement comprehensive monitoring and observability solutions using Azure Monitor, Application Insights, and other tools. Analyze metrics, logs, and traces to identify and resolve performance bottlenecks.
Cost Optimization: Monitor and optimize Azure costs by implementing reservations, savings plans, and other cost-management strategies. Provide recommendations for resource optimization.
Capacity Planning: Perform capacity planning and forecasting to ensure that resources meet current and future demand. Implement scaling strategies for optimal resource utilization.
Security & Compliance: Ensure Azure environments comply with security best practices and regulatory standards. Conduct audits, address vulnerabilities, and ensure data protection and privacy.
Documentation & Knowledge Sharing: Create and maintain detailed documentation for operational processes, incident response, and infrastructure designs. Share knowledge and provide training to team members and stakeholders.
Collaboration & Stakeholder Engagement: Work closely with development teams, data scientists, and other stakeholders to deliver solutions that meet their needs. Communicate effectively on operational status, incidents, and improvements.
Continuous Improvement: Stay current with emerging Azure technologies and with trends in data analytics, machine learning, and AI, and contribute to the development of new tools, processes, and best practices.
Required Qualifications:
Technical Expertise: Extensive experience with Azure services, including Azure Data Analytics (e.g., Synapse Analytics, Data Lake, Data Factory, Power BI), Azure Machine Learning, Azure Cognitive Services, Generative AI, and Azure Kubernetes Service (AKS).
Experience: 8+ years of experience in site reliability engineering, cloud operations, or a related field, focusing on Azure technologies and high-availability systems.
Skills: Strong problem-solving and troubleshooting skills, with experience in incident management, performance optimization, and automation. Proficiency in scripting languages (e.g., PowerShell, Python) and CI/CD tools.
Infrastructure as Code (IaC): Expertise in Terraform and Ansible for infrastructure automation and deployment.
Certifications: Microsoft Certified: Azure Solutions Architect Expert (AZ-305) or an equivalent advanced certification is required. Additional certifications in Azure Data Engineering, Machine Learning, or DevOps are a plus.
Desired Attributes:
Operational Excellence: Demonstrated ability to maintain high standards of reliability and performance in complex cloud environments.
Cost Optimization: Experience in cost management strategies, including reservations and savings plans, to optimize Azure expenses.
Leadership: Ability to lead incident response efforts, mentor team members, and drive continuous improvement initiatives.
Customer Focus: Strong commitment to delivering high-quality, user-focused solutions that meet stakeholder needs.
Innovative Mindset: Ability to innovate in data analytics, AI, and container management by applying creative solutions to complex challenges.
Team Collaboration: Excellent interpersonal skills and the ability to work effectively with cross-functional teams, fostering a collaborative work environment.