Role Description
We are seeking an experienced Senior Site Reliability Engineer (SRE) with 7-13 years of overall IT experience and a strong focus on Cloud, DevOps, and Site Reliability Engineering to design, build, operate, and support highly available, scalable, and resilient systems on the Microsoft Azure platform. The ideal candidate will bring deep expertise in Azure cloud services, Azure Kubernetes Service (AKS), CI/CD automation, infrastructure as code, and observability, while applying SRE best practices to reduce operational toil and improve system reliability.
This role requires close collaboration with platform, application, and security teams to ensure operational excellence across cloud-native and containerized workloads.
Key Responsibilities
Own reliability, performance, scalability, and availability of Azure-based systems
Apply SRE principles including SLIs, SLOs, error budgets, incident management, and postmortems
Develop and manage Infrastructure as Code (IaC) using Terraform
Build and maintain CI/CD pipelines using Azure DevOps for infrastructure and applications
Automate operational workflows using Azure Logic Apps, Azure Automation Runbooks, and Automation Jobs
Design, run, and monitor batch jobs and scheduled workloads, including:
Kubernetes Jobs
Kubernetes CronJobs
Perform cloud operations, troubleshooting, and automation using Azure CLI
Implement monitoring, alerting, and observability using Dynatrace
Automate configuration management and routine operational tasks using Ansible, PowerShell, and Python
Lead incident response, root cause analysis (RCA), and continuous reliability improvements
Collaborate with development teams supporting .NET and Python applications running on Azure and AKS
Continuously reduce manual effort (toil) through automation and self-healing mechanisms
Required Technical Skills
Cloud Platform (Azure)
Strong hands-on experience with Microsoft Azure
Azure Compute: Virtual Machines, AKS, App Services, Functions
Azure Storage: Blob Storage, Azure Files, Managed Disks
Azure Networking: VNets, Subnets, NSGs, Load Balancers, Application Gateway
Containers & Kubernetes
Azure Kubernetes Service (AKS)
Kubernetes core concepts (deployments, services, ingress, RBAC)
Kubernetes Batch Jobs and CronJobs
AKS scaling, upgrades, node pools, and production troubleshooting
Automation & Infrastructure
Terraform (Azure provider, modules, state management)
Ansible for automation and configuration management
Azure Automation Runbooks & Jobs
Azure Logic Apps for workflow orchestration
CI/CD & DevOps
Azure DevOps (pipelines, repos, releases)
CI/CD for infrastructure, AKS workloads, batch jobs, and cron jobs
Integration with Azure CLI, Terraform, and automation scripts
Observability
Dynatrace (APM, infrastructure monitoring, dashboards, alerting)
Azure Monitor / App Insights
Scripting & Development
Azure CLI
PowerShell
Python
.NET application troubleshooting and operational support
Soft Skills
Strong analytical and problem-solving abilities
Ownership mindset with a focus on reliability and quality
Ability to lead production incidents calmly and effectively
Willingness to participate in an on-call rotation
Excellent communication and cross-team collaboration skills
Willingness to mentor junior engineers and set best practices
Skills
Azure DevOps, Azure Kubernetes Service (AKS), Dynatrace, Terraform, Azure Monitor, PowerShell, Azure CLI, Azure Logic Apps