Senior Site Reliability Engineer (Azure Platform)

UST • Full-time • Bengaluru, IN • 2d ago

Role Description

We are seeking an experienced Senior Site Reliability Engineer (SRE) with 7-13 years of overall IT experience and a strong focus on Cloud, DevOps, and Site Reliability Engineering to design, build, operate, and support highly available, scalable, and resilient systems on the Microsoft Azure platform. The ideal candidate will bring deep expertise in Azure cloud services, Azure Kubernetes Service (AKS), CI/CD automation, infrastructure as code, and observability, while applying SRE best practices to reduce operational toil and improve system reliability.

This role requires close collaboration with platform, application, and security teams to ensure operational excellence across cloud-native and containerized workloads.

Key Responsibilities

Own reliability, performance, scalability, and availability of Azure-based systems

Apply SRE principles including SLIs, SLOs, error budgets, incident management, and postmortems

Develop and manage Infrastructure as Code (IaC) using Terraform

Build and maintain CI/CD pipelines using Azure DevOps for infrastructure and applications

Automate operational workflows using Azure Logic Apps, Azure Automation Runbooks, and Automation Jobs

Design, run, and monitor batch jobs and scheduled workloads, including:

Kubernetes Jobs

Kubernetes CronJobs

Perform cloud operations, troubleshooting, and automation using Azure CLI

Implement monitoring, alerting, and observability using Dynatrace

Automate configuration management and routine operational tasks using Ansible, PowerShell, and Python

Lead incident response, root cause analysis (RCA), and continuous reliability improvements

Collaborate with development teams supporting .NET and Python applications running on Azure and AKS

Continuously reduce manual effort (toil) through automation and self-healing mechanisms

Required Technical Skills

Cloud Platform (Azure)

Strong hands-on experience with Microsoft Azure

Azure Compute: Virtual Machines, AKS, App Services, Functions

Azure Storage: Blob Storage, Azure Files, Managed Disks

Azure Networking: VNets, Subnets, NSGs, Load Balancers, Application Gateway

Containers & Kubernetes

Azure Kubernetes Service (AKS)

Kubernetes core concepts (deployments, services, ingress, RBAC)

Kubernetes Batch Jobs and CronJobs

AKS scaling, upgrades, node pools, and production troubleshooting

Automation & Infrastructure

Terraform (Azure provider, modules, state management)

Ansible for automation and configuration management

Azure Automation Runbooks & Jobs

Azure Logic Apps for workflow orchestration

CI/CD & DevOps

Azure DevOps (pipelines, repos, releases)

CI/CD for infrastructure, AKS workloads, batch jobs, and cron jobs

Integration with Azure CLI, Terraform, and automation scripts

Observability

Dynatrace (APM, infrastructure monitoring, dashboards, alerting)

Azure Monitor / App Insights

Scripting & Development

Azure CLI

PowerShell

Python

.NET application troubleshooting and operational support

Soft Skills

Strong analytical and problem-solving abilities

Ownership mindset with a focus on reliability and quality

Ability to lead production incidents calmly and effectively

Willingness to participate in an on-call rotation

Excellent communication and cross-team collaboration skills

Willingness to mentor junior engineers and set best practices

Skills

Azure DevOps, Azure Kubernetes Service (AKS), Dynatrace, Terraform, Azure Monitor, PowerShell, Azure CLI, Azure Logic Apps