10+ years of experience defining and implementing monitoring solutions (alerts, telemetry, and instrumentation) for on-premises and cloud platforms at large enterprises
Responsibilities
The Site Reliability Engineer (SRE) will play a key role in building observability and resilience capabilities on the cloud platform (Azure). The SRE will:
Build and configure alerts, tracing, telemetry, and instrumentation required for Infrastructure Monitoring and Application Performance Management.
Implement dashboards to monitor and share observability data at various levels (engineering teams, portfolio, senior management).
Support resilience engineering (application and infrastructure resilience) to meet availability requirements.
Work with development engineers, cloud engineers, product teams, and support engineers to gather requirements, implement, and evolve observability and resilience solutions.
Key Skillsets
Extensive knowledge of observability and Application Performance Monitoring best practices and KPIs/metrics on cloud platforms
Experience with the monitoring tools Dynatrace and Splunk
Experience with incident resolution (on-call support) and with troubleshooting application errors and performance using Dynatrace and Splunk to assist application teams with root cause analysis
Experience working with SLOs and error budgets; understanding of SLA/SLI/SLO
Expertise with Splunk Search Processing Language (SPL)
Experience building monitoring solutions for container-based workloads (Java / Spring Boot desirable), databases, Kafka, and Kubernetes
Experience in resilience engineering and implementing high-availability solutions
Experience creating Monitoring dashboards using Dynatrace and Splunk
Ability to work in a fast-paced, agile environment
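For candidates unfamiliar with the SLO and error budget item above, the arithmetic can be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: the function names, the 30-day default window, and the event-based budget model are all assumptions made for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction such as 0.999; window_days is an assumed default.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget that is still unspent.

    1.0 means untouched, 0.0 means fully spent, and a negative value
    means the SLO has been breached for the measured events.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events  # budgeted failures
    actual_bad = total_events - good_events          # observed failures
    return 1.0 - actual_bad / allowed_bad
```

Under these assumptions, a 99.9% SLO over 30 days allows about 43.2 minutes of downtime, and 500 failed requests out of 1,000,000 would consume roughly half of the budget.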
SRE Maturity Level 3 (Expectation)
DevOps Observability
DORA metrics are visible.
Deployment frequency, Mean Time To Restore (MTTR), Cycle time, Change failure rate
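As a rough illustration of what making these metrics visible could involve, the sketch below derives three of them from a list of deployment records. The `Deployment` record and its fields are hypothetical, restore time is approximated as the gap between a failed deployment and its recovery, and cycle time is left out because it needs commit or ticket timestamps that this example does not model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; the field names are assumptions."""
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def deployment_frequency(deploys: List[Deployment], window_days: int) -> float:
    """Average number of deployments per day over the window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deploys) / window_days


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Share of deployments that were marked as failed."""
    if not deploys:
        return 0.0
    return sum(1 for d in deploys if d.failed) / len(deploys)


def mean_time_to_restore(deploys: List[Deployment]) -> Optional[timedelta]:
    """Mean time from a failed deployment to its recovery, if any exist."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

In practice these values would more likely come from CI/CD and incident tooling than from hand-built records; the sketch only shows the definitions.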
IaC (Infrastructure as Code)
Platforms leverage IaC.
Test / Release automation
Unit tests
Test in a vacuum
Integration tests
Load test results validated against SLOs.
Tests run as part of the CI/CD pipeline.
Automated rollback
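One possible shape for validating load test results against SLOs in a pipeline step is sketched below. It assumes the load test tool exports per-request latencies and an error count, and that the SLO is expressed as a p95 latency target plus a maximum error rate; the function names and the nearest-rank percentile method are choices made for this example only.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def load_test_meets_slo(
    latencies_ms: Sequence[float],
    error_count: int,
    p95_target_ms: float,
    max_error_rate: float,
) -> bool:
    """Return True when both the latency and the error-rate targets hold."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("the load test produced no requests")
    p95 = percentile(latencies_ms, 95)
    error_rate = error_count / total
    return p95 <= p95_target_ms and error_rate <= max_error_rate
```

A CI/CD job could call `load_test_meets_slo` after the load test stage and fail the build, or trigger a rollback, when it returns False.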
Business Continuity Plan for Recovering Service(s)
Capacity planning review
Show saturation of the service compared to load test and production peak load.
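A simple way to express this comparison is the ratio of production peak load to the highest load the service sustained in load testing. The sketch below assumes both are measured in the same unit (for example requests per second); the 0.8 review threshold is an illustrative assumption, not a stated requirement.

```python
def saturation(production_peak: float, tested_capacity: float) -> float:
    """Ratio of production peak load to the load-tested capacity.

    Both arguments must use the same unit, e.g. requests per second.
    """
    if tested_capacity <= 0:
        raise ValueError("tested_capacity must be positive")
    if production_peak < 0:
        raise ValueError("production_peak must not be negative")
    return production_peak / tested_capacity


def needs_capacity_review(
    production_peak: float,
    tested_capacity: float,
    threshold: float = 0.8,  # assumed threshold, for illustration only
) -> bool:
    """Flag the service when saturation reaches the chosen threshold."""
    return saturation(production_peak, tested_capacity) >= threshold
```

For instance, a service load-tested to 1,000 requests per second with a production peak of 600 would sit at about 60% saturation and would not be flagged under the assumed threshold.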
Product Management (Security)
Security scanning
Documented procedures for Vulnerability Management
Integrated into CI/CD pipeline (partner with security)