About eNGINE
eNGINE builds Technical Teams. We are a Solutions and Placement firm shaped by decades of interaction with Technical professionals. Our inspiration is continuous learning and engagement with the markets we serve, the talent we represent, and the teams we build. We encourage our Consulting Workforce to find career fulfillment through challenging projects, schedule flexibility, and paid training and certifications. Successful outcomes start and finish with eNGINE.
Role Overview
This role focuses on designing, implementing, and continuously improving end-to-end monitoring and observability capabilities for a large-scale, Azure-based application platform. The position emphasizes building scalable, secure, and reliable observability solutions that provide deep visibility across applications, infrastructure, and networks. The engineer will work closely with platform, application, operations, and security teams to ensure consistent telemetry, actionable insights, and effective incident response.
Core Responsibilities
- Architect and deliver comprehensive monitoring and observability solutions spanning application, infrastructure, network, and user experience layers in Azure-hosted environments.
- Define and implement telemetry standards for metrics, logs, and traces, ensuring consistent data collection and alignment with service reliability objectives.
- Deploy, configure, and maintain application performance monitoring tools to enable transaction tracing, dependency mapping, anomaly detection, and root-cause analysis.
- Integrate cloud-native monitoring services with third-party platforms to provide unified visibility, cross-domain correlation, and centralized analysis.
- Design and maintain dashboards, health indicators, synthetic tests, and alerting strategies that emphasize actionable signals and reduce alert fatigue.
- Establish performance baselines, service-level indicators, and service-level objectives; track trends and deviations to support proactive operations.
- Implement network and digital experience monitoring, including path visibility, endpoint testing, and internet/WAN performance analysis.
- Monitor infrastructure components such as servers, containers, cloud services, and network devices; surface availability, capacity, and performance insights.
- Embed observability practices into CI/CD pipelines and infrastructure-as-code workflows to ensure new services meet defined monitoring standards by default.
- Develop operational runbooks, escalation procedures, and response playbooks; support incident investigation and post-incident analysis with data-driven findings.
- Perform capacity planning and performance trend analysis; recommend optimization, right-sizing, and resilience improvements.
- Ensure monitoring solutions align with security, governance, and compliance requirements; maintain clear documentation and audit-ready evidence.
- Provide documentation, knowledge transfer, and guidance on monitoring tools, standards, and operational best practices.
Required Qualifications
- 5+ years of experience implementing monitoring and observability solutions in cloud or hybrid environments supporting critical applications.
- Hands-on expertise with enterprise observability tooling across multiple domains, including production deployment, advanced configuration, and operational support.
Application Performance Monitoring (APM)
- Experience with at least one APM platform (e.g., AppDynamics, Dynatrace, New Relic, or equivalent).
- Proven ability to instrument applications for transaction tracing, code-level diagnostics, service mapping, and anomaly detection.
- Strong experience designing APM dashboards and tuning alert thresholds and baselines.
Network and Digital Experience Monitoring
- Experience with network performance or digital experience monitoring tools (e.g., ThousandEyes, NetScout, Kentik, or equivalent).
- Proficiency with synthetic testing, path visualization, and network performance analysis across on-prem, cloud, and internet paths.
- Ability to configure agents, endpoint tests, and multi-hop path monitoring to correlate user experience with network conditions.
Infrastructure Monitoring and Event Management
- Experience with infrastructure monitoring platforms (e.g., SolarWinds, SCOM, Datadog, Prometheus/Grafana, or equivalent).
- Proven ability to monitor compute, storage, containers, network devices, and cloud services, including availability and capacity reporting.
- Experience implementing alert routing, event correlation, and de-duplication strategies.
Cloud and Observability Fundamentals
- Strong experience with Azure monitoring services, including Azure Monitor, Log Analytics (KQL), and Application Insights.
- Solid understanding of observability concepts, including distributed tracing, metrics, and centralized logging.
- Familiarity with OpenTelemetry concepts, data pipelines, and vendor-neutral instrumentation approaches.
Automation and Operations
- Scripting or automation experience using PowerShell, Python, or Bash to manage monitoring configuration, agent deployment, testing, and reporting.
- Strong understanding of networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP), CDN concepts, and performance troubleshooting.
- Experience supporting incident response and performance troubleshooting across application, infrastructure, and network domains.
- Clear written and verbal communication skills, with the ability to collaborate effectively across engineering and operations teams.
Preferred Qualifications
- Experience designing observability solutions in environments with heightened security, governance, or compliance requirements.
- Exposure to centralized log management, SIEM, or SOAR platforms and their integration with monitoring and APM tools.
- Experience integrating monitoring and alerting with IT service management platforms for incident and problem workflows.
- Familiarity with infrastructure-as-code tools and practices, and embedding observability controls directly into provisioning pipelines.
- Understanding of reliability engineering practices, including SLIs, SLOs, error budgets, and reliability reviews.
- Ability to develop custom instrumentation, telemetry, or automation using one or more of the following:
  - Java: OpenTelemetry agents, custom instrumentation, transaction tagging, and synthetic testing utilities.
  - .NET (C#): Instrumentation of services, auto-instrumentation configuration, custom exporters, and health probes.
  - Python: Automation scripts, collectors/exporters, synthetic tests, and API-based monitoring integrations.