About Neighborly:
Neighborly is a local network of home service brands that connects you with vetted local experts. Our family of service professionals works to rigorous quality standards to repair, maintain, and enhance your home. With pros living in your community, scheduling is quick and convenient.
Engineer II (DevOps Engineer)
Role Summary:
We are seeking an Engineer II (SRE) to join our Platform Engineering team. This role emphasizes reliability, observability, and automation while contributing to a shared internal platform that enables product teams to deploy and operate services safely and efficiently.
You will work at the intersection of cloud infrastructure, Kubernetes, CI/CD, DevSecOps, and observability, helping define and operate a reliable platform using SRE principles such as SLOs, error budgets, and blameless incident response. This is an excellent role for someone early in their SRE/Platform career who wants to grow from “keeping systems running” into engineering for reliability at scale.
Key Responsibilities:
Reliability & SRE Practices:
- Operate and improve platform reliability using SRE concepts (SLIs, SLOs, error budgets).
- Support incident response, participate in on-call rotations, and contribute to blameless postmortems.
- Identify recurring reliability risks and help drive remediation through automation and design improvements.
- Track and improve service health indicators such as latency, availability, and error rates.
Observability & Production Visibility:
- Configure and maintain Datadog for:
  - APM (tracing and performance analysis).
  - Centralized logging.
  - Infrastructure and Kubernetes monitoring.
  - Dashboards, alerts, and SLOs.
- Help define meaningful alerts focused on customer impact rather than raw infrastructure noise.
- Partner with application teams to improve instrumentation and production visibility.
Platform & Cloud Engineering:
- Support the operation of cloud workloads on AWS, following reliability and security best practices.
- Assist in managing Kubernetes clusters, including deployment patterns, scaling behavior, and failure handling.
- Contribute to platform capabilities that provide “golden paths” for application teams.
Infrastructure as Code & Automation:
- Build and maintain cloud and platform infrastructure using Terraform.
- Help automate environment provisioning, configuration drift detection, and operational tasks.
- Contribute to reusable platform modules that enable consistent and reliable environments.
CI/CD & Safe Delivery (DevSecOps):
- Support Azure DevOps pipelines for build, test, and deployment automation.
- Integrate DevSecOps controls into delivery workflows (security scanning, secrets management, policy checks).
- Help enable safer deployments through practices such as progressive delivery, rollback automation, and validation gates.
Collaboration & Platform Enablement:
- Work closely with application teams to improve service reliability and operational readiness.
- Contribute to platform documentation, runbooks, operational standards, and self-service guides.
- Participate in sprint planning, reliability reviews, and continuous improvement initiatives.
Required Skills & Experience:
- Experience with AWS (e.g., VPC, EKS, EC2, IAM, S3).
- Hands-on experience with Kubernetes, including deployments and basic troubleshooting.
- Practical experience using Terraform for infrastructure provisioning.
- Experience with Azure DevOps for CI/CD pipelines.
- Experience with Datadog for monitoring, logging, alerting, or APM.
- Understanding of DevSecOps concepts and secure software delivery.
- Familiarity with Linux systems and basic networking concepts.
- Basic scripting skills (Bash, PowerShell, or Python).
Preferred Qualifications:
- Exposure to SRE practices (SLOs, SLIs, error budgets, incident management).
- Hands-on experience with Datadog SLOs, alerts, and dashboards.
- Familiarity with Kubernetes or cloud reliability patterns (autoscaling, health checks, graceful degradation).
- Experience working in multi-environment or multi-tenant platforms.
- Exposure to monitoring tools beyond Datadog (e.g., CloudWatch, Prometheus).
- Relevant certifications (AWS, Kubernetes, Terraform, Datadog).
What Success Looks Like in This Role:
- Clear, actionable observability across the platform using Datadog.
- Improved detection and faster resolution of production incidents (lower MTTD/MTTR).
- Reliable, repeatable infrastructure and application deployments.
- Growing adoption of SRE practices across engineering teams.
- A platform that enables teams to ship quickly without sacrificing reliability.