Business Summary:
The Deltek Global Cloud team focuses on the delivery of first-class services and solutions for our customers. We are an innovative and dynamic team that is passionate about transforming the Deltek cloud services that power our customers' project success. Our diverse, global team works cross-functionally to make an impact on the business. If you want to work in a transformational environment, where education and training are encouraged, consider Deltek as the next step in your career!
Position Responsibilities:
Deltek is looking for a Senior Software Engineer to join our Site Reliability Engineering team. In this role, you will be responsible for the reliability, scalability, and performance of our globally used SaaS platforms. You will bridge the gap between software engineering and infrastructure operations, building the tools, automation, and systems that keep our products running for thousands of customers and millions of users.
This is a high-ownership role in a "never-stop-learning" environment. You will work closely with development teams to embed reliability practices early in the software lifecycle, respond to production incidents, and drive continuous improvements to our observability and operational posture.
Site Reliability & Platform Engineering
- Design, build, and maintain the infrastructure and tooling that underpins Deltek's SaaS platforms at scale.
- Drive reliability improvements across the full stack, spanning application-level resilience patterns through to infrastructure-level fault tolerance.
- Uphold and extend our infrastructure-as-code-first engineering culture, in which all infrastructure changes are made through code and shipped to production via fully automated CI/CD pipelines.
- Develop internal tooling and automation to reduce toil and increase engineering self-service.
Observability & Performance
- Design and maintain comprehensive observability solutions including logging, metrics, tracing, and alerting across our AWS-based infrastructure.
- Proactively identify performance bottlenecks and reliability risks before they impact customers.
- Conduct capacity planning and load testing to ensure systems can scale to meet demand.
Incident Management & On-Call Support
- Participate in an on-call rotation, acting as a first responder for production incidents affecting our SaaS platforms.
- Own post-incident reviews: facilitate blameless post-mortems, identify root causes, and ensure action items are tracked and completed.
- Take pride in leaving systems better than you found them, consistently reducing the frequency and impact of incidents over time.
Collaboration & Engineering Culture
- Partner with software engineering teams to review system designs and architectures with a reliability lens.
- Mentor and provide technical guidance to junior engineers on SRE practices, tooling, and operational excellence.
- Contribute to a strong team culture that is supportive, curious, and focused on doing great work while having fun.
Technology Stack:
- JavaScript / Node.js
- C# / .NET
- Python
- Docker & Kubernetes
- PostgreSQL
- Amazon Web Services (AWS) technologies
- Terraform
Qualifications:
Education
- Bachelor's degree in Computer Science or a related field, or equivalent experience.
Experience
- Minimum of 7 years of overall experience in software development, infrastructure engineering, or site reliability engineering.
- 3+ years of hands-on experience in an SRE, DevOps, or platform engineering role in a production SaaS environment.
- 3+ years applying an automation-first approach to problem-solving using configuration management tools and scripting.
- Strong experience with AWS, including familiarity with services such as EC2, EKS, RDS, S3, CloudWatch, and IAM.
Technical Skills
- Infrastructure-as-Code mindset, with hands-on experience using tools such as Terraform.
- Demonstrated experience in building high-quality products or services.
- Proficiency in at least one scripting/programming language (Python, Node.js, or similar) for automation and tooling development.
- Strong understanding of networking fundamentals: DNS, load balancing, TLS, firewalls, and VPCs.
- Experience with CI/CD pipelines and deployment automation.
- Solid understanding of relational databases (PostgreSQL preferred) including query performance and operational concerns.
- Hands-on experience with observability tooling (e.g., Prometheus/Grafana, CloudWatch, or similar).
Soft Skills
- Strong communication skills: ability to explain work, ask great questions, listen to peers and customers, influence without authority, and give and receive feedback.
- Passion for building software that solves real problems for real people.
- Commitment to writing well-designed, easy-to-test, and maintainable code.
- Calm under pressure and able to lead effectively during high-severity incidents.
- Blameless, growth-oriented mindset with a focus on continuous improvement.