Description
This Site Reliability Engineer position is a technical role within the Technical and Cloud Services group that helps ensure the reliability, scalability, and performance of our infrastructure while driving automation and efficiency in our development process. The engineer provides technical guidance on Cloud Services initiatives to team members and other development teams and helps ensure those initiatives keep improving. The role requires a proactive problem-solving mindset, excellent communication skills, and the ability to collaborate effectively with cross-functional teams that help courts and public safety organizations of all sizes better protect and serve the public. By providing solutions that improve efficiency and response time, you help serve our citizens and make communities safer.
NOTE: This is a hybrid position that requires the candidate to be in the Plano, TX office at least 2 days per week.
Shift Expectation
8 AM – 5 PM Central Time (with a one-hour break)
Responsibilities
- Design, build, and maintain highly available, scalable, and secure infrastructure components to support our applications and services.
- Implement and maintain monitoring, alerting, and logging systems to proactively identify and resolve issues before they impact users.
- Collaborate with development teams to automate deployment pipelines, infrastructure provisioning, and configuration management using tools like Jenkins, Terraform, and Kubernetes.
- Lead incident response and post-mortem activities to identify root causes and implement preventive measures to minimize future incidents.
- Conduct regular performance tuning and optimization of system resources to ensure optimal efficiency and cost-effectiveness.
- Drive continuous improvement initiatives to streamline processes, enhance reliability, and increase productivity across the organization.
- Stay current with industry trends, best practices, and emerging technologies in SRE, DevOps, cloud computing, and automation.
Qualifications
- BS/BA degree in Computer Science, Computer Engineering, Information Systems, or a similar field
- 5-7 years of experience in a Site Reliability and/or DevOps role, with a strong understanding of both disciplines.
- Proficiency in cloud computing platforms such as AWS, Azure, or GCP, including infrastructure as code (IaC) tools like Terraform.
- Hands-on experience with containerization and orchestration tools such as Docker and Kubernetes.
- Expertise in scripting and programming languages such as PowerShell, Python, Bash, or Go for automation and tooling.
- Strong understanding of database concepts and administration, including T-SQL scripting, indexing, and performance tuning.
- Solid understanding of networking concepts, security best practices, and system administration in Windows and Linux environments.
- Strong analytical and problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a timely manner.
- Excellent communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams and stakeholders.
- Creative problem solving, with a demonstrated pattern of applying creative solutions to bridge the gap between product development and environmental considerations.
Required
- AWS
- MS SQL Server and/or PostgreSQL
- Windows Server OS
- Linux OS
- PowerShell
- Python
- IIS
Preferred
- Knowledge of .NET Framework and Languages (C#, VB.NET)
- Experience with monitoring tools, such as Datadog, AWS CloudWatch, and SolarWinds Database Performance Analyzer
- Experience with PagerDuty
- Knowledge of Agile Development, including experience with tools such as Jira
- Knowledge of Web Development Practices and Technologies
- Experience with ticketing systems, such as Microsoft CRM
- Octopus Deploy
- Hangfire
- Elasticsearch
- RabbitMQ
- Apache
- Advanced knowledge of Microsoft Hosting Technologies
- Advanced knowledge of DevOps practices