DevOps Engineer

eSolutionsFirst • Full-time • Vienna, VA, US

Role Overview:

We are seeking a highly experienced and skilled Infrastructure & Site Reliability Engineer to join our team and take full ownership of the infrastructure, site reliability, and the entire production system for our cutting-edge Agentic AI Platform. You will be responsible for designing, building, maintaining, and scaling our critical systems, ensuring their reliability, performance, security, and cost-efficiency. This role requires a deep understanding of system architecture, automation, and a proactive approach to preventing and resolving production issues.

Responsibilities:

Own the design, implementation, and management of scalable, reliable, and secure cloud infrastructure across the entire production environment on platforms like AWS, Azure, or GCP.
Be responsible for the overall site reliability and performance of the platform, implementing SLOs/SLAs and ensuring high availability.
Develop and maintain robust CI/CD pipelines for automated building, testing, and deployment of our AI platform components.
Implement and manage infrastructure as code (IaC) using tools like Terraform or CloudFormation.
Design, set up, and maintain comprehensive monitoring, logging, alerting, and tracing systems to gain deep visibility into system health and performance.
Proactively identify and resolve complex infrastructure and production issues, often before they impact users.
Own and enforce security best practices across the infrastructure and application stack, ensuring data protection and access control.
Ensure the platform adheres to relevant compliance standards and regulations.
Manage and optimize cloud infrastructure costs, implementing strategies for cost efficiency and reporting.
Collaborate closely with engineering teams to optimize application performance, scalability, and reliability throughout the development lifecycle.
Support MLOps practices, including robust model deployment, versioning, and monitoring in production.
Manage containerized workloads and container orchestration using tools like Docker and Kubernetes.
Automate repetitive tasks through scripting (e.g., Python, Bash).
Participate in on-call rotations to support production systems and drive post-mortem analysis for continuous improvement.
Stay current with emerging trends and technologies in cloud computing, SRE, MLOps, and AIOps.

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Minimum of 7 years of professional experience in Infrastructure Engineering, Site Reliability Engineering (SRE), DevOps, or a related role with significant production system ownership.
Extensive experience designing, building, and managing infrastructure on at least one major cloud provider (AWS, Azure, or GCP).
Proven experience with infrastructure as code tools (Terraform, CloudFormation, etc.).
Strong experience designing and implementing robust CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, CircleCI, etc.).
Deep experience with containerization and orchestration (Docker, Kubernetes).
Proficiency in scripting languages (Python, Bash).
Extensive experience with monitoring, logging, alerting, and tracing tools (Prometheus, Grafana, ELK stack, Datadog, New Relic, etc.).
Solid understanding of networking concepts, security principles, and database management.
Experience supporting AI/ML workloads and understanding of MLOps concepts is a strong plus.
Experience with AIOps platforms and practices for automating IT operations, incident response, and performance optimization.
Excellent problem-solving, debugging, and analytical skills, particularly in complex distributed systems.
Strong communication and collaboration abilities, with experience working across engineering teams.

Desired Skills: