Role Overview:
We are seeking a highly experienced and skilled Infrastructure & Site Reliability Engineer to join our team and take full ownership of the infrastructure, site reliability, and the entire production system for our cutting-edge Agentic AI Platform. You will be responsible for designing, building, maintaining, and scaling our critical systems, ensuring their reliability, performance, security, and cost-efficiency. This role requires a deep understanding of system architecture, automation, and a proactive approach to preventing and resolving production issues.
Responsibilities:
- Own the design, implementation, and management of scalable, reliable, and secure cloud infrastructure across the entire production environment on platforms like AWS, Azure, or GCP.
- Be responsible for the overall site reliability and performance of the platform, implementing SLOs/SLAs and ensuring high availability.
- Develop and maintain robust CI/CD pipelines for automated building, testing, and deployment of our AI platform components.
- Implement and manage infrastructure as code (IaC) using tools like Terraform or CloudFormation.
- Design, set up, and maintain comprehensive monitoring, logging, alerting, and tracing systems to gain deep visibility into system health and performance.
- Identify and resolve complex infrastructure and production issues proactively, ideally before they affect users.
- Own and enforce security best practices across the infrastructure and application stack, ensuring data protection and access control.
- Ensure the platform adheres to relevant compliance standards and regulations.
- Manage and optimize cloud infrastructure costs, implementing cost-efficiency strategies and cost reporting.
- Collaborate closely with engineering teams to optimize application performance, scalability, and reliability throughout the development lifecycle.
- Support MLOps practices, including robust model deployment, versioning, and monitoring in production.
- Manage containerized workloads and container orchestration platforms (e.g., Docker, Kubernetes).
- Automate repetitive tasks through scripting (e.g., Python, Bash).
- Participate in on-call rotations to support production systems and drive post-mortem analysis for continuous improvement.
- Stay current with emerging trends and technologies in cloud computing, SRE, MLOps, and AIOps.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Minimum of 7 years of professional experience in Infrastructure Engineering, Site Reliability Engineering (SRE), DevOps, or a related role with significant production system ownership.
- Extensive experience designing, building, and managing infrastructure on at least one major cloud provider (AWS, Azure, or GCP).
- Proven experience with infrastructure as code tools (Terraform, CloudFormation, etc.).
- Strong experience designing and implementing robust CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, CircleCI, etc.).
- Deep experience with containerization and orchestration (Docker, Kubernetes).
- Proficiency in scripting languages (Python, Bash).
- Extensive experience with monitoring, logging, alerting, and tracing tools (Prometheus, Grafana, ELK stack, Datadog, New Relic, etc.).
- Solid understanding of networking concepts, security principles, and database management.
- Experience supporting AI/ML workloads and an understanding of MLOps concepts are a strong plus.
- Experience with AIOps platforms and practices for automating IT operations, incident response, and performance optimization.
- Excellent problem-solving, debugging, and analytical skills, particularly in complex distributed systems.
- Strong communication and collaboration abilities, with experience working across engineering teams.
Desired Skills:
- Experience with configuration management tools (Ansible, Chef, Puppet).
- Experience with serverless computing.
- Knowledge of advanced security practices for cloud environments.
- Experience with database administration and performance tuning.