As a DevOps/MLOps Engineer, you will design, build, and maintain scalable and highly available cloud infrastructure for development, testing, and production environments, specifically tailored for AI projects. You will drive the automation, scaling, and optimization of AI workflows, ensuring that robust, high-performing, and cost-effective solutions are delivered in production. Your collaboration with cross-functional teams will be key to maintaining a reliable and efficient AI ecosystem, from development through deployment and continuous improvement.
Responsibilities:
- Develop scalable and highly available cloud infrastructure for AI projects.
- Automate CI/CD pipelines for applications, microservices, and model deployment using tools like Bitbucket and GitHub Actions to streamline development and operations.
- Leverage Kubernetes for orchestration and high availability of containerized applications and services.
- Use IaC tools such as Terraform, AWS CloudFormation, or Ansible to provision cloud infrastructure and automate its management.
- Support machine learning workflows, including versioning, deployment, and retraining, using tools like MLflow and Kubeflow Pipelines for efficient orchestration and tracking; a basic understanding of these workflows is expected.
- Set up alerting and use monitoring tools such as Prometheus and Grafana to detect and resolve issues in real time, keeping systems and models robust.
- Deploy and manage scalable microservices architectures using Amazon ECS, aligning with business objectives to ensure high availability, reliability, and flexibility.
- Collaborate with development, operations, and machine learning teams to integrate workflows and optimize models for production environments, focusing on performance and seamless scalability.
- Identify and resolve bottlenecks in infrastructure and workflows to optimize performance and efficiency.
- Use tools like Jira for task tracking, sprint management, and ensuring transparency across DevOps and MLOps processes.
- Apply Agile principles to enable iterative delivery, continuous feedback, and faster deployment cycles.
Qualifications:
- Experience: 1+ years in DevOps/MLOps or a related role.
- Skills: AWS, Azure, Kubernetes, Terraform, Airflow, Kubeflow, Prometheus, Grafana, ELK stack, Jenkins, Git, GitLab, Bitbucket, or similar tools.
- Education: Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
Desirable skills:
- A continuous improvement mindset, with eagerness to research, learn, and adopt new tools and techniques in DevOps and MLOps.
- Strong commitment to automation and efficiency, with a focus on eliminating manual tasks where possible.
- Hands-on experience with Python, Go, or Bash scripting for automation.
- Problem-solving skills to identify and address bottlenecks before they impact performance.
- Clear and effective communication skills for working with both technical and non-technical stakeholders.
- Flexibility and adaptability to meet the demands of a fast-paced, collaborative environment.
- High attention to detail in implementing scalable and reliable workflows.
- Ability to manage multiple priorities and projects independently, delivering high-quality results within deadlines.
- Strong accountability for maintaining system reliability and performance, with a focus on long-term sustainability.
- Proven track record of troubleshooting and resolving infrastructure and system issues efficiently.
- Familiarity with security best practices in DevOps/MLOps, ensuring the integrity and safety of systems and data.
- Knowledge of microservices architecture and deployment.
- Adept in using Git for efficient version control, including managing code repositories, handling branch and merge workflows, resolving conflicts, and maintaining a clean, organized version history.
- Familiarity with using Jira or similar project management tools to track progress, manage tasks, and collaborate with cross-functional teams efficiently.
- Willingness to support team success in a startup environment by contributing beyond assigned responsibilities to enhance organizational growth and efficiency.