Key Responsibilities
- Design, implement, and maintain scalable and secure cloud infrastructure to support AI/ML model training and deployment.
- Automate the provisioning, deployment, monitoring, and management of infrastructure and services.
- Build and maintain CI/CD pipelines for both traditional software and machine learning models (MLOps).
- Implement infrastructure-as-code (IaC) using tools like Terraform, Ansible, or CloudFormation.
- Ensure system reliability and availability through monitoring, logging, alerting, and incident response.
- Manage containerization and orchestration using Docker and Kubernetes.
- Apply security best practices across all layers of the infrastructure (cloud, containers, pipelines).
- Optimize resource usage and cost efficiency in cloud environments.
Required Skills and Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in DevOps, Cloud Engineering, or a similar role.
- Hands-on experience with AWS, GCP, or Azure (AI/ML services experience preferred).
- Proficiency in scripting languages such as Python and Bash.
- Advanced Linux administration and troubleshooting skills.
- Intermediate shell or Windows PowerShell scripting skills for automation, monitoring, and system tasks.
- Experience with CI/CD tools such as Jenkins or GitHub Actions.
- Strong knowledge of Docker, Kubernetes, and Helm.
- Experience with monitoring/logging tools (e.g., Prometheus, Grafana, ELK stack).
- Experience setting up cloud alerting (e.g., SMS notifications, billing alerts).
- Understanding of networking, security best practices, and system architecture.
Skills: cloud, CI/CD, DevOps, Kubernetes, infrastructure