We are seeking two highly skilled Cloud and AI Software Engineers to join our team. The selected candidates will play a key role in designing, developing, and maintaining robust, scalable, and distributed systems that support enterprise AI workloads.
About the Role
This role requires expertise in Kubernetes, distributed cloud architectures, and DevOps best practices, along with experience in building tools and frameworks to accelerate the machine learning model development lifecycle.
Responsibilities
- Define architecture and build scalable, distributed systems to support enterprise AI workloads.
- Lead implementation of critical systems while ensuring reliability, performance, and security.
- Collaborate with cross-functional teams to maximize infrastructure efficiency and support compute-intensive AI/ML workloads.
- Integrate services within the machine learning model development lifecycle.
- Implement CI/CD best practices for cloud and AI services.
- Analyze, monitor, and maintain production systems.
- Participate in incident response and take on-call responsibilities.
- Contribute to setting and improving development standards and best practices.
Qualifications
- Bachelor’s degree in Computer Science, AI/ML, or a related field, with 5+ years of relevant experience.
- Strong hands-on experience running Kubernetes in production and working with container technologies such as Docker.
- Expertise in event-driven, distributed, and cloud-native architectures.
- Proficiency in Python, Golang, or similar programming languages.
- Knowledge of DevOps practices, including automation, CI/CD, and monitoring.
- Strong understanding of scalability, reliability, and security best practices.
- Excellent communication skills and the ability to work collaboratively with stakeholders.
Required Skills
- Experience with databases and blob storage systems.
- Familiarity with Kubeflow, Ray, Kueue, or Flyte for ML orchestration.
- Hands-on experience with production system management, monitoring, and analysis.
- Understanding of the machine learning development lifecycle.
- Working knowledge of cloud infrastructure (AWS, GCP, or Azure).