Canada (Remote) - Work in EST hours
Experience:
- 6-10 years of experience
We are seeking a highly skilled DevOps & MLOps Engineer with 5+ years of experience to architect, deploy, and optimize the infrastructure for our commercial Generative AI product. This role is ideal for professionals who excel at the intersection of DevOps, MLOps, and AI infrastructure, ensuring secure, scalable, and cost-efficient LLM deployments. You will work with cutting-edge technologies such as LLMs, vector databases, Databricks, and GPU scaling, contributing to the fine-tuning and large-scale deployment of AI models.
Key Responsibilities
1. Cloud & Hybrid Infrastructure Management
- Architect and maintain secure, scalable cloud infrastructure on AWS (preferred), GCP, or hybrid-cloud setups.
- Deploy GPU-accelerated compute clusters on AWS for cost-efficient model training and inference.
- Implement best practices for VPC networking, IAM security, encryption, and access controls.
2. MLOps & Model Deployment
- Build and maintain end-to-end MLOps pipelines for LLM training, fine-tuning, and inference.
- Optimize GPU utilization, autoscaling, and resource allocation for large-scale LLM workloads.
- Integrate Databricks & MLflow for scalable model training and tracking.
- Deploy models with TorchServe, Triton, vLLM, or Ray Serve for efficient inference.
3. CI/CD & Automation
- Develop CI/CD pipelines for model versioning, API services, and infrastructure automation using Terraform and GitHub Actions.
- Automate model deployment & rollback strategies for reliable AI system updates.
4. Observability, Performance Tuning & Cost Optimization
- Implement monitoring & logging tools (Prometheus, Grafana, CloudWatch) for LLM performance tracking.
5. Vector Databases & Retrieval-Augmented Generation (RAG)
- Deploy and optimize vector databases (Pinecone, FAISS, Weaviate, ChromaDB) for RAG-based LLMs.
- Improve search and retrieval efficiency to enhance AI model responses.
6. Security & Compliance
- Ensure secure AI model deployments with role-based access, encryption, and cloud security best practices.
- Comply with GDPR, SOC 2, and enterprise AI security requirements.
Required Qualifications
- 5+ years of experience in DevOps, MLOps, or AI Infrastructure Engineering.
- Strong expertise in AWS (preferred), GCP, or hybrid cloud deployments.
- Hands-on experience with deploying and scaling LLMs in production.
- Proficiency in Databricks, MLflow, and Spark-based ML workflows.
- Strong knowledge of Kubernetes, Docker, Terraform, and CI/CD tools.
- Experience with GPU scaling, model quantization, and inference acceleration.
- Familiarity with LLM model serving (AWS SageMaker, Amazon Bedrock).
- Expertise in vector databases (Pinecone, FAISS, Weaviate, ChromaDB) for RAG workflows.
- Solid understanding of network security, IAM, and encryption.
Nice-to-Have Skills
- Experience with multi-cloud deployments & on-prem AI infrastructure.
- Familiarity with fine-tuning LLMs using LoRA, DeepSpeed, or Hugging Face.
- Exposure to AI cost optimization strategies (Spot Instances, Serverless AI, GPU scheduling).
- Knowledge of LLM observability tools (WhyLabs, Arize AI, LangSmith).
Location:
- Remote (EST hours). Only Canadian candidates will be considered.
Educational Qualifications:
- Engineering degree (BE/ME/BTech/MTech/BSc/MSc).
- Technical certifications in multiple technologies are desirable.