Canada (Remote) - Work in EST hours
Experience:
- 6-10 years of experience
We are seeking a highly skilled DevOps & MLOps Engineer with 5+ years of experience to architect, deploy, and optimize the infrastructure for our commercial Generative AI product. This role is ideal for professionals who excel at the intersection of DevOps, MLOps, and AI infrastructure, ensuring secure, scalable, and cost-efficient LLM deployments. You will work with cutting-edge technologies such as LLMs, vector databases, Databricks, and GPU scaling, contributing to the fine-tuning and large-scale deployment of AI models.
Key Responsibilities
1. Cloud & Hybrid Infrastructure Management
- Architect and maintain secure, scalable cloud infrastructure on AWS (preferred), GCP, or hybrid-cloud setups.
- Deploy GPU-accelerated compute clusters on AWS for cost-efficient model training and inference.
- Implement best practices for VPC networking, IAM security, encryption, and access controls.
2. MLOps & Model Deployment
- Build and maintain end-to-end MLOps pipelines for LLM training, fine-tuning, and inference.
- Optimize GPU utilization, autoscaling, and resource allocation for large-scale LLM workloads.
- Integrate Databricks & MLflow for scalable model training and tracking.
- Deploy models with TorchServe, Triton, vLLM, or Ray Serve for efficient inference.
3. CI/CD & Automation
- Develop CI/CD pipelines for model versioning, API services, and infrastructure automation using Terraform and GitHub Actions.
- Automate model deployment & rollback strategies for reliable AI system updates.
4. Observability, Performance Tuning & Cost Optimization
- Implement monitoring & logging tools (Prometheus, Grafana, CloudWatch) for LLM performance tracking.
5. Vector Databases & Retrieval-Augmented Generation (RAG)
- Deploy and optimize vector databases (Pinecone, FAISS, Weaviate, ChromaDB) for RAG-based LLMs.
- Improve search and retrieval efficiency to enhance AI model responses.
6. Security & Compliance
- Ensure secure AI model deployments with role-based access, encryption, and cloud security best practices.
- Comply with GDPR, SOC 2, and enterprise AI security requirements.
Required Qualifications
- 5+ years of experience in DevOps, MLOps, or AI Infrastructure Engineering.
- Strong expertise in AWS (preferred), GCP, or hybrid cloud deployments.
- Hands-on experience with deploying and scaling LLMs in production.
- Proficiency in Databricks, MLflow, and Spark-based ML workflows.
- Strong knowledge of Kubernetes, Docker, Terraform, and CI/CD tools.
- Experience with GPU scaling, model quantization, and inference acceleration.
- Familiarity with LLM model serving (AWS SageMaker, Amazon Bedrock).
- Expertise in vector databases (Pinecone, FAISS, Weaviate, ChromaDB) for RAG workflows.
- Solid understanding of network security, IAM, and encryption.
Nice-to-Have Skills
- Experience with multi-cloud deployments & on-prem AI infrastructure.
- Familiarity with fine-tuning LLMs using LoRA, DeepSpeed, or Hugging Face.
- Exposure to AI cost optimization strategies (Spot Instances, Serverless AI, GPU scheduling).
- Knowledge of LLM observability tools (WhyLabs, Arize AI, LangSmith).
Location:
- Remote (EST hours). Only Canadian candidates will be considered.
Educational Qualifications:
- Engineering degree (BE/ME/BTech/MTech/BSc/MSc).
- Technical certifications in multiple technologies are desirable.