We are looking for a highly skilled DevOps Engineer with strong experience in DevSecOps and MLOps / LLMOps to design, automate, and secure our development and deployment pipelines. You will play a critical role in building scalable, secure, and production-ready infrastructure to support both traditional applications and machine learning / LLM workloads. This role demands a strong understanding of Kubernetes, CI/CD pipelines, infrastructure-as-code, model lifecycle management, and cloud-native security practices.
DevOps & Infrastructure
- Design, implement, and manage scalable, fault-tolerant infrastructure in cloud or hybrid environments (AWS / GCP / Azure / Hetzner / bare metal).
- Develop and maintain CI/CD pipelines using tools like GitHub Actions, GitLab CI, Jenkins, or ArgoCD.
- Manage containerized workloads using Kubernetes, Helm, and Docker.
- Implement infrastructure as code (IaC) with Terraform / OpenTofu / Terragrunt.
- Monitor system performance, availability, and cost efficiency using Prometheus, Grafana, ELK, or Loki.
DevSecOps
- Integrate security automation into CI/CD pipelines (SAST, DAST, SCA, dependency scanning).
- Implement policy as code using OPA / Conftest and enforce RBAC / IAM best practices.
- Manage secrets and credentials using tools like Vault, Sealed Secrets, or External Secrets Operator.
- Set up vulnerability scanning and runtime protection (e.g., Trivy, Falco, Aqua Security).
- Define security baselines for infrastructure, network, and containers.
MLOps / LLMOps
- Collaborate with ML and data teams to operationalize model training, evaluation, and deployment.
- Build automated pipelines for data preprocessing, model training, and inference deployment using tools like Kubeflow, MLflow, or Airflow.
- Manage feature stores, model registries, and monitoring for drift, latency, and accuracy.
- Support LLM pipelines: prompt orchestration, fine-tuning, vector DB integrations, and retrieval-augmented generation (RAG).
- Optimize GPU-based workloads and manage distributed training / inference infrastructure.
Required Skills & Qualifications
- Languages: Python, Bash, Go (preferred)
- IaC Tools: Terraform / OpenTofu / Terragrunt
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
- Containers: Docker, Kubernetes, Helm
- Monitoring: Prometheus, Grafana, Loki, ELK
- Security: Trivy, Falco, Vault, OPA, Snyk
- MLOps Tools: MLflow, Kubeflow, Airflow, Weights & Biases
- Cloud Platforms: AWS / GCP / Azure / Hetzner
- Databases: PostgreSQL, Redis, Vector DBs (Milvus, Pinecone, Weaviate, Qdrant)
Nice to Have
- Experience with GPU orchestration on Kubernetes (NVIDIA operator, KServe).
- Exposure to LLM frameworks (LangChain, LlamaIndex, vLLM, Ollama).
- Knowledge of data governance and compliance (GDPR, SOC 2).
- Experience with self-hosted runners, GitOps, or multi-cluster management.
- Familiarity with event-driven systems (Kafka, NATS, or Redis Streams).
What We Offer
- Opportunity to work on challenging, large-scale systems with real-world impact.
- Collaborative team culture with a focus on learning and innovation.
- Competitive compensation and growth opportunities.