01 · THE OPPORTUNITY
A founding infrastructure leadership role.
Anthrobyte builds enterprise AI systems that move organisations from pilot to production-grade adoption. As we scale, we need a platform engineer who can own the full infrastructure vision: model serving, MLOps pipelines, GPU cluster management, observability, and cloud architecture — all as one coherent, production-grade system.
This is less a traditional DevOps role and more a founding platform seat. You will work directly with AI engineers, product leadership, and enterprise clients to define how AI transformation is deployed, scaled, and made reliable inside complex organisations. You will be the person who makes AI products real.
HORIZON ONE · FOUNDING MANDATE
AI DevOps Engineer
Own the full platform layer — AI infrastructure, MLOps pipelines, model serving, observability, and cloud architecture — as one coherent, production-grade system.
GROWTH TRACK · MERIT-BASED
Lead/Architect/Head of Platform Engineering
Grow into engineering leadership — shaping platform strategy, building and mentoring a DevOps team, and defining how AI infrastructure scales with Anthrobyte's client portfolio.
— 02 · RESPONSIBILITIES
Own the platform. Power the transformation.
Hyperscale AI Infrastructure
– Design and operate AI infrastructure across multi-cloud environments (AWS, GCP, Azure) supporting LLM inference, fine-tuning, and RAG pipelines at production scale
– Architect GPU cluster management and optimise inference throughput using vLLM, Triton Inference Server, TensorRT, or equivalent serving frameworks
– Own infrastructure-as-code (Terraform, Pulumi), delivering reproducible, version-controlled, disaster-recovery-ready environments across all deployments
– Drive multi-region, high-availability architecture decisions that reflect the reliability standards enterprise clients require
MLOps & Model Lifecycle
– Build and maintain MLOps platforms — model versioning, experiment tracking, automated retraining, and deployment pipelines using MLflow, Kubeflow, or equivalent
– Implement CI/CD pipelines (GitHub Actions, ArgoCD, Tekton) that support rapid model iteration without sacrificing production stability
– Define promotion workflows from development to staging to production — with rollback, canary, and blue-green strategies as standard practice
Kubernetes & Container Orchestration
– Lead container orchestration at scale — Kubernetes (EKS/GKE/AKS), Helm charts, service mesh configuration, and auto-scaling strategies for variable AI workloads
– Configure GPU node pools, resource quotas, taints and tolerations, and network policies for secure, efficient AI workload scheduling
– Own production incident response — from detection through resolution to post-mortem and systemic fix
Observability & FinOps
– Own observability end-to-end: latency, GPU utilisation, cost-per-inference, model drift detection, and SLO/SLA dashboards (Prometheus, Grafana, or equivalent)
– Lead FinOps strategy for GPU compute — spot instance management, reserved capacity planning, cost attribution across teams and client engagements
– Surface infrastructure cost and reliability data to leadership and clients in clear, actionable terms
Security, Governance & Compliance
– Enforce security and data governance standards across AI deployments — access controls, audit logging, secret management, and PII handling in inference pipelines
– Support enterprise client compliance requirements — including data residency, model access controls, and audit trail documentation
Cross-Functional Partnership
– Translate AI engineer and product requirements into platform specifications — and push back with alternatives when requirements are unrealistic or unsafe
– Partner with client engineering teams during enterprise AI deployments, acting as the technical infrastructure authority
✶ The Growth Pathway
Demonstrate consistent excellence as a platform engineering leader and the scope expands. You will grow into Lead/Architect/Head of Platform Engineering: building and mentoring a DevOps and MLOps team, shaping infrastructure strategy across Anthrobyte's full client portfolio, and defining what production-grade enterprise AI deployment looks like at scale. This is more than a title; it is a level of ownership that must be earned and continually re-earned.
— 03 · WHO YOU ARE
The profile we are searching for.
You think across the full stack, from YAML to architecture and from GPU cost to enterprise reliability SLAs. You are the kind of engineer who has felt the weight of a production incident at 2 a.m. and built the systems that prevent the next one. You are as comfortable presenting infrastructure trade-offs to a CTO as you are deep in a Terraform module.
You bring:
– 4–6 years in DevOps or platform engineering, including 1–2 years specifically in AI or ML infrastructure
– Demonstrated hyperscale experience — infrastructure supporting millions of daily requests, petabyte-scale data, or multi-region distributed systems
– Deep Kubernetes expertise: GPU node pools, resource quotas, network policies, and production incident ownership
– Hands-on production experience with at least one major LLM serving stack (vLLM, Triton, TGI, Ray Serve, or BentoML)
– Strong Python and scripting capability — you write automation, not just configuration
– Proficiency with IaC (Terraform preferred) and GitOps workflows as standard practice
– Cloud practitioner depth in AWS, GCP, or Azure — particularly compute, networking, and storage for AI workloads
– Clear communication: the ability to translate infrastructure complexity into language that resonates with engineers, product leads, and enterprise clients alike
Bonus signals:
GPU FinOps · Enterprise AI Governance · Open-Source Contributions
Edge / On-Premise LLM Serving · AI Consultancy Experience · Regulated Industry Deployments
Multi-Cloud Architecture · MLflow or Kubeflow Ownership · Startup 0→1 Environment
— 04 · WHAT WE OFFER
A rare kind of opportunity.
– Founding platform engineering ownership — greenfield infrastructure built your way, with your architectural decisions
– Direct access to engineering and product leadership from day one
– The mandate to build the platform function the way it should be built — AI-native, observable, and enterprise-reliable
– Active involvement in enterprise AI deployment engagements — real infrastructure challenges, real clients, real consequences
– Access to GPU compute resources, premium cloud credits, and AI tooling subscriptions
– Competitive compensation benchmarked to senior engineering market rates in Hyderabad
– A culture where great infrastructure work is seen and celebrated rather than left in the background