We are looking for a DevOps/SRE engineer to build and operate the foundational systems that power our data, analytics, and AI platform. This is, at its core, an infrastructure and DevOps role: you will own the cloud infrastructure, deployment pipelines, orchestration, networking, and observability that everything else runs on.
You will work across the infrastructure layer beneath our data and ML/AI workflows (cloud provisioning, container orchestration, CI/CD, and monitoring), keeping our platform reliable, scalable, and secure.
If you are excited about AI and want to grow into building, hosting, and operating agentic AI systems, you will have ample opportunity to do so. That work is a welcome bonus rather than a prerequisite; the heart of this role is building and maintaining the infrastructure around those platforms.
What You’ll Do
DevOps & Platform Engineering
• Deploy, configure, and maintain shared platform services as containerized workloads, including end-to-end ownership of networking, access, and connectivity between services.
• Manage cloud infrastructure, including container registries, managed identities, Key Vault secrets, storage backends, and virtual network configurations.
• Build and maintain CI/CD pipelines, branch protection policies, and release management workflows across repositories.
• Continuously evaluate and adopt tools and technologies that improve platform reliability, developer experience, and team velocity.
AI & Agentic Platform (Growth Opportunity)
For those interested in growing into AI systems work, there is real room to do so over time — though none of the following is required to be successful in this role:
• Support the buildout and operationalization of agentic AI workflows, including agent hosting, lifecycle management, and integration with Model Context Protocol (MCP) servers.
• Help build shared tooling and infrastructure that enables data scientists to develop, test, and deploy agents with minimal friction.
• Contribute to evaluation frameworks and quality standards for AI agents, including automated benchmarking, regression testing, and production-readiness criteria.
• Extend observability and reliability practices into agent execution environments, including logging, tracing, and performance monitoring.
What We’re Looking For
Required
• 3+ years of experience in infrastructure, DevOps, platform engineering, or SRE roles, with a clear track record of building and maintaining production systems.
• Solid understanding of containerization and cloud infrastructure — Docker, Kubernetes, and at least one major cloud provider.
• Hands-on experience deploying and operating containerized services in cloud environments, including configuring networking, load balancing, and service-to-service connectivity.
• Experience building and maintaining CI/CD pipelines, Git-based release management, and branch protection workflows.
• Experience with workflow orchestration tools (Prefect, Airflow, Dagster, or similar) in production environments.
• Familiarity with monitoring and observability tooling: health metrics, alerting, logging, and tracing.
• Strong documentation habits and the ability to communicate technical architecture clearly to diverse stakeholders.
Preferred (Nice to Have)
• A genuine interest in AI and a desire to learn and grow into building, hosting, and operating AI agents and agentic systems.
• Familiarity with agentic workflow frameworks and protocols (e.g., LangChain, MCP, or similar).