We are redefining how AI infrastructure is built and operated. Our mission is to challenge convention and deliver transformative products powered by state-of-the-art infrastructure—including NVIDIA GB200, MGX, and DGX Grace Hopper platforms—combined with cloud-native software. Our solutions support both centralized AI data centers and distributed AI Radio Access Network (AI-RAN) environments.
We're looking for experienced engineers who thrive on innovation and want to build scalable, production-grade AI infrastructure from the ground up.
Role Overview:
As a Data Center DevOps Engineer, you will be a core member of the infrastructure team, responsible for the reliability, automation, and operational excellence of GPU-based systems supporting AI workloads (training, fine-tuning, and inference). You will own deployment pipelines, operational playbooks, and automation frameworks with a strong focus on Kubernetes and GPU systems.
In this role, you'll partner closely with Staff Engineers, Product Management, Program Management, and Data Center Operations to drive execution from concept to commercialization while maximizing uptime and resource efficiency.
Key Responsibilities:
- Own pre-deployment operations, including rack staging, hardware health validation, monitoring, triage, and troubleshooting.
- Own post-deployment operations, ensuring ongoing system health through monitoring, incident response, and continuous automation improvements.
- Identify operational gaps and design automation to improve reliability, scalability, and efficiency.
- Serve as a bridge between Data Center Operations and Software Engineering teams to align infrastructure and software requirements.
- Contribute to product requirements documents (PRDs) and sprint planning from an operations and reliability perspective.
- Develop and maintain deployment pipelines and operational playbooks for large-scale AI infrastructure.
- Help attract, mentor, and grow engineering talent.
- Lead by example, fostering a culture of humility, ownership, and innovation.
Minimum Qualifications:
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field.
- 5+ years of experience in data center operations, site reliability engineering (SRE), or DevOps.
- Strong experience with Linux system administration, networking, and hardware troubleshooting.
- Hands-on experience automating infrastructure using tools and languages such as Ansible, Terraform, and Python.
Preferred Qualifications:
- Master's degree or relevant Cloud/DevOps certifications.
- Deep hands-on experience with Kubernetes and container orchestration in bare-metal environments.
- Experience with GPU platforms (NVIDIA DGX/HGX), high-performance computing (HPC) clusters, and Ethernet-based fabric management.
- Expertise in building scalable monitoring and alerting systems (Prometheus, Grafana, ELK stack).
- Experience implementing "Day 0, Day 1, and Day 2" automation for large-scale infrastructure deployments.