About the Company
My client is building a high‑performance compute and cloud foundation enabling the next era of AI, scientific research, large‑scale simulation, and advanced analytics. Backed by a multibillion‑dollar investment in one of the world's most advanced GPU and HPC facilities, including a 400 MW data center powering cutting‑edge AI and ML workloads, my client focuses on eliminating infrastructure bottlenecks so innovators can operate at unprecedented scale.
Their mission is to provide the compute capabilities and engineering excellence that help forward‑thinking organizations push past the limits of conventional cloud environments, accelerate breakthrough discoveries, and define the technologies of tomorrow.
Key Responsibilities
- Design, deploy, and maintain scalable, secure AWS cloud infrastructure with Terraform.
- Build and manage Kubernetes clusters supporting high‑performance compute and cloud‑native workloads.
- Collaborate with cross‑functional engineering teams to architect systems that support large‑scale AI, ML, and simulation environments.
- Implement automation across provisioning, CI/CD, and infrastructure lifecycle management.
- Monitor system performance and ensure high availability across distributed cloud environments.
- Troubleshoot and resolve infrastructure, network, and container‑orchestration issues.
- Stay current with modern AWS services, best practices, and emerging technologies in cloud and HPC.
Skills and Qualifications
- 5-8 years of professional experience in Linux system administration and cloud infrastructure engineering.
- Deep proficiency with AWS services, including building production‑grade cloud architectures.
- Strong Kubernetes experience (EKS or self‑managed clusters).
- Expertise with Infrastructure as Code, specifically Terraform.
- Experience with scripting languages such as Python or Bash.
- AWS certifications (Solutions Architect, DevOps Engineer, or equivalent).
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.