Optomi, in partnership with our client, is seeking an IOC Systems Specialist to provide Tier 2 operational support for large-scale HPC cloud environments that run AI workloads.
Candidates should have 2–5 years of experience supporting HPC clusters, strong Kubernetes and Slurm knowledge, experience with incident response/RCA, and familiarity with HPC monitoring and observability tools.
Required Skills:
• HPC cluster operations/support experience in a production NOC/IOC environment
• Kubernetes administration/troubleshooting experience
• Slurm workload manager experience
• Experience with Weka and VAST storage platforms
• Cloud platform knowledge (AWS, Azure, or GCP)
• Understanding of HPC networking and storage technologies
• Strong incident management and root cause analysis skills
This role supports a 24x7 operations environment and is ideal for candidates passionate about high-performance computing, AI infrastructure, and large-scale GPU environments.
U.S. citizens and green card holders only. Onsite in Fort Worth. Direct hire.