About the Company
We build the city where AI lives. ScitiX is building the digital foundation for AI to run reliably over time and scale into repeatable delivery. As models keep improving, real-world impact is still held back by fragmented infrastructure: training, fine-tuning, and inference are spread across disconnected tools and cloud services, with compute, data, orchestration, billing, access control, and compliance out of sync.
ScitiX brings these pieces together with a cloud-native platform for unified management and intelligent scheduling of heterogeneous compute. We pool general-purpose compute, AI accelerators, and HPC across public, private, and hybrid environments to deliver efficient cross-platform scheduling and stable capacity. Centered on “Your data in. Your AI service out.”, we provide an end-to-end path that connects the full AI lifecycle within one system.
With ScitiX, teams move from experiment to production faster, run services more steadily, use resources more efficiently, and manage costs with clearer boundaries. AI doesn’t just run; it keeps running at scale, consistently.
About the Role
Responsible for Kubernetes deployment, day-to-day operation and maintenance, and troubleshooting of each training cluster.
Responsibilities
- Design and develop monitoring and automation features for the cluster management platform, and continuously improve cluster management and control capabilities.
- Assist in analyzing and troubleshooting issues related to cluster containers, operating systems, networks, storage, etc.
- Manage quotas for each business in the cluster, analyze utilization rates, and plan capacity accordingly.
- Participate in the operations on-call rotation, handle faults promptly, and respond to user issues and requirements.
Qualifications
- Bachelor's degree or above in computer science or a related major.
- 3+ years of industry experience, including solid Linux operation, maintenance, and debugging skills, with proficiency in troubleshooting, configuration optimization, and performance analysis.
- Proficient in at least one programming language, such as Python, Go, or Shell.
- Familiar with the Kubernetes architecture and the functional characteristics of each component, with extensive hands-on experience deploying and optimizing Kubernetes CNI, CSI, LB, etc.
- Experience in large-scale training cluster construction and optimization is preferred.
Preferred Skills
- Good communication and coordination skills.
- Demonstrated independent thinking capabilities and troubleshooting skills.