Job responsibilities:
- Design and implement solutions to enhance the reliability and scalability of platforms and applications to accommodate rapidly growing demands.
- Analyze defects, propose improvements, and drive efficiencies in systems and processes.
- Optimize the performance and utilization of AI/ML platforms and infrastructure.
- Develop observability, security, and FinOps tools and orchestration.
- Author and improve the quality of technical engineering documentation.
- Debug and solve issues in a production environment.
- Participate in on-call rotations and escalation workflows.
Required qualifications, capabilities, and skills:
- Formal training or certification in Site Reliability Engineering concepts and 3+ years of applied experience.
- Expertise in programming with Python and modern software engineering practices.
- Experience in designing and implementing large-scale distributed systems and cloud-native architecture.
- Experience developing on cloud platforms, especially AWS, and knowledge of Infrastructure as Code tools such as Terraform.
- Systematic problem-solving and troubleshooting skills in complex systems.
- Excellent communication skills, including working with stakeholders and domain experts across the company to design solutions to user problems.
- Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive.
Preferred qualifications, capabilities, and skills:
- Prior experience working in AI, ML, or data engineering.