Experience: 8-10 years as an IT Infrastructure and Operations Manager
Required Qualification: Bachelor’s degree or equivalent qualification in Information Technology, Computer Science, or a related field
Location: Gurgaon/Bangalore
WHAT YOU’LL DO:
● Provide input on IT operations across DevOps, SecOps, FinOps, and MLOps, and support strategy planning.
● Define service-level agreement (SLA) key performance indicators (KPIs) and build dashboards for all operations, including ML models.
● Monitor service-level dashboards to ensure compliance with KPIs and SLAs, particularly those related to model performance (e.g., inference time, latency, accuracy, and error rates).
● Automate detection of SLA non-compliance and implement corrective actions to address issues.
● Automate tasks such as application and model deployment, configuration, and updates using CI/CD pipelines.
● Build and maintain monitoring and alerting platforms for various applications and ML models, ensuring high availability and real-time performance tracking.
● Apply DevOps methodologies for CI/CD, security, and monitoring across traditional and ML workloads.
● Contribute to infrastructure design for upcoming projects to ensure it is scalable, fault-tolerant, and optimized for both general applications and machine learning pipelines.
● Regularly review and optimize costs for multiple applications and ML workloads running on the cloud.
● Maintain high availability, including for ML model serving, during traffic spikes, disasters, and malicious attacks.
● Work closely with software developers and data scientists to ensure our systems, applications, and ML models are monitored and available 24x7.
WHAT WE LOOK FOR:
● Proven experience in deploying and maintaining multi-tiered infrastructure and applications, including ML pipelines.
● Basic knowledge of Relational Databases and NoSQL databases.
● Experience with real-time failover architecture, principles, and processes.
● Excellent communication and technical writing skills; ability to convey complex technical designs through diagrams, documents, and presentations.
● Ability to work with cross-functional teams to develop automation solutions for services, including ML models.
● Advocacy for a DevOps mindset and culture that promotes collaboration, flexibility, cross-domain knowledge, and knowledge sharing; a self-driven approach to understanding requirements and to developing and coaching others as needed.
● Background in infrastructure and configuration as code platforms (Terraform, Ansible, etc.).
● Hands-on experience managing and scaling CI/CD platforms (Jenkins, GitHub, etc.).
● Availability to participate in on-call rotations and handle after-hours issues.
● Hands-on experience with AWS services (EC2, RDS, S3, IAM, etc.).
● Knowledge of network protocols, monitoring systems (New Relic, Datadog, Grafana, etc.), and database administration (MySQL, Aerospike, MongoDB, etc.).
● Hands-on experience with production-grade container orchestrators (Kubernetes, Nomad).
● Knowledge of Hadoop, AWS Kinesis, HAProxy, and Aerospike is a plus.
● Experience working with log analysis and monitoring tools in a distributed application scenario.
● Experience in troubleshooting system issues causing downtime or performance degradation.
● Experience working in an Agile environment.
● Extensive knowledge of load balancing and auto-scaling in cloud environments.
ADDITIONAL RESPONSIBILITIES IN MLOps:
● Define KPIs specific to MLOps, such as inference time, model accuracy, latency, training time, and error rates, to ensure the effectiveness and reliability of ML models in production.
● Implement real-time monitoring and alerting for ML models to detect deviations in performance and automatically trigger workflows to mitigate SLA breaches.
● Coordinate asynchronously with a globally distributed team working across different time zones to provide 24/7 support for all MLOps activities.
● Optimize cloud resources for ML workloads by utilizing cost-effective solutions such as spot instances for training and analysis, and regularly review and optimize costs associated with GPU/TPU usage.
● Enhance observability and traceability across ML pipelines, ensuring robust audit trails and quick identification of data or model-related issues.
● Develop and implement advanced security measures to protect sensitive data used in ML models and ensure compliance with data security and privacy regulations.
BONUS:
● Experience designing high-availability and cloud-native solutions.
● Python and Bash coding skills.
● Knowledge of Excel.
● Experience in low-latency architectures (Ad Tech RTB, Trading, etc.).
● Genuine interest in Open Source and/or personal projects.
● Solid understanding of security mechanisms for operating systems and cloud services.