DevOps Engineer

Jampp • Full-time • Gurugram, IN • 3d ago

Experience: 8-10 years as an IT Infrastructure and Operations Manager

Required Qualification: Bachelor’s degree or equivalent qualification in Information Technology, Computer Science, or a related field

Location: Gurgaon/Bangalore

WHAT YOU’LL DO:

● Provide input on IT operations across DevOps, SecOps, FinOps, and MLOps, and support strategy planning.

● Define key performance indicators (KPIs) for service-level agreements (SLAs) and build dashboards for all operations, including ML models.

● Monitor service-level dashboards to ensure compliance with KPIs and SLAs, particularly those related to model performance (e.g., inference time, latency, accuracy, and error rates).

● Automate detection of SLA non-compliance and implement corrective actions to address issues.

● Automate tasks such as application and model deployment, configuration, and updates, utilizing CI/CD pipelines.

● Build and maintain monitoring and alerting platforms for various applications and ML models, ensuring high availability and real-time performance tracking.

● Apply DevOps methodologies for CI/CD, security, and monitoring across traditional and ML workloads.

● Contribute to the infrastructure design of upcoming projects to ensure it is scalable, fault-tolerant, and optimized for both general applications and machine learning pipelines.

● Regularly review and optimize costs for multiple applications and ML workloads running on the cloud.

● Maintain high availability, including for ML model serving, during traffic spikes, disasters, and malicious attacks.

● Work closely with software developers and data scientists to ensure our systems, applications, and ML models are monitored and available 24x7.

WHAT WE LOOK FOR:

● Proven experience in deploying and maintaining multi-tiered infrastructure and applications, including ML pipelines.

● Basic knowledge of Relational Databases and NoSQL databases.

● Experience with real-time failover architecture, principles, and processes.

● Excellent communication and technical writing skills; ability to convey complex technical designs through diagrams, documents, and presentations.

● Ability to work with cross-functional teams to develop automation solutions for services, including ML models.

● Advocacy for a DevOps mindset and culture that promotes collaboration, flexibility, cross-domain knowledge, and knowledge sharing. Self-driven, with the ability to understand requirements and to develop and coach others as needed.

● Background in infrastructure and configuration as code platforms (Terraform, Ansible, etc.).

● Hands-on experience managing and scaling CI/CD platforms (Jenkins, GitHub, etc.).

● Availability to participate in on-call rotations and handle after-hours issues.

● Hands-on experience with AWS services (EC2, RDS, S3, IAM, etc.).

● Knowledge of network protocols, monitoring systems (New Relic, Datadog, Grafana, etc.), and database administration (MySQL, Aerospike, MongoDB, etc.).

● Hands-on experience with production-grade container orchestrators (Kubernetes, Nomad).

● Knowledge of Hadoop, AWS Kinesis, HAProxy, and Aerospike is a plus.

● Experience working with log analysis and monitoring tools in a distributed application scenario.

● Experience in troubleshooting system issues causing downtime or performance degradation.

● Experience working in an Agile environment.

● Extensive knowledge of load balancing and auto-scaling in cloud environments.

ADDITIONAL RESPONSIBILITIES IN MLOps:

● Develop and define KPIs specific to MLOps, such as inference times, model accuracy, latency, training times, and error rates, to ensure the effectiveness and reliability of ML models in production.

● Implement real-time monitoring and alerting for ML models to detect deviations in performance and automatically trigger workflows to mitigate SLA breaches.

● Coordinate asynchronously with a globally distributed team working across different time zones to provide 24/7 support for all MLOps activities.

● Optimize cloud resources for ML workloads by utilizing cost-effective solutions such as spot instances for training and analysis, and regularly review and optimize costs associated with GPU/TPU usage.

● Enhance observability and traceability across ML pipelines, ensuring robust audit trails and quick identification of data or model-related issues.

● Develop and implement advanced security measures to protect sensitive data used in ML models and ensure compliance with data security and privacy regulations.

BONUS:

● Experience designing high-availability & cloud-native solutions.

● Python & Bash coding skills.

● Knowledge of Excel.

● Experience in low-latency architectures (Ad Tech RTB, Trading, etc.).

● Genuine interest in Open Source and/or personal projects.

● Solid understanding of security mechanisms for operating systems and cloud services.