Job Title: AI DevOps Engineer — Mid/Senior Level
Experience: 4–7 Years
Location: Raidurg Main Road, Hyderabad
Work Mode: On-site
Work Hours: 2–11 PM
Notice Period: Immediate joiners preferred; notice periods of 15–30 days accepted
About us: At Stackular, we are more than a team – we are a product development community driven by a shared vision. Our values shape who we are, what we do, and how we interact with our peers and our customers. We're not seeking just any engineer; we want individuals who identify with our core values and are passionate about software development.
About the Role
We are looking for a Mid/Senior-Level AI DevOps Engineer with 4–7 years of experience in DevOps, cloud infrastructure, automation, and production deployment environments. This role focuses on building, maintaining, and improving scalable infrastructure and deployment pipelines for AI and machine learning applications.
The ideal candidate should have strong hands-on experience with cloud platforms, CI/CD, Docker, Kubernetes, infrastructure as code, monitoring, and automation, along with a working understanding of AI/ML deployment workflows.
Key Responsibilities
Cloud Infrastructure & DevOps
- Design, deploy, and manage cloud-based infrastructure for AI and software applications.
- Work with cloud platforms such as AWS, Azure, or GCP.
- Build and maintain infrastructure using tools such as Terraform, CloudFormation, and Ansible.
- Support scalable, secure, and reliable environments for production workloads.
- Optimize infrastructure for performance, cost, availability, and operational efficiency.
CI/CD & Automation
- Build and maintain CI/CD pipelines for application and AI service deployments.
- Automate build, testing, deployment, and rollback processes.
- Improve deployment reliability and reduce manual operational tasks.
- Work with tools such as Azure DevOps, GitHub Actions, and Jenkins.
- Create reusable scripts, templates, and automation workflows for engineering teams.
Containerization & Orchestration
- Deploy and manage containerized applications using Docker.
- Work with Kubernetes for application deployment, scaling, networking, and troubleshooting.
- Manage Helm charts and Kubernetes manifests.
- Support production deployments and ensure application availability.
- Troubleshoot container, cluster, and infrastructure-related issues.
AI / MLOps Support
- Support deployment and monitoring of AI/ML models in production environments.
- Collaborate with data scientists, ML engineers, and backend engineers to streamline model deployment workflows.
- Assist with model versioning, model serving, and release automation.
- Work with MLOps tools such as MLflow, Kubeflow, SageMaker, Vertex AI, Azure ML, Airflow, or similar platforms.
- Support infrastructure for AI services, APIs, and model inference workloads.
Monitoring, Logging & Reliability
- Implement and maintain monitoring, logging, tracing, and alerting systems.
- Use tools such as Prometheus, Grafana, ELK Stack, Datadog, New Relic, CloudWatch, or Azure Monitor.
- Monitor application and infrastructure performance.
- Participate in incident response, root cause analysis, and production support.
- Help improve system reliability, uptime, and operational visibility.
Security & Compliance
- Apply DevSecOps practices across infrastructure and deployment pipelines.
- Manage access controls, IAM roles, secrets, and secure configuration.
- Support vulnerability scanning, patching, and security hardening.
- Ensure cloud and deployment environments follow security best practices.
- Work with tools such as HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager.
Required Qualifications
- 4–7 years of experience in DevOps, Cloud Engineering, Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering.
- Strong hands-on experience with at least one cloud platform: AWS, Azure, or GCP.
- Experience building and managing CI/CD pipelines.
- Strong experience with Docker and containerized deployments.
- Working experience with Kubernetes in production or near-production environments.
- Experience with infrastructure-as-code tools such as Terraform, Ansible, or CloudFormation.
- Strong scripting skills using Python, Bash, or PowerShell.
- Experience with monitoring and logging tools such as Prometheus, Grafana, ELK, Datadog, New Relic, or CloudWatch.
- Good understanding of networking, Linux systems, security, and cloud architecture.
- Familiarity with AI/ML workflows, model deployment, or MLOps concepts.
- Experience supporting production applications and troubleshooting infrastructure issues.
Preferred Qualifications
- Experience supporting AI/ML applications or model deployment pipelines.
- Exposure to LLM applications, vector databases, RAG pipelines, or generative AI infrastructure.
- Experience with GPU-based workloads or AI inference infrastructure.
- Familiarity with tools such as MLflow, Kubeflow, SageMaker, Vertex AI, Azure ML, Airflow, or Argo Workflows.
- Experience with Helm, service mesh, or Kubernetes operators.
- Knowledge of DevSecOps practices and cloud security controls.
- Cloud, Kubernetes, or DevOps certifications are a plus.
Required Technical Skills
Cloud Platforms: AWS, Azure, GCP
Containers & Orchestration: Docker, Kubernetes, Helm
Infrastructure as Code: Terraform, Ansible, CloudFormation
CI/CD: GitHub Actions, Jenkins, Azure DevOps
Scripting: Python, Bash, PowerShell
Monitoring & Logging: Prometheus, Grafana, ELK Stack, Datadog, New Relic, CloudWatch
MLOps / AI Tools: MLflow, Kubeflow, SageMaker, Vertex AI, Azure ML, Airflow
Security: IAM, secrets management, vulnerability scanning, DevSecOps
Operating Systems: Linux, Unix-based systems