Overview Of Position
As an SRE/DevOps Engineer, you will play a key role in setting up infrastructure and CI/CD, and in supporting key projects used globally.
As a member of our geographically distributed development team, you will find that strong communication and analytical skills are essential to the role.
Key Responsibilities
· Design cloud infrastructure that is secure, scalable, and highly available on AWS/Azure
· Work collaboratively with software engineering to define infrastructure and deployment requirements
· Provision, configure and maintain AWS cloud infrastructure defined as code.
· Containerize applications using Docker and Kubernetes
· Troubleshoot problems across a wide array of services and functional areas
· Build and maintain operational tools for deployment, monitoring, and analysis of AWS infrastructure and systems
· Perform infrastructure cost analysis and optimization.
· Develop self-healing and automated remediation mechanisms using AI/ML techniques
· Integrate AI/LLM capabilities into DevOps workflows (e.g., log analysis, automated RCA, deployment insights)
· Enhance monitoring strategy by leveraging intelligent alerting, noise reduction, and pattern-based anomaly detection across logs, metrics, and traces.
· Build and manage MLOps pipelines for model training, deployment, and continuous improvement.
· Collaborate with a global team of engineers in a highly agile DevOps environment, focused on efficient operation of daily activities, developer productivity and continuous improvement of the framework.
· Develop, implement, and maintain CI/CD frameworks and tooling that support hybrid environments (cloud and on-premises), with a vision of achieving “CI/CD” objectives for large-scale systems integration and reducing manual build and deploy effort.
· Work with geographically dispersed, multi-vendor teams organized into Scrum teams to meet “CI/CD” objectives
Required Knowledge & Skills
· 6-8 years of experience building and maintaining AWS infrastructure (VPC, EC2, Security Groups, IAM, ECS/EKS, CloudFront, S3, RDS, SQS, SNS, Lambda, Batch, Glue)
· Strong understanding of how to secure AWS environments and meet compliance requirements
· Hands-on experience deploying and managing infrastructure with Terraform Enterprise.
· Hands-on experience with or working knowledge of LLMs (OpenAI, Azure OpenAI, Claude, etc.)
· Understanding of LLMOps concepts (prompt management, model evaluation, versioning, fine-tuning lifecycle)
· Experience with MLOps tools such as MLflow, SageMaker, Kubeflow or equivalent.
· Familiarity with AIOps platforms/tools for intelligent monitoring and incident management.
· Ability to apply AI techniques to improve deployment speed, reliability, and monitoring effectiveness.
· Strong experience working in Windows- and Linux-based environments.
· Experience with Docker, GitHub, Jenkins, Azure DevOps, ELK and deploying applications on AWS.
· Good command of scripting languages such as Python, Bash/Shell, and PowerShell.
· Knowledge of log analytics tools such as Elasticsearch and Kibana.
· Knowledge of cloud migration, disaster recovery, and blue-green deployment implementation.
· Good understanding of service monitoring and alerting using CloudWatch, Datadog, Prometheus, or Azure Monitor.
· A relevant certification is a plus.
Personal Attributes
· Very good communication skills.
· Ability to easily fit into a distributed development team.
· Customer-service oriented.
· Enthusiastic, with high initiative.
· Ability to manage timelines of multiple initiatives.
· Very good attention to detail and consistent follow-through.