Project Role : DevOps Engineer
Project Role Description : Responsible for building and setting up new development tools and infrastructure utilizing knowledge in continuous integration, delivery, and deployment (CI/CD), Cloud technologies, Container Orchestration and Security. Build and test end-to-end CI/CD pipelines, ensuring that systems are safe against security threats.
Must have skills : DevOps
Good to have skills : Site Reliability Engineering
Minimum 5 year(s) of experience is required
Educational Qualification : 15 years full time education
Summary:
The SRE Lead will be responsible for managing a team of engineers focused on software deployments and site reliability engineering practices. The role involves overseeing the deployment process for software applications and services; implementing automation, monitoring, and alerting tools; and ensuring the reliability, availability, and performance of critical systems and services. The SRE Lead will collaborate closely with development, operations, and other stakeholders to drive a culture of DevOps and SRE, aiming to improve system stability, scalability, and resilience.
Roles & Responsibilities:
- Leadership: Lead and mentor a team of engineers responsible for software deployments and SRE practices. Set clear expectations, provide coaching and feedback, and foster a collaborative and innovative team environment.
- Deployment Management: Implement and manage the deployment process for software applications and services, including monthly release management of AADL products, change management, and rollback procedures. Drive continuous improvement in deployment processes and tools to increase efficiency and minimize risk.
- Site Reliability Engineering: Implement best practices in site reliability engineering, including system monitoring, alerting, capacity planning, performance optimization, and incident management. Collaborate with development teams to ensure application architectures are resilient and scalable, and drive the adoption of DevOps and SRE principles and practices.
- Automation and Tooling: Evaluate, implement, and maintain relevant automation and tooling to streamline operational tasks, reduce manual effort, and improve system reliability. This may include configuration management, containerization, and orchestration technologies. Candidates should be well versed in blue-green and canary deployment models.
- Incident Management: Lead incident management efforts, including incident response, root cause analysis, and post-incident reviews. Collaborate with cross-functional teams to minimize impact and restore services as quickly as possible. Implement preventive measures to avoid future incidents and drive continuous improvement in incident management processes.
- Monitoring and Alerting: Implement and maintain effective system monitoring and alerting tools to proactively detect and resolve issues. Define and track key performance indicators (KPIs) and service level objectives (SLOs) to measure system reliability, performance, and availability.
- Collaboration: Collaborate closely with development, operations, security, network, and other stakeholders to ensure smooth operations and timely resolution of issues. Foster strong relationships and effective communication channels to promote collaboration and coordination.
- Documentation: Maintain comprehensive documentation of deployment processes, system configurations, procedures, and incident reports. Ensure documentation is up to date, accurate, and accessible to relevant stakeholders.
Professional & Technical Skills:
- Must-Have Skills: Proficiency in DevOps.
- Good-to-Have Skills: Experience with Site Reliability Engineering.
- Strong understanding of continuous integration, delivery, and deployment (CI/CD) principles.
- Experience with cloud technologies and container orchestration.
- Knowledge of security best practices and implementing security measures.
- Familiarity with automation tools and scripting languages.
- Experience with monitoring and logging tools.
- Ability to troubleshoot and resolve issues in a timely manner.
• Bachelor's degree in Computer Science, Information Technology, or related field.
• Minimum of 7 years of experience in software engineering, DevOps, deployments, or site reliability engineering.
• Strong technical skills in deployment processes and tools, such as release management, change management, and rollback procedures.
• Proficient in scripting and automation using tools like Python, Bash, or PowerShell.
• Solid understanding of DevOps principles, Agile methodologies, and ITIL practices.
• Strong technical skills in CI/CD tools and practices, such as Jenkins, Git, Docker, Kubernetes, and related technologies.
• Strong leadership skills with experience in managing and mentoring technical teams.
• Excellent problem-solving, analytical, and communication skills.
• Ability to work independently, prioritize tasks, and manage time effectively.
• Experience with incident management tools and processes, such as ITIL Incident Management, and familiarity with ITSM frameworks.
• In-depth knowledge of relational database management systems (RDBMS) such as Oracle, Microsoft SQL Server, MySQL, or PostgreSQL.
• Knowledge of cloud computing platforms, preferably AWS, is a plus.
• Relevant certifications, such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or Site Reliability Engineering (SRE) certifications, are desirable, as is Grafana expertise.
Additional Information:
- The candidate should have a minimum of 5 years of experience in DevOps.
- This position is based at our Gurugram office.
- 15 years of full-time education is required.