Job description:
- Develop, test, and maintain high-quality software solutions, frameworks and automations.
- Collaborate with cross-functional teams to analyze requirements and design solutions around stability and reliability.
- Participate in code reviews to ensure code quality and shared knowledge.
- Identify, troubleshoot, and resolve incidents and problems. Ensure DevOps/SRE best practices are followed.
- Contribute to continuous improvement initiatives within the engineering team.
- Proficiency in one or more programming/scripting languages, such as Python.
- Solid understanding of Agile development methodologies.
- Willingness to work with operations and with incident and problem management.
- Good knowledge of at least one major cloud service provider: Microsoft Azure or GCP.
- Experience in building CI/CD workflows using GitHub Actions.
- Experience in setting up observability (application and infrastructure) using tools such as Splunk and Grafana.
- Familiarity with version control systems such as Git.
- Good problem-solving skills and eagerness to learn.
- Excellent communication and teamwork skills.
- Infrastructure Management: Design, build, and maintain scalable and reliable infrastructure. Optimize system performance and reliability by managing cloud or on-premises infrastructure.
- Incident Management: Lead incident response efforts to diagnose and resolve critical issues. Participate in the on-call rotation and develop runbooks for incident response.
- Automation and DevOps: Develop and implement automation tools and frameworks to reduce manual tasks and enhance system reliability. Advocate for DevOps best practices within the engineering team and implement CI/CD workflows.
- Performance Optimization: Analyze system performance metrics to identify bottlenecks and optimize system performance. Implement monitoring and alerting solutions to detect and resolve issues proactively.
- Security and Compliance: Ensure systems are secure and compliant with industry standards. Conduct security assessments and work with security teams to implement necessary controls.
- Continuous Improvement: Identify opportunities for process improvements and implement best practices for system reliability and performance. Collaborate with software engineers to enhance the reliability and availability of applications and services.
- Documentation and Knowledge Sharing: Create and maintain comprehensive documentation of systems, processes, and procedures. Share knowledge and mentor junior team members.
- Observability: Develop monitoring and alerting based on service level indicators and objectives (SLIs/SLOs) for applications and infrastructure.
Required cloud certification: AZ-900 (Microsoft Azure Fundamentals)
Start: Immediate
Location: Bangalore, India
Form of employment: Full-time, until further notice. A 6-month probationary period applies.