Job Summary
We are looking for a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. The ideal candidate will be responsible for ensuring the reliability, availability, and performance of our services. You will work closely with software engineering teams to build and maintain scalable and efficient systems.
Key Responsibilities
- Design, implement, and maintain scalable and reliable infrastructure.
- Monitor system performance and troubleshoot issues to maintain high availability.
- Collaborate with development teams to ensure that applications are designed with reliability and scalability in mind.
- Automate repetitive tasks to improve efficiency and reduce manual intervention.
- Develop and maintain tools for monitoring, logging, and alerting.
- Participate in on-call rotations and respond to incidents promptly.
- Conduct post-incident reviews and implement improvements to prevent recurrence.
- Ensure security best practices are followed in all aspects of system design and operation.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Proven experience as a Site Reliability Engineer or in a similar role.
- Strong knowledge of cloud platforms (e.g., AWS, Azure, Google Cloud).
- Proficiency in scripting languages (e.g., Python, Bash).
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
- Excellent problem-solving skills and attention to detail.
- Strong communication and collaboration skills.
Preferred Qualifications
- Knowledge of database management systems (e.g., MySQL, PostgreSQL).
- Understanding of networking concepts and protocols.