We are looking for a Site Reliability Engineer to join our team and help us leverage data to drive business growth and innovation. You will be responsible for managing platform infrastructure and applications to improve the reliability, quality, and time-to-market of our suite of software solutions.
The right candidate must have excellent communication skills, be organized, and possess advanced problem-solving skills, along with a record of success in technical engineering and in working with other teams at our offices.
BE in Computer Science or a related field. Over 3 years of proven experience as an SRE.
Responsibilities
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Take a proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Work hands-on with the AWS cloud computing platform.
- Handle incident management and on-call support.
Requirements
- Proven experience (over 3 years) as an SRE
- Hands-on experience in the following:
- AWS, Azure, or Google Cloud: Familiarity with cloud services like EC2, S3, Lambda, CloudWatch, IAM, VPC, etc.
- Infrastructure as Code (IaC): Terraform, Ansible, Puppet, Chef
- Monitoring and Logging: Prometheus & Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Datadog, New Relic
- Containers and Orchestration: Docker, Kubernetes
- CI/CD Tools: Jenkins, GitLab CI/CD
- Version Control: Experience with version control systems like Git, including GitHub or GitLab
- Scripting and Automation: Bash, Python, Shell Scripting
- Networking: TCP/IP, DNS, Load Balancing, CDNs, VPNs
- Linux: Strong knowledge of Linux/Unix systems, including shell scripting and system administration
- SQL/NoSQL Databases: Knowledge of database management systems like MySQL, PostgreSQL, MongoDB, or Redis
- Incident Management: Experience in handling on-call rotations, incident management, and root cause analysis
- Security (SSL/TLS, Firewalls): Understanding of security best practices, including SSL/TLS, firewalls, and encryption
- IAM, RBAC: Identity and Access Management, Role-Based Access Control
- Problem-Solving: Ability to troubleshoot complex issues under pressure
- Communication: Clear communication skills for collaborating with development and operations teams