Site Reliability Engineer (SRE)

Travtech • Full-time • India • 3w ago

We are looking for a Site Reliability Engineer to join our team and help us leverage data to drive business growth and innovation. You will be responsible for managing platform infrastructure and applications to improve the reliability, quality, and time-to-market of our suite of software solutions.

The right candidate must have excellent communication skills, be organized, possess advanced problem-solving skills, and have a record of success in technical engineering and in working with other teams at our offices.

BE in Computer Science or a related field. Over 3 years of proven experience as an SRE.

Responsibilities

Run the production environment by monitoring availability and taking a holistic view of system health.
Take a proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
Provide primary operational support and engineering for multiple large-
Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
Partner with development teams to improve services through rigorous testing and release procedures.
Participate in system design consulting, platform management, and capacity planning.
Create sustainable systems and services through automation and uplifts.
Balance feature development speed and reliability with well-defined service-level objectives.
Work hands-on with the AWS cloud computing platform.
Handle incident management and on-call support.

Requirements

Proven experience (over 3 years) as an SRE
Hands-on experience in:
AWS, Azure, or Google Cloud: Familiarity with cloud services such as (on AWS) EC2, S3, Lambda, CloudWatch, IAM, and VPC
Infrastructure as Code (IaC): Terraform, Ansible, Puppet, Chef
Monitoring and Logging: Prometheus & Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Datadog, New Relic
Containers and Orchestration: Docker, Kubernetes
CI/CD Tools: Jenkins, GitLab CI/CD
Version Control: Experience with version control systems such as Git, including GitHub or GitLab
Scripting and Automation: Bash, Python, Shell Scripting
Networking: TCP/IP, DNS, Load Balancing, CDNs, VPNs
Linux: Strong knowledge of Linux/Unix systems, including shell scripting and system administration
SQL/NoSQL Databases: Knowledge of database management systems like MySQL, PostgreSQL, MongoDB, or Redis
Incident Management: Experience handling on-call rotations, incident management, and root cause analysis
SSL/TLS, Firewalls: Understanding of security best practices, including SSL/TLS, firewalls, and encryption
IAM, RBAC: Identity and Access Management, Role-Based Access Control
Problem-Solving: Ability to troubleshoot complex issues under pressure
Communication: Clear communication skills for collaborating with development and operations teams