Site Reliability Engineer (SRE) - Airline Sciences (DMA)
Designation: Developer - Cloud SRE & DevOps
Key Responsibilities:
● Design, implement, and maintain robust and scalable infrastructure on AWS to support our microservices-based applications and REST APIs.
● Work collaboratively with DevOps Engineers on CI/CD pipelines for automated deployment, testing, and rollback of services.
● Monitor system performance, availability, and reliability using APM tools, with a preference for Kibana-based solutions, and establish effective alerting mechanisms.
● Proactively identify potential issues and bottlenecks through log analysis, performance metrics, and synthetic monitoring; implement preventative measures.
● Troubleshoot and resolve complex production incidents, performing root cause analysis (RCA) and implementing long-term solutions.
● Manage and optimize database performance, reliability, and scalability.
● Configure and maintain network infrastructure, including load balancers, firewalls, and proxies, ensuring secure and efficient traffic flow.
● Champion and implement infrastructure-as-code (IaC) practices.
● Containerize applications with Docker, and manage container orchestration and registries.
● Collaborate closely with development teams to define service level objectives (SLOs), service level indicators (SLIs), and error budgets.
● Develop and maintain comprehensive documentation for system architecture, configurations, and operational procedures.
● Drive automation initiatives to reduce manual effort and improve system resilience.
● Contribute to capacity planning and performance tuning efforts.
Required Qualifications:
● Bachelor's degree in Computer Science, Engineering, or a related technical field.
● 4-6 years of experience in Site Reliability Engineering, DevOps, or a similar role.
● Proven hands-on experience with Amazon Web Services (AWS), including services like EC2, S3, RDS, VPC, IAM (Identity and Access Management), Lambda, and EKS/ECS.
● Strong understanding and practical experience with microservices architecture and REST API principles.
● Proficiency in managing and troubleshooting relational and NoSQL databases (e.g., PostgreSQL, MySQL, MongoDB, Cassandra).
● Solid knowledge of networking concepts (TCP/IP, DNS, HTTP/S, VPNs) and experience with proxies (e.g., Nginx, HAProxy).
● Demonstrable experience with monitoring, logging, and alerting systems; experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for APM and observability is strongly preferred.
● Hands-on experience with Docker containerization and orchestration (e.g., Kubernetes, Docker Swarm).
● Proficiency in at least one scripting language (e.g., Python, Bash, Go).
● Experience with CI/CD tools (e.g., Jenkins, GitLab CI, AWS CodePipeline).
● Strong analytical and problem-solving skills with a proactive approach to identifying and resolving issues.
● Excellent communication and collaboration skills.