Job Overview:
We are looking for a skilled Site Reliability Engineer (SRE) to join our team. The SRE will play a critical role in maintaining the reliability, performance, and scalability of our services. This role involves working with cloud platforms such as AWS, Azure, and Oracle Cloud Infrastructure (OCI), managing Ubuntu-based systems, and keeping our infrastructure running smoothly. The ideal candidate will have a strong background in system administration, cloud technologies, and modern DevOps practices.
Key Responsibilities:
● Infrastructure Management:
○ Design, implement, and manage scalable, resilient, and secure infrastructure on cloud providers such as AWS, Azure, and Oracle.
○ Oversee the administration of Ubuntu servers, ensuring optimal performance and uptime.
● Automation and Monitoring:
○ Implement monitoring and alerting systems to proactively identify and resolve issues before they impact users.
○ Automate repetitive tasks to improve system reliability and operational efficiency.
● Containerization and Orchestration:
○ Deploy and manage containerized applications using Docker.
○ Use Kubernetes for container orchestration to deploy and scale applications efficiently and reliably.
● Performance Optimization:
○ Analyze system performance metrics and optimize infrastructure to meet performance targets.
○ Troubleshoot and resolve issues related to server performance, network latency, and other system bottlenecks.
● Collaboration and Support:
○ Work closely with development teams to ensure new applications and features are designed with reliability and scalability in mind.
○ Provide guidance and mentorship to junior engineers on best practices for system reliability and cloud management.
○ Participate in on-call rotations to provide 24/7 support for critical issues.
● Security and Compliance:
○ Implement security best practices across all infrastructure components, including firewalls, VPNs, and access controls.
○ Ensure compliance with industry standards and internal policies for data protection and privacy.
Technical Skills:
● Proven experience with cloud providers: AWS, Azure, and Oracle.
● Strong proficiency in managing and troubleshooting Ubuntu operating systems.
● Hands-on experience with Nginx, Kubernetes, and Docker.
● Familiarity with scripting languages (e.g., Bash, Python) for automation tasks.
● Experience with CI/CD pipelines and tools like Jenkins, GitLab CI, or equivalent.
● Knowledge of networking fundamentals and security best practices.
Professional Experience and Skills:
● 2+ years of experience as a Site Reliability Engineer or in a similar role.
● Excellent problem-solving skills and attention to detail.
● Strong communication skills, with the ability to collaborate effectively with cross-functional teams.
● Self-motivated, with the ability to work both independently and as part of a team.