Primary skills required for the role
- Strong knowledge of Linux/Unix systems and command line tools.
- Proficiency in scripting languages such as Python, Shell, or Perl.
- Experience with configuration management tools like Ansible, Puppet, or Chef.
- Familiarity with cloud platforms like AWS, Azure, or Google Cloud.
- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
- Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.
- Experience with monitoring and logging tools such as Prometheus, Grafana, the ELK stack, or Splunk (optional, but good to know).
- Experience with Citrix technologies such as XenApp, XenDesktop, and NetScaler.
- Ability to support the administration and engineering of the Citrix environment.
- Experience working with Citrix Provisioning Server, SQL databases, and Citrix License Server.
- Working knowledge of virtualization technologies such as VMware or Hyper-V.
- Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Strong attention to detail and ability to work in a fast-paced, dynamic environment.
- Basic Terraform syntax and GitLab CI/CD configuration (pipelines and jobs).
- Cloud resource provisioning and configuration through the CLI or API.
- Ability to run basic queries in log tools to answer general questions.
- Linux operating system configuration, package management, startup, and troubleshooting.
- Block and object storage configuration.
- Networking: VPCs, proxies, and CDNs.
Secondary skills required for the role
- Bachelor's degree in computer science, engineering, or a related field.
- Proven experience as a Site Reliability Engineer or a similar role.
- Solid understanding of software development methodologies and DevOps principles.
- Experience with agile and iterative development processes.
- Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
- Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
- Experience with source control systems such as Git or SVN.
- Knowledge of security best practices and experience implementing security measures in a production environment.
- Ability to work independently and handle multiple projects and priorities simultaneously.
- Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.
Roles and responsibilities
- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.
- Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
- Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
- Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
- Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.
Objectives of this role
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Build software and systems to manage platform infrastructure and applications.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-scale distributed software applications.