Key Responsibilities:
- Execute the Application Development Modernisation initiative to improve the overall quality, security, and speed of application delivery in a heavily outsourced application development environment by promoting and accelerating the adoption of modern application delivery practices such as:
  - CI/CD
  - DevSecOps
  - Shift-left security testing
  - Site Reliability Engineering (SRE)
  - Observability
- Design and implement SRE practices to enhance system resilience and drive operational excellence, including:
  - Establishing SLIs/SLOs
  - Managing error budgets
  - Building reliability frameworks
- Develop comprehensive observability strategies that use modern tooling to improve system visibility and streamline troubleshooting, incorporating:
  - Metrics
  - Traces
  - Logs
- Establish and maintain observability best practices, including:
  - Playbooks and templates
  - Distributed tracing implementation
  - Consistent metrics collection across applications
- Design monitoring solutions, automated alerts, and dashboards to provide real-time insights into application health and performance.
What You'll Bring to the Team:
- Degree or Diploma in Computer Science, Computer or Electronics Engineering, Information Technology, or related disciplines
- Minimum 1 year of experience with CI/CD
- Hands-on experience with enterprise observability platforms, preferably ELK Stack or Dynatrace
- Familiarity with monitoring tools such as Prometheus and Grafana
- Hands-on experience with distributed tracing systems (e.g., AWS X-Ray) and log aggregation tools
- Experience defining and implementing SLIs, SLOs, and error budgets
- Skilled in designing and implementing alerting strategies and creating dashboards
- Experience with Real User Monitoring (RUM) and synthetic monitoring
- Strong problem-solving and troubleshooting skills
- Results- and customer-oriented, with strong multitasking capabilities
- Excellent written and verbal communication, presentation, and negotiation skills
- Experience conducting post-mortem analysis and implementing reliability improvements
Bonus Points For:
- Experience with modern tech stacks or platforms
- Experience with public cloud providers such as AWS, Azure, or Google Cloud
- Experience with Atlassian Jira and Confluence
- Proficiency in scripting languages such as Python, Bash, or PowerShell
- Experience with containerized platforms such as Docker or Kubernetes
- Experience in infrastructure automation using Ansible and/or AWS Systems Manager