Description
• Proven experience in managing and deploying applications across Google Cloud Platform
(GCP).
• Hands-on experience with Infrastructure as Code tools such as Terraform and Terragrunt to
automate the provisioning, configuration, and management of cloud resources.
• Strong knowledge of containerisation technologies like Docker for building, packaging, and
distributing applications.
• Experience with Kubernetes container orchestration, including cluster configuration and scaling.
Proficiency in Helm charts for managing deployments and configurations is highly desirable.
• Proficiency in scripting languages such as Python or Bash to automate repetitive tasks,
streamline workflows, and develop custom scripts.
• Experience designing, implementing, and maintaining CI/CD pipelines using GitLab CI/CD or
similar tools.
• Hands-on experience with monitoring and observability tools such as Datadog, Prometheus, and
Grafana to ensure system reliability, performance, and availability.
Job responsibilities
Infrastructure
• Design, implement, and maintain Infrastructure as Code using Terraform and Terragrunt to
ensure consistent, scalable, and repeatable cloud environments across GCP, Azure, and AWS.
• Develop robust CI/CD pipelines using GitLab to automate application build, test, and deployment
processes, ensuring fast and reliable delivery of software updates.
• Continuously optimise infrastructure for cost efficiency, scalability, and high availability by
analysing usage patterns, identifying bottlenecks, and implementing improvements.
Monitoring
• Establish comprehensive monitoring, logging, and alerting solutions using Datadog to ensure
high system visibility, rapid issue resolution, and proactive performance management.
• Create and maintain intuitive dashboards and data visualisations using Datadog to provide
real-time insights into platform health, resource utilisation, and key performance metrics.
• Leverage Prometheus for collecting and querying time-series data, enabling detailed analysis of
system behaviour and custom metric tracking for critical components.
Collaboration
• Partner with product owners, engineers, and cross-functional teams to align DevOps strategies
with business objectives, ensuring seamless integration of workflows.
• Provide guidance and mentorship to developers and teams on DevOps principles, cloud
architecture design, containerisation, and automation tools.
• Lead troubleshooting efforts and root cause analysis for infrastructure or application-related
incidents, documenting findings and implementing preventative measures to avoid recurrence.