Position Summary
Our client is building a modern, cloud-native platform that powers connected, data-driven manufacturing operations. Their technology sits at the center of increasingly automated factories, integrating equipment, software systems, and real-time production data into a scalable SaaS platform used by global manufacturers.
To support rapid growth and platform scale, they are seeking a Senior Cloud Operations Engineer to own the reliability, performance, and operational excellence of their cloud infrastructure. This is a high-impact role responsible for keeping the platform highly available, secure, and scalable as adoption grows.
This position is ideal for engineers who thrive in modern cloud environments, enjoy solving complex reliability challenges, and prefer to automate wherever possible. The right person will combine deep technical expertise with strong operational discipline to help build a world-class cloud platform that supports real industrial environments.
Key Responsibilities
Cloud Operations & Reliability
• Maintain and optimize production, staging, and development environments running in Kubernetes on AWS
• Implement and manage monitoring, logging, alerting, and observability frameworks
• Lead incident response efforts and drive post-incident reviews focused on continuous improvement
• Own backup, disaster recovery, and business continuity processes
• Perform system capacity planning and performance tuning
Automation & Infrastructure Management
• Build and maintain Infrastructure-as-Code using tools such as Terraform or Pulumi
• Automate provisioning, configuration management, and environment lifecycle processes
• Identify and eliminate operational inefficiencies through automation
• Manage secrets, environment configuration, and version control across infrastructure environments
Security & Compliance
• Implement and maintain least-privilege access models and cloud security guardrails
• Support vulnerability management, patching workflows, and dependency maintenance
• Assist with compliance readiness efforts for frameworks such as SOC 2 and ISO 27001
• Ensure proper logging, retention, and audit practices across cloud environments
FinOps / Cost Optimization
• Monitor and optimize cloud spend across services and environments
• Implement tagging standards, budget alerts, and cost visibility frameworks
• Recommend architectural improvements to balance performance and cost efficiency
Collaboration & Leadership
• Partner closely with engineering teams to improve reliability, deployment pipelines, and system architecture
• Mentor engineers on operational best practices and cloud platform management
• Develop runbooks, documentation, and operational standards
• Champion reliability engineering principles, operational maturity, and risk reduction practices
Technical Environment
Candidates should be comfortable working in modern cloud-native environments and familiar with:
• Kubernetes clusters, autoscaling, Helm charts, and service mesh concepts
• AWS cloud services including compute, networking, storage, and cost management
• Infrastructure-as-Code frameworks such as Terraform
• Observability platforms such as Datadog, CloudWatch, Prometheus, or New Relic
• CI/CD tools such as GitHub Actions, Bitbucket Pipelines, or Bamboo
• Linux systems administration and troubleshooting
• SRE practices including SLIs, SLOs, MTTR, RTO/RPO, and incident management