Cloud Engineer – Observability and SRE (Grade 10)
Bay Area, CA - onsite role
Max pay rate: $65/hr W-2 + benefits
7-month initial duration
Position Summary
The Grade 10 Cloud Engineer within the Customer’s Cloud Collaboration Technology Group will play a key role in building and operating scalable observability and infrastructure platforms supporting Webex microservices. This role requires strong hands-on expertise in Kubernetes, cloud infrastructure, and observability systems, along with the ability to operate independently and to own components end-to-end in production environments. Candidates will demonstrate extensive use of generative AI tools for code generation and production system troubleshooting.
Key Responsibilities
• Design, develop, and operate observability platforms covering logging, metrics, and/or tracing for Webex microservices.
• Manage and optimize Kubernetes clusters across multi-region environments.
• Own CI/CD pipelines using Argo CD and Helm.
• Implement infrastructure as code (IaC) using Terraform on AWS.
• Operate monitoring ecosystems, including but not limited to:
o OpenSearch/ELK,
o Prometheus,
o Grafana,
o Splunk, and
o Kafka.
• Build automation to detect and remediate production issues.
• Ensure security compliance through vulnerability patching.
• Collaborate cross-functionally to improve reliability.
• Participate in on-call rotations and incident response.
• Contribute to distributed system design and operations.
Required Skills
General Abilities
• Bachelor’s degree in computer science or related field
General Technical Skills
• At least eight (8) years of experience in a DevOps and/or SRE platform engineering role
• Incident response and on-call operations: Demonstrated experience in a 24/7 production environment, including but not limited to:
o Triaging alerts
o Leading incident response
o Writing post-incident reviews
o Maintaining SLA commitments across large-scale distributed systems
• IaC and automation: Proficiency with Terraform, Ansible, and/or equivalent IaC tooling for provisioning and managing cloud infrastructure at scale on AWS
• Scripting and development: Working proficiency in Python, Golang, and/or Bash for building automation scripts, operational tooling, and/or CI/CD pipeline integrations (e.g., Drone, GitHub Actions, Argo CD)
Specific Technical Skills
• Kubernetes and container orchestration: Production experience operating and troubleshooting workloads on Kubernetes at large scale (i.e., hundreds of deployments and thousands of pods), including but not limited to:
o Helm chart management
o Pod scheduling
o Resource tuning
o Multi-cluster operations
• Observability stack expertise: Hands-on experience with pipeline design, query optimization, and/or capacity planning for high-volume environments, using at least two (2) of the following:
o OpenSearch/Elasticsearch
o Prometheus/Mimir
o Grafana
o Loki
o Splunk
o Logstash
Desired Skills
• Apache Kafka/AWS MSK: Experience in at least one (1) of the following:
o Operating or tuning Kafka clusters at scale
o Managing topic configurations, ACLs, consumer lag, and/or schema registries across high-throughput streaming pipelines
• Splunk administration: Experience deploying, managing, and/or migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures, including but not limited to:
o Forwarder management,
o Search optimization,
o Index lifecycle, and/or
o Integration
• OpenTelemetry and distributed tracing: Experience deploying OpenTelemetry for data collection and application performance monitoring
• Security frameworks and container hardening: Familiarity with at least one (1) of the following (for vulnerability remediation at scale):
o Government or industry security certification standards (e.g., FedRAMP, STIG, IL5, ISO 27001, SOC 2)
o Container image hardening practices
o Security scanning tools (e.g., Anchore, Grype)
• AI-augmented operations: Experience using LLMs, AI coding assistants, and/or custom AI agents (e.g., MCP servers, Copilot, Claude) to:
o Accelerate engineering workflows,
o Automate runbooks, and/or
o Assist with incident triage
• Deployment pipelines (Argo CD/Helm bundles): Experience with at least one (1) of the following across multi-region clusters:
o GitOps-style deployment workflows
o Argo CD application management
o Helm bundle patterns
o Blue/green or canary release strategies
• Cost optimization and capacity planning: Experience in at least one (1) of the following in large-scale logging and/or metrics platforms:
o Right-sizing cloud resources
o Analyzing spending across AWS services
o Optimizing data retention policies (ISM/ILM)
o Reducing storage costs