Cloud DevOps Engineer

Pinnacle Group, Inc. • Contract • San Francisco Bay Area, US • 23h ago

Cloud Engineer – Observability and SRE (Grade 10)

Bay Area, CA - onsite role

Max pay rate: $65/hr W2 + benefits

7-month initial duration

Position Summary

The Grade 10 Cloud Engineer within the Customer’s Cloud Collaboration Technology Group will play a key role in building and operating scalable observability and infrastructure platforms supporting Webex microservices. This role requires strong hands-on expertise in Kubernetes, cloud infrastructure, and observability systems, along with the ability to operate independently and to own components end-to-end in production environments. Candidates will demonstrate extensive use of generative AI tools for code generation and production system troubleshooting.

Key Responsibilities

• Design, develop, and operate observability platforms (logging, metrics, and/or tracing) for Webex microservices.

• Manage and optimize Kubernetes clusters across multi-region environments.

• Own CI/CD pipelines using Argo CD and Helm.

• Implement infrastructure as code (IaC) using Terraform on AWS.

• Operate monitoring ecosystems, including but not limited to:

o OpenSearch/ELK,

o Prometheus,

o Grafana,

o Splunk, and

o Kafka.

• Build automation to detect and remediate production issues.

• Ensure security compliance through vulnerability patching.

• Collaborate cross-functionally to improve reliability.

• Participate in on-call rotations and incident response.

• Contribute to distributed system design and operations.

Required Skills

General Abilities

• Bachelor’s degree in computer science or related field

General Technical Skills

• At least eight (8) years of experience in a DevOps and/or SRE platform engineering role

• Incident response and on-call operations: Demonstrated experience in a 24/7 production environment, including but not limited to:

o Triaging alerts

o Leading incident response

o Writing post-incident reviews

o Maintaining SLA commitments across large-scale distributed systems

• IaC and automation: Proficiency with Terraform, Ansible, and/or equivalent IaC tooling for provisioning and managing cloud infrastructure at scale on AWS

• Scripting and development: Working proficiency in Python, Golang, and/or Bash for building automation scripts, operational tooling, and/or CI/CD pipeline integrations (e.g., Drone, GitHub Actions, Argo CD)

Specific Technical Skills

• Kubernetes and container orchestration: Production experience operating and troubleshooting workloads on Kubernetes at large scale (i.e., hundreds of deployments and thousands of pods), including but not limited to:

o Helm chart management

o Pod scheduling

o Resource tuning

o Multi-cluster operations

• Observability stack expertise: Hands-on experience with at least two (2) of the following, covering pipeline design, query optimization, and/or capacity planning for high-volume environments:

o OpenSearch/Elasticsearch

o Prometheus/Mimir

o Grafana

o Loki

o Splunk

o Logstash

Desired Skills

• Apache Kafka/AWS MSK: Experience in at least one (1) of the following:

o Operating or tuning Kafka clusters at scale

o Managing the following across high-throughput streaming pipelines:

▪ Topic configurations,

▪ ACLs,

▪ Consumer lag, and/or

▪ Schema registries

• Splunk administration: Experience deploying, managing, and/or migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures, including but not limited to:

o Forwarder management,

o Search optimization,

o Index lifecycle, and/or

o Integration

• OpenTelemetry and distributed tracing: Experience deploying OpenTelemetry for data collection and application performance monitoring

• Security frameworks and container hardening: Familiarity with at least one (1) of the following (for vulnerability remediation at scale):

o Government or industry security certification standards; examples:

▪ FedRAMP

▪ STIG

▪ IL5

▪ ISO 27001

▪ SOC 2