DevOps Engineer

Tekgence Inc • Contract • San Jose, CA, US • 2d ago

Job Title: DevOps Engineer

Location: San Jose, California

Work Arrangement: On-site 5 days per week

Duration: Long-term engagement of 1 year or more with quarterly renewals

Job Summary

World Wide Technology is seeking an experienced DevOps Engineer to support cloud-based collaboration platforms within the Cloud Collaboration Technology Group. This role focuses on operating and scaling observability platforms, Kubernetes environments, and automated deployment pipelines that support large-scale distributed systems. The ideal candidate brings deep production experience, strong operational discipline, and a passion for reliability and automation.

Key Responsibilities

• Design, develop, and operate observability platforms including logging, metrics, and tracing for Webex microservices

• Manage, operate, and optimize Kubernetes clusters across multi-region environments

• Own and maintain continuous integration and continuous delivery pipelines using Argo CD and Helm

• Implement and manage infrastructure as code using Terraform on Amazon Web Services

• Operate monitoring and logging ecosystems including OpenSearch or ELK, Prometheus, Grafana, Splunk, and Kafka

• Build automation to proactively detect and remediate production issues

• Ensure security compliance through vulnerability patching and platform hardening

• Collaborate with application, platform, and security teams to improve service reliability

• Participate in on-call rotations and lead incident response activities

• Contribute to distributed system architecture, design reviews, and operational best practices

Required Qualifications

• Bachelor’s degree in Computer Science or a related technical field

• Eight or more years of experience in a DevOps, Site Reliability Engineering, or platform engineering role

• Kubernetes and container orchestration experience, including operating large-scale production environments with hundreds of deployments and thousands of pods

• Hands-on experience with Helm chart management, pod scheduling, resource tuning, and multi-cluster operations

• Observability stack expertise with at least two of the following platforms: OpenSearch or Elasticsearch, Prometheus or Mimir, Grafana, Loki, Splunk, or Logstash

• Experience designing ingestion pipelines, optimizing queries, and planning capacity for high-volume telemetry systems

• Strong proficiency with infrastructure as code and automation tools such as Terraform or Ansible on Amazon Web Services

• Working proficiency in Python, Golang, or Bash for automation, tooling, and pipeline integrations

• Demonstrated experience supporting 24x7 production environments, including alert triage, incident leadership, post-incident reviews, and service-level accountability

Preferred Qualifications

• Experience operating or tuning Apache Kafka or Amazon MSK clusters at scale, including topic configuration, access control, consumer lag monitoring, and schema management

• Splunk administration experience, including deployment, forwarder management, index lifecycle, and Kubernetes-based log ingestion

• Experience implementing OpenTelemetry for distributed tracing and application performance monitoring

• Familiarity with security and compliance frameworks such as FedRAMP, STIG, IL5, ISO 27001, or SOC 2

• Experience with container image hardening and vulnerability scanning tools such as Anchore or Grype

• Exposure to AI-augmented operations using large language models, coding assistants, or automation agents to improve operational workflows

• Experience with GitOps deployment models using Argo CD, Helm bundles, and progressive delivery strategies such as blue-green or canary releases

• Demonstrated experience with cloud cost optimization, capacity planning, and data retention strategies for logging and metrics platforms

About the Team

This team builds cloud-based collaboration solutions with a strong focus on scalability, reliability, and operational excellence. The team partners closely with engineering, security, and operations groups to deliver highly available platforms that support global collaboration services.