Job Title: Senior DevOps Engineer / Platform Engineer
Location: Santa Clara
Duration: Long Term
Job Summary
We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.
This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model into a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.
This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.
Key Responsibilities
Platform Deployment & CI/CD
• Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
• Build and maintain deployment workflows that support safe and seamless promotion across environments.
• Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.
• Establish baseline deployment mechanisms for the site-builder application and related services.
• Standardize Kubernetes application packaging and deployment patterns, with a strong preference for Helm-based lifecycle management of complex services and third-party components.
• Migrate existing deployments to Helm charts where appropriate.
Kubernetes & Runtime Platform Engineering
• Support the deployment and ongoing operation of services running in Kubernetes.
• Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
• Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
• Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.
Infrastructure as Code & Cloud Services
• Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
• Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.
• Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.
• Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.
Data Services, Snapshots & Developer Enablement
• Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
• Build tooling and operational processes for:
◦ production and staging database snapshots,
◦ restoring snapshots into development environments,
◦ enabling local debugging and development from realistic data states.
• Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.
• Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.
Workflow Orchestration & Temporal Support
• Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
• Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
• Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
• Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.
Observability, Reliability & Incident Readiness
• Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.
• Define and implement monitoring for:
◦ service and cluster utilization,
◦ CPU, memory, and storage,
◦ IOPS / throughput metrics,
◦ database connections and session counts,
◦ cache hit / miss / coverage metrics,
◦ RDS and MongoDB utilization,
◦ service health and alerting.
• Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.
• Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.
Security, Access & Secrets Management
• Maintain secrets management processes across environments.
• Build tooling for short-lived internal token generation and long-lived secret rotation.
• Support secure access from deployed services to active production devices and southbound systems.
• Help establish credential management patterns for southbound integrations and device-facing access.
• Partner with related teams to define safe operational limits and controls for service integrations.
External Integrations & Platform Support
• Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.
• Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.
• Support staging and testing by enabling virtual device environments where needed.
• Contribute to end-to-end acceptance testing and production readiness activities.
Operating Model & Cross-Functional Execution
• Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.
• Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.
• Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.
Required Qualifications
• Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
• 7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.
• Strong hands-on experience with Kubernetes in production environments.
• Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.
• Strong experience with Argo CD, GitOps workflows, or equivalent deployment tooling.
• Strong experience with Helm and Kubernetes package/deployment lifecycle management.
• Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.
• Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.
• Experience with Prometheus, Grafana, and modern observability practices.
• Experience with Redis/cache services, secrets management, and operational debugging.
• Strong Linux, networking, and distributed systems troubleshooting skills.
• Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.
• Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.
Preferred Qualifications
• Experience with Temporal deployment and production operations.
• Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.
• Experience with MongoDB / DocumentDB operations and restore workflows.
• Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.
• Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.
• Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.
• Familiarity with network automation or infrastructure orchestration platforms is a plus.
What Success Looks Like
• CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.
• Kubernetes deployments are standardized, maintainable, and production-ready.
• Managed infrastructure is defined as code rather than through manual setup.
• Temporal, databases, cache layers, and observability tooling are stable and supportable.
• Development teams can reproduce realistic environments locally for faster debugging and delivery.
• Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.
• The DevOps operating model is clearly defined and enables faster deployments with less operational risk.
Scope Notes
In scope
• CI/CD and deployment foundations
• Kubernetes packaging and release management
• RDS, MongoDB, Redis/cache services
• Temporal platform setup and operational support
• Observability, alerting, and debugging tooling
• Secrets management and access enablement
• Infrastructure as Code and environment reproducibility
• DevOps / Development operating model definition
Candidate Profile
The ideal candidate is a builder-operator: someone who can establish engineering discipline where manual patterns currently exist, create durable automation for platform operations, and raise the overall maturity of the product’s deployment and runtime ecosystem. This person should be equally comfortable discussing deployment architecture, writing IaC and Helm code, troubleshooting Kubernetes runtime issues, and defining how DevOps and software engineering teams work together over the full product lifecycle.