Job Role: DevOps Engineer – AI Infrastructure & Platforms
Location: Remote, USA (Ideally located in Palo Alto / Bay Area)
Job Type: W2 Contract, 1 Year (possible conversion to permanent within the first year)
Description:
We are seeking a Senior DevOps Engineer to join the Salesforce AI Research Incubation Team. In this role, you will be responsible for designing, implementing, and maintaining cloud infrastructure and CI/CD pipelines to support AI research and development. You will ensure the reliability, scalability, and security of our AI-driven applications through automation, containerization, and infrastructure as code (IaC).
The ideal candidate has extensive experience with AWS and GCP (DNS, VMs, Kubernetes, networking, firewalls) and strong expertise in CI/CD, Docker, Kubernetes, Helm, Terraform, Python, and shell scripting.
Key Responsibilities
- Design, implement, and manage cloud infrastructure (AWS, GCP) including networking, security, and compute resources.
- Develop and maintain CI/CD pipelines to automate deployment and testing of AI models and applications.
- Build, manage, and optimize Kubernetes clusters for deploying AI services and research applications.
- Implement infrastructure as code (IaC) using Terraform and Helm to ensure repeatable and scalable deployments.
- Automate system operations and monitoring using Python and shell scripting.
- Ensure security best practices across cloud environments, including firewall and access control management.
- Troubleshoot infrastructure issues and optimize system performance.
- Collaborate with AI researchers and software engineers to streamline model deployment and integration.
- Manage databases (SQL and NoSQL), including provisioning, performance tuning, and backup strategies.
- Ensure database security, replication, and high availability across cloud environments.
Required Qualifications
- Bachelor’s degree in Computer Science, Software Engineering, or a related field.
- Experience with AI/ML model deployment and pipeline automation.
- 3+ years of experience in DevOps, cloud infrastructure, or site reliability engineering.
- Strong experience with AWS and GCP, including DNS, VM management, networking, Kubernetes, and firewall security.
- Proficiency in CI/CD pipeline development and automation (GitHub Actions, Jenkins, GitLab CI/CD, etc.).
- Expertise in Docker, Kubernetes, and Helm for container orchestration and deployment.
- Hands-on experience with Terraform for infrastructure provisioning and management.
- Strong Python and shell scripting skills for automation.
- Solid understanding of networking, security best practices, and cloud monitoring tools.
- Excellent troubleshooting and problem-solving skills.
Preferred Qualifications
- Knowledge of logging and monitoring tools (Prometheus, Grafana, ELK stack, etc.).
- Familiarity with serverless computing and cloud-native application design.
- Contributions to open-source DevOps tools or frameworks.
- Experience with Salesforce Falcon is a plus.
**Similar industries to target beyond AI/ML:**
Anyone who's operated high-throughput data pipelines at scale will translate well:
• High-frequency trading / fintech — real-time data pipelines, low-latency infrastructure, rigorous observability
• Ad tech — real-time bidding, event processing, attribution pipelines at massive scale
• Video/audio streaming infrastructure — transcoding, CDN orchestration, bursty GPU/CPU workloads (especially relevant given our voice AI layer)
• Genomics / bioinformatics — batch + real-time hybrid workloads with large dataset processing
• IoT / telemetry platforms — high-volume ingestion, time-series data, queue-based architectures (similar to our SQS/worker pattern)
• Gaming backends (MMO / live-service) — distributed systems under unpredictable, bursty load
The common thread: experience architecting systems that move, process, and serve large volumes of data with tight reliability and latency requirements.
**Top 3 technical execution responsibilities:**
1. Design, build, and maintain CI/CD pipelines and deployment infrastructure — Own the full path from code commit to production on Falcon (K8s). This includes environment promotion across dev/staging/prod, rollback strategies, release automation, and managing the everse-api / everse-worker / everse-ui service group. For an AI team, this also means handling model artifact versioning and deployment.
2. Architect and operate the observability stack: metrics, structured logging, alerting, dashboards, and on-call runbooks. eVerse runs evaluation jobs and training cycles that can be long-running and resource-intensive, so this person needs to instrument job health, queue depth (SQS), worker utilization, API latency, database performance (Postgres/Redis), and S3 throughput. In effect, they are building the nervous system of the platform.
3. Own infrastructure reliability, scaling, and cost optimization — Capacity planning, autoscaling policies, incident response, and keeping cloud spend rational. eVerse workloads are bursty by nature (evaluation runs spike, training jobs are GPU-heavy), so the infrastructure needs to flex without breaking or burning money.
**What makes this role senior:**
Three things, none of which are people management:
1. Autonomy & ownership — This is a small research-to-production team. There's no one above them designing the infrastructure. They're making architectural decisions (queue topology, database scaling strategy, service mesh configuration, security posture) that the team will live with for years.
2. Technical judgment under ambiguity — Research workloads are inherently unpredictable. Requirements shift as experiments succeed or fail. They need to make sound trade-offs between speed, reliability, and cost without a playbook, and adapt infrastructure as the product evolves.
3. Cross-functional influence — They'll sit in strategic discussions because infrastructure capabilities directly shape what's feasible. When the team asks "can we run 10x more evaluation jobs next quarter?" or "can we support real-time voice AI pipelines?", this person needs to translate between research ambitions and operational reality — and push back when something won't scale.