Job Title: DevOps Engineer II – GCP & AI
Location: Pune, India (Hybrid) / Remote
Employment Type: Full-Time
Experience Level: 5+ Years
Notice Period: Immediate Joiners Preferred
About the Opportunity
We are hiring for a leading product-based organization seeking an experienced DevOps Engineer II to design, build, and scale cloud infrastructure on Google Cloud Platform (GCP). The role focuses strongly on AI workloads and LLMOps and offers the opportunity to work on advanced, high-impact systems.
Role Overview
You will be part of a team responsible for maintaining, building, and scaling infrastructure that supports both production workloads and internal engineering teams. This role requires strong ownership, architectural decision-making, and a focus on reliability, security, and cost optimization.
Key Responsibilities
• Own and maintain production infrastructure with high availability and monitoring
• Build and manage scalable GCP infrastructure using Terraform/Ansible
• Deploy and manage AI/LLM workloads and vector databases
• Design and maintain secure CI/CD pipelines for services and AI systems
• Implement security best practices (IAM, secrets, vulnerability scanning)
• Set up observability for system and AI metrics (latency, usage, cost)
• Optimize cloud costs, especially for GPU/TPU workloads
• Manage cross-cloud connectivity between AWS and GCP
Required Qualifications
• 6+ years of experience in DevOps, SRE, or cloud infrastructure
• Expert-level hands-on experience with GCP services: GKE, Cloud Run, Vertex AI, GCS, Network Endpoint Groups (NEG)
• Experience deploying and managing vector databases and LLM APIs
• Strong expertise in Docker, Kubernetes, Terraform, and Ansible
• Proficiency in Python and Bash scripting
• Strong understanding of Linux/Unix systems (preferably Ubuntu)
• Solid knowledge of networking, security architectures, and IAM principles
• Experience building highly scalable, fault-tolerant, and cost-efficient systems
Preferred Skills
• Experience in LLMOps and AI infrastructure scaling
• Familiarity with observability tools such as Prometheus, Grafana, or OpenTelemetry
• Strong debugging and performance optimization skills
• Automation-first mindset with the ability to work with complex distributed systems
Soft Skills
• Excellent communication and documentation skills
• Strong analytical and problem-solving abilities
• Quick learner with the ability to mentor junior engineers
Interested candidates can share their updated resume along with their availability for an immediate discussion.