Job Description
Key Responsibilities
1. Multi-Cloud Infrastructure Architecture & Support (AWS-first)
- Design and maintain secure, resilient, and scalable infrastructure primarily on AWS, with optional support for Azure and GCP
- Architect multi-cloud deployments to support availability, compliance, and vendor flexibility
- Manage cloud resources such as EC2, S3, IAM, and VPC (AWS); Compute Engine (GCP); and Virtual Machine Scale Sets and VNets (Azure)
- Integrate cloud-native tools like AWS CloudTrail, Azure Monitor, and Google Cloud Logging
- Oversee resource tagging, infrastructure naming conventions, and cost allocation across environments
2. Infrastructure as Code (IaC) & Automation
- Build and manage infrastructure using Terraform (multi-cloud), AWS CDK, and/or Pulumi
- Standardize infrastructure provisioning across cloud platforms for repeatable and auditable deployments
- Automate the creation of environments (Dev, QA, Stage, Prod) across cloud accounts
- Leverage CI/CD pipelines for IaC validation, compliance checks, and testing
3. CI/CD & Release Engineering
- Design and manage CI/CD pipelines for various deployment targets
- Automate builds, tests, container packaging, and deployments across hybrid and multi-cloud platforms
- Enforce release engineering best practices such as gated releases, canary deployments, and rollback strategies
4. Monitoring, Observability & Alerting
- Implement centralized monitoring using tools like Datadog, Prometheus, Grafana, CloudWatch, Azure Monitor, and GCP Operations Suite
- Define and maintain service-level objectives (SLOs) and ensure systems meet performance benchmarks
- Manage incident detection and alerting pipelines to reduce MTTR and improve reliability
5. Security, Compliance & Governance
- Apply security best practices across AWS, Azure, and GCP accounts (IAM policies, MFA, encryption, etc.)
- Conduct regular audits on access controls, logging, and cloud configuration compliance
- Align with frameworks and regulations such as NIST SP 800-171, CMMC, 48 CFR, DFARS 252.204-7012, or FedRAMP (as applicable)
- Support implementation of cross-cloud security policies and disaster recovery plans
6. Cloud Cost Optimization & Usage Monitoring
- Use cost analysis tools such as AWS Cost Explorer, Azure Cost Management, or GCP Billing
- Identify savings opportunities (e.g., reserved instances, autoscaling, workload shifting)
- Build dashboards to monitor cost trends across multiple cloud accounts
7. Support, Troubleshooting & Operations
- Serve as the escalation point for production and infrastructure incidents
- Troubleshoot issues across network, compute, and application layers
- Collaborate with cloud vendor support (AWS, Azure, GCP) to resolve escalated incidents
8. Knowledge Transfer & Team Enablement
- Document designs, standard operating procedures (SOPs), and decision rationale
- Mentor junior DevOps engineers and developers on DevOps principles
- Lead internal workshops or onboarding sessions on CI/CD pipelines, cloud tools, and infrastructure workflows
Qualifications
Required
- 5+ years of experience as a DevOps or Infrastructure Engineer in cloud-native or hybrid cloud environments
- Strong hands-on experience with AWS infrastructure and services
- Familiarity with Azure and/or GCP environments (even partial exposure is acceptable)
- Expertise in infrastructure-as-code, CI/CD, and container orchestration
- Strong scripting skills and Linux fundamentals
- Proven ability to manage incident response and postmortems
Preferred
- Previous experience setting up and managing clusters of AWS P5e instances (highly preferred)
- Previous experience with NVIDIA compute systems, such as A100, H100, and H200 GPUs (highly preferred)
- AWS certifications (e.g., Solutions Architect, DevOps Engineer); Azure or GCP certifications are a plus
- Experience in compliance-driven or regulated environments
- Familiarity with distributed systems, edge compute, and global infrastructure strategies
- Startup or high-growth team experience