Core Technical Skills
- Deep expertise in Kubernetes, Docker, and container orchestration at scale.
- Solid understanding of cloud platforms (AWS preferred) with a focus on scalability, reliability, and cost optimization.
- Hands-on experience with Infrastructure as Code (Terraform, CloudFormation, or equivalent).
- Strong CI/CD experience with Jenkins or GitHub Actions, including pipeline standardization and reuse.
- Experience implementing GitOps workflows using tools like ArgoCD or Flux.
- Strong experience building self-service infrastructure and developer workflows.
- Strong programming skills in Python and/or Golang for platform tooling and automation.
- Experience designing platform APIs, reusable modules, and golden paths for developers.
- Strong understanding of SRE principles including SLIs/SLOs, error budgets, and incident management.
- Ability to lead architecture discussions and contribute to platform strategy and roadmaps.
- Experience mentoring engineers in cloud-native, DevOps, and platform engineering practices.
- Ability to drive best practices in DevOps, platform reliability, and developer experience.
- Ability to collaborate closely with Product, Security, and Infrastructure teams to improve developer productivity.
- Exposure to multi-cloud and hybrid cloud environments.
- Knowledge of cost governance and FinOps practices.
- Familiarity with platform scalability patterns and multi-tenancy design.
- Understanding of event-driven architectures and messaging systems (Kafka, SNS/SQS, etc.).
- Exposure to AI/ML platform integration (as a platform capability, not core focus).
- Understanding of LLM integrations, APIs, or AI-driven developer tooling.
DevOps Engineering, Solution Architect
Roles and Responsibilities
Platform Engineering & Developer Experience
- Define and maintain golden paths, reusable templates, and standardized workflows for developers.
- Build platform APIs, CLI tools, and automation to improve developer productivity and reduce operational overhead.
- Collaborate with application, security, and infrastructure teams to align platform capabilities with business needs.
- Drive platform adoption through documentation, onboarding, and developer enablement.
Cloud & Core Infrastructure
- Architect, build, and operate highly available, scalable, and secure cloud infrastructure primarily on AWS.
- Design VPCs, IAM, compute, storage, networking, and load balancing solutions following best practices.
- Define and implement scalability, high availability, and disaster recovery strategies.
- Support multi-AZ and multi-region architectures for production workloads.
- Optimize infrastructure for performance, cost, and reliability (FinOps awareness).
Infrastructure As Code (IaC)
- Design and maintain Terraform-based infrastructure using reusable modules and standardized patterns.
- Implement remote state management, environment isolation, and secure configuration handling.
- Integrate IaC workflows into CI/CD pipelines with automated validation, policy checks, and provisioning.
- Enforce infrastructure governance and compliance using policy-as-code.
Containers & Kubernetes Platform
- Design, deploy, and operate Kubernetes platforms (EKS and on-premises).
- Manage cluster lifecycle including provisioning, upgrades, scaling, and decommissioning.
- Operate and standardize Kubernetes add-ons (CNI, CoreDNS, ingress controllers, CSI drivers, etc.).
- Design multi-cluster strategies for workload isolation, failover, and scalability.
- Build reusable Helm charts and manage application lifecycle using Helm and GitOps practices.
- Provide standardized deployment patterns and abstractions for application teams.
GitOps & CI/CD
- Implement GitOps workflows (ArgoCD/Flux) for declarative, version-controlled deployments.
- Design and maintain CI/CD pipelines (Jenkins/GitHub Actions) with built-in quality, security, and compliance checks.
- Enable progressive delivery strategies (blue-green, canary) and automated rollback mechanisms.
- Ensure environment consistency, drift detection, and auditability across deployments.
Databases & Data Platform
- Design and manage scalable, highly available database solutions (RDS, Aurora, DynamoDB, or equivalent).
- Support database provisioning, configuration, and lifecycle management via IaC.
- Implement backup, restore, and disaster recovery strategies for data platforms.
- Optimize database performance, scaling, and cost efficiency.
- Enable secure access patterns, credential management, and data encryption.
- Collaborate with application teams on schema management, migrations, and database reliability.
- Integrate databases into platform workflows (self-service provisioning, automation, and observability).
Development, Scripting & Automation
- Build platform tooling and automation using Python or Golang.
- Develop internal services, APIs, and CLIs to standardize operations.
- Automate infrastructure, deployments, and operational workflows.
- Maintain reusable libraries and frameworks for platform consistency.
Observability & Networking
- Implement monitoring, logging, and tracing using Prometheus, Grafana, ELK/Loki, OpenTelemetry, or CloudWatch.
- (Good to have) Experience with service mesh technologies (Istio/Envoy) for traffic control and secure communication.
AI/ML Platform Enablement (Good To Have)
- Exposure to integrating AI/ML capabilities into platform workflows (as a service, not core focus).
- Familiarity with LLM APIs or AI-powered developer tooling.
- Support infrastructure requirements for ML workloads (compute, storage, pipelines) where applicable.