Roles and Responsibilities:
- Infrastructure Management: Deploy, manage, and optimize on-premises and cloud-based infrastructure (AWS/Azure/On-Prem).
- Networking & Security: Configure and maintain networking components, including VPNs, firewalls, load balancers, and private/public subnets, for a highly available architecture.
- Kafka Administration: Set up, configure, and manage Kafka clusters for high-throughput messaging.
- Containerization & Orchestration: Implement and manage Docker containers and Kubernetes clusters in AWS/Azure/on-prem environments.
- Infrastructure as Code (IaC): Automate infrastructure provisioning using Terraform, CloudFormation, or Ansible.
- CI/CD Pipelines: Develop and optimize CI/CD pipelines using Jenkins, GitLab CI/CD, or Azure DevOps for seamless application deployment.
- Monitoring & Logging: Implement and maintain monitoring/logging solutions like ELK Stack, Prometheus, Grafana, CloudWatch, or Datadog.
- Database Management: Support and maintain relational (RDS, PostgreSQL, MySQL) and NoSQL (MongoDB, DynamoDB) databases with high availability & backup strategies.
- Security & Compliance: Implement best practices for security, compliance, and governance across cloud and on-prem environments.
Requirements
Key Skills & Desired Experience:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or related roles.
- Expertise in AWS, Azure, and on-prem infrastructure setup and management.
- Strong networking knowledge, including VPCs, VPNs, load balancing, firewalls, and DNS management.
- Proficiency in scripting and automation using Python, Bash, or PowerShell.
- Experience with Kafka cluster configuration and maintenance.
- Experience with containerization and orchestration using Docker and Kubernetes in cloud and on-prem environments.
- Hands-on experience with IaC tools like Terraform, Ansible, and CloudFormation.
- Knowledge of CI/CD tools like Jenkins, GitLab, or Azure DevOps.
- Experience with monitoring/logging tools (ELK, Prometheus, Grafana).
- Understanding of highly available, scalable, and fault-tolerant infrastructure.