Responsibilities
● Manage the availability, latency, and performance of mission-critical services, and build automation to prevent problem recurrence.
● Independently determine and develop architectural approaches and infrastructure solutions.
● Define the strategy and roadmap for CI/CD, application hosting, security, and compliance standards and guidelines, and ensure adherence to them.
● Manage the incident response protocol and provide hands-on direction during service interruptions; assist with root cause analysis of service interruptions and maintain SLAs.
● Manage teams and guide them to achieve the above.
Basic Requirements
● 4 years of experience handling Linux systems at large scale.
● 4 years of hands-on experience with containers and container orchestration tools.
● 4 years of proven experience designing, building, supporting, and observing large-scale distributed systems, services, and infrastructure.
● Strong work ethic; a self-starter who demonstrates a high level of resilience.
● Highly goal-driven and works well in a fast-paced, team-oriented environment.
● Experience as a Site Reliability Engineer, with a focus on AWS.
● Shell/Ruby/Python scripting knowledge.
● Strong written and verbal communication skills.
Preferred Qualifications
● Deep-rooted understanding of Linux systems, databases, and networking concepts.
● In-depth knowledge of cloud services, including compute, storage, networking, databases, and security.
● Strong proficiency in infrastructure as code (IaC) concepts and tools, such as Terraform/CloudFormation, for automating infrastructure deployment.
● Familiarity with common web, application, and database servers (e.g., nginx, PostgreSQL).
● Experience with queue systems like RabbitMQ/Kafka is a plus.
● Experience with monitoring and logging tools, such as CloudWatch, Cloud Monitoring, and the ELK/EFK stack.
● Proficiency with Kubernetes internal architecture, networking, and containerized microservice architectural patterns.
● Strong experience in microservices architecture, API gateways, service mesh implementation, and instrumenting XaaC (Infrastructure, Software, Network, Policy, Security) across global-scale systems.
● Hands-on experience in defining and driving disaster recovery across platforms.
● Ability to conduct technical deep-dives into code, networking, operating systems, and storage, while also being able to participate in executive strategy discussions.
● Experience building automation, auditing, and other tooling for security, compliance, and resource usage; ability to monitor and improve processes for all deployments.
● Clear understanding of server-to-server interactions and best practices.