Job Description
Position Overview
The Principal Cloud/SRE Engineer will spearhead the design, implementation, and management of Disaster Recovery (DR) solutions for data-intensive applications and data engineering pipelines. Focusing on AWS cloud infrastructure, the candidate will build robust DR strategies that deliver minimal downtime, data integrity, and fast recovery times. This role is central to safeguarding the organization's mission-critical data and systems in the event of unforeseen disruptions.
Responsibilities
Design and Implement DR Solutions: Develop comprehensive disaster recovery plans tailored to data-heavy applications and data engineering pipelines hosted on AWS. Ensure that all critical systems are recoverable within agreed Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
AWS Cloud Expertise: Use AWS-native services such as AWS Backup, S3, RDS, DynamoDB, and EBS snapshots, together with infrastructure-as-code tooling such as Terraform, to build scalable and reliable backup and disaster recovery frameworks.
Data Engineering Pipeline DR: Collaborate with Data Engineering teams to set up failover solutions and backup strategies for ETL pipelines and streaming data architectures using services like EMR, Glue, Redshift, Kinesis, and Lambda.
Automated Backup and Restore Processes: Implement automated and scheduled backups, ensuring data integrity across large-scale environments. Develop and document failover strategies for continuous operation during disasters.
Monitoring and Testing: Regularly test the disaster recovery plans, simulating failure scenarios to ensure operational readiness. Identify and resolve gaps through continuous testing, including full-scale failover tests.
Failover and Redundancy Strategies: Implement advanced redundancy strategies (such as multi-region failover, cross-region replication, and autoscaling) to maintain service availability and minimize downtime during disaster recovery events.
Disaster Recovery Playbooks: Create comprehensive playbooks with detailed, step-by-step recovery procedures for both engineering and operations teams, ensuring clear guidance during an actual disaster event.
Collaboration with Stakeholders: Work closely with development, operations, and data teams to ensure DR plans are integrated into broader application and data pipeline architectures. Ensure alignment with business continuity goals.
Cost Optimization: Keep DR solutions cost-effective by making informed use of AWS pricing models and optimizing storage and data replication costs.
Qualifications
11-15 years of experience in a Cloud, SRE, or similar role with a strong focus on disaster recovery for large-scale, data-heavy environments.
Proven experience in setting up and managing DR solutions on AWS, including in-depth knowledge of AWS services like S3, EC2, RDS, EBS, Redshift, and Glue, as well as Terraform.
Expertise in handling data-intensive applications and creating resilient solutions for data pipelines, including ETL, streaming, and batch processing.
Strong understanding of high availability, resilience patterns, multi-region failover, and AWS fault-tolerant architectures.
Experience with data backup, archival strategies, and restoration processes for high-volume data systems.
Familiarity with automation tools like Terraform for DR environment setup and scaling.
Experience conducting disaster recovery drills, simulations, and root cause analyses to continuously improve DR effectiveness.
Strong skills in incident management and collaborating with cross-functional teams to mitigate risks and ensure system uptime.
Exceptional problem-solving skills and meticulous attention to detail.
Excellent leadership, communication, and interpersonal skills, with a proven ability to inspire and lead teams.
Preferred Skills
Relevant professional certifications such as AWS Certified Solutions Architect or similar.
Experience with DevOps automation tools and scripting for infrastructure management (e.g., Python, Bash).
Familiarity with observability frameworks for monitoring system health and performance during disaster events.
Experience managing multi-cloud or hybrid cloud disaster recovery setups (e.g., AWS and on-premises infrastructure).