Job Description – AWS Cloud DevOps / CloudOps Engineer
About the Role
Creyente Infotech is hiring an experienced AWS CloudOps Engineer to design, operate, automate, and continuously improve cloud infrastructure supporting mission-critical financial systems.
This role is ideal for an engineer with strong hands-on experience in both cloud DevOps and cloud operations. The engineer will manage scalable AWS environments, automate routine operational tasks, improve production reliability, implement monitoring and alerting frameworks, support incident response, and ensure operational readiness for production workloads.
The role requires a strong understanding of AWS cloud services, Infrastructure as Code, automation, networking, security, disaster recovery, observability, and cost optimization across cloud and hybrid environments.
Key Responsibilities
Cloud Engineering & Operations
* Design, deploy, configure, and manage scalable AWS cloud environments.
* Operate and support AWS services including **EC2, ECS/Fargate, RDS, S3, IAM, VPC, Lambda, CloudWatch, Route 53, Load Balancers, Auto Scaling, and Security Groups**.
* Manage cloud infrastructure across production, staging, development, and disaster recovery environments.
* Support hybrid infrastructure involving AWS cloud and on-premises systems.
* Ensure cloud platforms are secure, reliable, highly available, and operationally efficient.
* Perform regular health checks, capacity reviews, patching, upgrades, and environment maintenance.
CloudOps, Production Support & Operational Readiness
* Own operational readiness for production cloud environments.
* Define and implement production support processes, runbooks, SOPs, and escalation procedures.
* Configure alarms, alerts, dashboards, and operational metrics for production workloads.
* Monitor system availability, performance, errors, capacity, latency, and infrastructure health.
Automation & Infrastructure as Code
* Build and maintain Infrastructure as Code using **Terraform**, CloudFormation, or similar tools.
* Automate routine cloud operations such as provisioning, deployments, patching, scaling, backups, monitoring, and reporting.
* Develop reusable infrastructure templates, scripts, and automation workflows.
* Use scripting languages such as **Python, Shell, or PowerShell** to automate operational tasks.
* Improve deployment pipelines and operational efficiency through automation.
* Work with engineering teams to standardize cloud environments and reduce manual effort.
Monitoring, Observability & Alerting
* Build and enhance monitoring and observability frameworks for cloud infrastructure and applications.
* Configure dashboards, alerts, logs, metrics, and traces using tools such as **Amazon CloudWatch, Grafana, Prometheus, ELK/OpenSearch, or Splunk**.
* Define meaningful production alerts to detect availability, performance, security, and capacity issues.
* Reduce alert noise and improve alert quality through tuning and threshold optimization.
* Implement logging and monitoring best practices for critical production systems.
* Provide visibility into infrastructure performance, application health, and operational risks.
Incident Management & Problem Resolution
* Respond to infrastructure, application, and cloud platform incidents.
* Troubleshoot issues related to compute, networking, storage, databases, IAM, containers, and monitoring.
* Perform root cause analysis and implement preventive actions.
* Improve system reliability by identifying recurring issues and automating remediation.
* Collaborate with application, security, database, and infrastructure teams during incident resolution.
* Maintain incident documentation, post-incident reports, and improvement plans.
Disaster Recovery, Backup & Resilience
* Support disaster recovery planning, implementation, and testing.
* Configure and validate backup, restore, failover, and recovery procedures.
* Define and support RTO/RPO requirements for business-critical systems.
* Implement high-availability and fault-tolerant architecture patterns.
* Participate in DR drills and ensure cloud environments are ready for recovery scenarios.
* Support system hardening, resilience testing, and operational risk reduction.
Cloud Cost Optimization & Governance
* Monitor and optimize AWS cloud costs across compute, storage, database, networking, and managed services.
* Identify underutilized, overprovisioned, or unused cloud resources.
* Recommend cost-saving actions such as right-sizing, Reserved Instances, Savings Plans, storage lifecycle policies, and autoscaling.
* Implement tagging standards, cost allocation, usage reporting, and budget alerts.
* Work with stakeholders to balance cost, performance, scalability, and reliability.
Security & Compliance
* Implement and manage IAM roles, policies, users, groups, and access controls.
* Follow cloud security best practices for networking, encryption, secrets management, and access governance.
* Ensure cloud environments follow organizational security and operational standards.
Required Skills & Experience
* 4–6 years of experience in Cloud Engineering, DevOps, CloudOps, or Infrastructure Operations roles.
* Strong hands-on experience with AWS cloud services in production environments.
* Experience designing, managing, and operating scalable cloud infrastructure.
* Hands-on experience with Terraform.
* Strong scripting and automation skills using Python or shell scripting (such as Bash).
* Experience with monitoring, logging, and observability tools such as:
    * CloudWatch
    * Grafana
    * Prometheus
    * ELK/OpenSearch
    * Splunk
* Good knowledge of cloud networking, including VPC, subnets, routing, NAT Gateway, VPN, security groups, and load balancing.
* Strong fundamentals in Linux and/or Windows server administration.
* Experience supporting production systems, incident response, troubleshooting, and root cause analysis.
* Understanding of backup, disaster recovery, high availability, and operational readiness practices.
* Knowledge of cloud cost optimization and governance practices.
* Good communication skills and ability to work with application, infrastructure, security, and business teams.
Nice to Have
* Experience supporting high-availability and low-latency production environments.
* Experience with CI/CD tools such as Jenkins, GitLab CI/CD, GitHub Actions, Azure DevOps, or AWS CodePipeline.
* Knowledge of change management and incident management.
* AWS certifications such as:
    * AWS Certified CloudOps Engineer - Associate (formerly SysOps Administrator - Associate)
    * AWS Certified DevOps Engineer - Professional