ABOUT OOLIO:
Oolio is a leading B2B SaaS platform transforming how hospitality venues operate and grow. Trusted by more than 22,000 venues, we power mission-critical POS, payments, online ordering, kiosks, loyalty, kitchen management, and real-time insights — all within one connected ecosystem.
We are building the operating system for modern hospitality — simplifying complex operations, accelerating service, and unlocking smarter, data-driven decisions. Built by hospitality professionals with decades of industry experience, we understand the realities of every shift, every service rush, and every guest interaction. From cafés and quick-service restaurants to pubs, multi-site groups, and stadiums, Oolio enables venues to operate seamlessly at scale. With next-business-day settlements, powerful third-party integrations, and 24/7 real human support, we go beyond software — we become long-term partners in growth.
As a rapidly scaling product-led organisation, we’re shaping the future of hospitality technology.
We build the technology backbone that enables modern hospitality businesses to perform, compete, and thrive at scale.
JOB DESCRIPTION:
At Oolio, Senior Site Reliability Engineers (SREs) are responsible for ensuring the reliability, availability, and performance of our mission-critical B2B SaaS platforms. You will take ownership of production systems and work at the intersection of software engineering and infrastructure to build scalable, resilient, and highly observable systems.
This is not a traditional DevOps or support role. We are looking for engineers who understand how applications are built, architected, and operated in production — and who can apply engineering principles to solve reliability and scalability challenges.
You will collaborate closely with Product Engineering, Platform, and Security teams to improve system reliability, reduce operational overhead, and drive a strong culture of automation and operational excellence across Oolio.
#SRE #PlatformEngineering #DevOps
RESPONSIBILITIES:
- Own the reliability, availability, and performance of Oolio’s production systems across environments.
- Design, build, and operate highly reliable and scalable Kubernetes-based infrastructure with a focus on uptime and fault tolerance.
- Define, implement, and manage SLOs, SLIs, and error budgets to drive reliability engineering practices.
- Drive observability maturity across systems including monitoring, logging, metrics, tracing, and alerting.
- Lead incident management processes including on-call participation, incident response, root cause analysis (RCA), and postmortems.
- Improve system resilience by identifying and eliminating single points of failure.
- Partner with engineering teams to optimise application performance, scaling strategies, and production readiness.
- Automate operational workflows to reduce manual intervention and improve system efficiency.
- Improve deployment reliability by implementing robust CI/CD practices and safe release strategies (canary, blue/green, rollback mechanisms).
- Work closely with Platform Engineering to enhance infrastructure reliability, scalability, and operational tooling.
- Ensure production systems meet security, compliance, and data protection standards.
- Drive capacity planning, load testing, and performance benchmarking initiatives.
- Troubleshoot complex production issues across application, infrastructure, and networking layers.
- Participate in architecture reviews to ensure systems are designed for reliability, scalability, and operability.
- Mentor engineers and promote a culture of reliability, ownership, and continuous improvement.
REQUIREMENTS:
Role: Senior Site Reliability Engineer (SRE) - final role will depend on candidate experience, credentials and interview outcomes
Experience: 9–15 years overall
(minimum 6–7 years in SRE / Platform Engineering / Production Engineering roles, plus a minimum of 3–4 years of strong backend development experience)
Education: Preferred – B.Sc/M.Sc/B.Tech/B.E/M.Tech/M.E/MCA/M.S
Technology Stack: Kubernetes (EKS/AKS), AWS and/or Azure, Terraform, Helm, Docker, CI/CD (GitHub Actions/Jenkins), Prometheus, Grafana, ELK/Splunk, Go/Python/Java, Bash, Distributed Systems, Networking (DNS, TCP/IP, HTTP/HTTPS), Infrastructure Security, Observability, Reliability Engineering, Incident Management, Cloud Architecture, Scalability Engineering
Other Requirements:
- Strong background as a backend engineer before transitioning into SRE / platform / production engineering roles.
- Deep understanding of distributed systems, failure modes, and high-availability architecture.
- Hands-on experience managing production systems at scale with a strong focus on uptime, latency, and system health.
- Strong experience with observability practices including metrics, logs, tracing, and alerting frameworks.
- Proven experience defining and implementing SLOs/SLIs and driving reliability improvements using error budgets.
- Experience handling production incidents, performing RCA, and implementing long-term reliability fixes.
- Strong understanding of Kubernetes in production, including scheduling, autoscaling, networking, and failure handling.
- Experience building and maintaining Infrastructure-as-Code and automation for production environments.
- Strong understanding of deployment strategies and release safety mechanisms.
- Solid troubleshooting skills across application, infrastructure, and networking layers.
- Experience in high-scale, multi-tenant SaaS or transaction-heavy systems is a strong plus.
- Ability to influence engineering teams to adopt reliability best practices.
- Strong automation-first mindset with a focus on reducing toil and improving system resilience.