This role is for one of Weekday's clients
Min Experience: 4 years
Location: Pune (On-site / Warehouse + Office)
Job Type: Full-time
We are seeking a DevOps Engineer with strong systems and release engineering expertise to manage deployment, reliability, and on-premises infrastructure for large-scale distributed systems. This role sits at the intersection of Linux internals, automation, and release management, and is ideal for engineers who are comfortable operating directly at the OS, network, and application layers.
Unlike traditional cloud-focused DevOps roles, this position requires deep hands-on experience with bare-metal Linux environments, containerized workloads, and high-availability systems operating under real-world constraints.
Requirements
Key Responsibilities
Linux Systems & Infrastructure
- Operate, tune, and troubleshoot bare-metal Linux servers across CPU, memory, disk, and network layers.
- Perform deep OS-level diagnostics using system logs, process inspection, and kernel-level tooling.
- Resolve complex production issues without reliance on cloud dashboards or managed abstractions.
Release Engineering & CI/CD
- Own end-to-end CI/CD pipelines, including build, release orchestration, staged rollouts, and rollback strategies.
- Manage versioning and release lifecycle to ensure safe, repeatable deployments.
Containerization
- Build, optimize, and debug Docker images with attention to layering, performance, and reliability.
- Integrate containerized services into on-prem environments and troubleshoot runtime issues.
Networking & Reliability
- Diagnose and resolve network issues such as latency, packet loss, jitter, and Wi-Fi instability.
- Ensure reliable system performance in high-load, real-time, and high-density environments.
Automation & Tooling
- Develop automation using Bash and Python for build workflows, log parsing, system utilities, and operational tooling.
Monitoring & Operations
- Monitor system throughput, latency, and health using tools such as Prometheus and Grafana.
- Design alerts, perform Root Cause Analysis (RCA), and implement preventive improvements.
Required Skills & Experience
- Expert-level knowledge of UNIX/Linux systems (Ubuntu Server preferred), including process, memory, and log management.
- Strong experience in release engineering, deployment strategies, and rollback planning.
- Proficiency with Docker, including image optimization and debugging.
- Advanced scripting skills in Bash and Python.
- Solid understanding of networking fundamentals (TCP/UDP, routing, VLANs, Wi-Fi performance).
- Hands-on experience with monitoring, observability, and log analysis tools (Prometheus, ELK, or similar).
Good to Have
- Background in SRE, systems reliability, or build/release engineering.
- Experience running Kubernetes in on-prem or edge environments.
- Operational experience with databases such as PostgreSQL, MongoDB, or Redis (availability, backups).
What This Role Is Not
- Not a cloud-only DevOps role focused solely on AWS/Azure/GCP services.
- Not suitable for engineers who rely on dashboards and lack deep terminal-based troubleshooting skills.
What Success Looks Like
- Reliable, well-orchestrated releases with seamless rollback capabilities.
- Clear visibility into system behavior through meaningful metrics and logs.
- Rapid and accurate root-cause analysis at the OS, network, and application layers.
Key Skills
- Linux
- Docker
- Bash Scripting
- Python
- Release Engineering
- On-Prem Infrastructure
- Monitoring & Observability