Group Company: LeadSquared (MarketXpander Services Private Limited)
Designation: Intern - Site Reliability Engineer (SRE)
Office Location: Bengaluru
Position Description: The SRE is responsible for monitoring the availability and performance of LeadSquared's 100% AWS-hosted SaaS production environment. The role combines proactive observability, capacity planning, and incident management to ensure the reliability and efficiency of cloud infrastructure and services.
Primary Responsibilities
- Monitor availability and performance of production SaaS infrastructure hosted on AWS; drive capacity planning and reliability improvements
- Own end-to-end incident management including emergency response, timely mitigation, root cause analysis (RCA), and preventive action documentation
- Build and contribute to platforms and processes for full observability and automated incident response across systems, applications, and infrastructure
- Collaborate with DevOps, InfoSec, and Engineering teams to improve performance, reliability, and operability of applications and services
- Gather and analyse performance metrics from OS and application layers to identify bottlenecks and areas for improvement
- Occasionally engage with customers to address infrastructure availability and performance concerns
Additional Responsibilities
- Track and document all incidents with structured RCA reports and preventive actions
- Operate and optimise monitoring tools including New Relic, Grafana, Loggly, PagerDuty, Site24x7, Freshservice, Kibana, and Akamai
- Manage and monitor AWS services including EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, and VPCs
- Improve observability posture beyond baseline monitoring; implement alerting and automated response mechanisms
Reporting Team
- Reporting Department: SRE
Educational Qualifications Preferred
- Category: Full-time
- Field Specialization: Computer Science, Information Technology, or related engineering discipline
- Degree: Bachelor's (B.Tech / B.E. / B.Sc.)
Required Certification/s: AWS Certification (preferred); ITIL Certification (preferred)
Required Work Experience
- Industry: SaaS / Cloud / Technology
- Role: Site Reliability Engineer / DevOps Engineer / Cloud Infrastructure Engineer
- Years of Experience: 0.5–1 year in an SRE role on cloud-based applications (preferably AWS)
Key Performance Indicators
- Production environment uptime and availability SLAs
- Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents
- RCA completion rate and quality of preventive actions
- Observability coverage across services and infrastructure
- Incident recurrence rate after preventive actions are implemented
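For illustration only, the uptime and MTTD/MTTR indicators above could be computed from incident records along the following lines. This is a minimal sketch, not LeadSquared's actual tooling: the Incident record, the kpi_summary helper, and the sample timestamps are all hypothetical, and MTTR is assumed here to be measured from the start of the fault to resolution (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the fault began
    detected: datetime   # when an alert or a person noticed it
    resolved: datetime   # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def kpi_summary(incidents, period):
    """Return MTTD and MTTR in minutes and availability in percent.

    Assumes at least one incident and treats every incident as full
    downtime for its whole duration, which overstates partial outages.
    """
    mttd = mean_minutes([i.detected - i.started for i in incidents])
    mttr = mean_minutes([i.resolved - i.started for i in incidents])
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    availability = 100 * (1 - downtime / period)
    return {"mttd_min": mttd, "mttr_min": mttr, "availability_pct": availability}


# Made-up sample data: two incidents in a 30-day reporting period.
sample_incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4),
             datetime(2024, 1, 5, 10, 40)),
    Incident(datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 2, 10),
             datetime(2024, 1, 20, 3, 20)),
]
summary = kpi_summary(sample_incidents, timedelta(days=30))
print(summary)
```

With this sample data, detection takes 4 and 10 minutes and resolution takes 40 and 80 minutes, so MTTD is 7 minutes and MTTR is 60 minutes; 120 minutes of downtime over 30 days corresponds to roughly 99.72% availability.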
Required Competencies
- Incident management and emergency response
- Root cause analysis and structured problem-solving
- Proactive identification of performance bottlenecks and reliability risks
- Cross-functional collaboration with DevOps, InfoSec, and Engineering teams
- Strong documentation discipline and communication skills
Required Knowledge
- SRE principles and best practices for multi-tenant SaaS environments
- AWS services: EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, VPCs
- Monitoring and observability tools: New Relic, Grafana, Loggly, PagerDuty, Site24x7, Freshservice, Kibana, Akamai
- Web application, database, API, and backend job monitoring concepts
- OS and application-level performance metrics analysis
Required Skills
- Hands-on experience with observability, monitoring, alerting, and incident management on AWS
- Debugging and troubleshooting of live production application and infrastructure issues
- Programming in Python or an equivalent scripting language (preferred)
- Experience monitoring multi-tenant SaaS environments across web, DB, API, and batch layers
- Documentation and RCA reporting
Required Abilities
- Physical: Ability to support on-call rotations including off-hours incident response
- Other: Ability to function effectively in a fast-paced, rapidly changing environment; ability to work collaboratively in a diverse, team-focused setup
Work Environment Details: Fast-paced SaaS product environment; cross-functional team collaboration with DevOps, InfoSec, and Engineering; on-call incident response model
- Time Constraints: On-call availability required for production incident response