Intern - SRE

LeadSquared • Internship • Bengaluru, IN

Group Company: LeadSquared (MarketXpander Services Private Limited)

Designation: Intern - Site Reliability Engineer (SRE)

Office Location: Bengaluru

Position Description: The SRE is responsible for monitoring the availability and performance of LeadSquared's SaaS production environment, which is hosted entirely on AWS. The role combines proactive observability, capacity planning, and incident management to ensure the reliability and efficiency of cloud infrastructure and services.

Primary Responsibilities

Monitor availability and performance of production SaaS infrastructure hosted on AWS; drive capacity planning and reliability improvements
Own end-to-end incident management, including emergency response, timely mitigation, root cause analysis (RCA), and documentation of preventive actions
Build and contribute to platforms and processes for full observability and automated incident response across systems, applications, and infrastructure
Collaborate with DevOps, InfoSec, and Engineering teams to improve performance, reliability, and operability of applications and services
Gather and analyse performance metrics from OS and application layers to identify bottlenecks and areas for improvement
Occasionally engage with customers to address infrastructure availability and performance concerns

Additional Responsibilities

Track and document all incidents with structured RCA reports and preventive actions
Operate and optimise monitoring tools including New Relic, Grafana, Loggly, PagerDuty, Site24x7, Freshservice, Kibana, and Akamai
Manage and monitor AWS services including EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, and VPCs
Improve observability posture beyond baseline monitoring; implement alerting and automated response mechanisms

Reporting Team

Reporting Department: SRE

Educational Qualifications Preferred

Category: Full-time
Field Specialization: Computer Science, Information Technology, or related engineering discipline
Degree: Bachelor's (B.Tech / B.E. / B.Sc.)

Certifications: AWS Certification (preferred); ITIL Certification (preferred)

Required Work Experience

Industry: SaaS / Cloud / Technology
Role: Site Reliability Engineer / DevOps Engineer / Cloud Infrastructure Engineer
Years of Experience: 0.5–1 year in an SRE role on cloud-based applications (preferably AWS)

Key Performance Indicators

Production environment uptime and availability SLAs
Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents
RCA completion rate and quality of preventive actions
Observability coverage across services and infrastructure
Incident recurrence rate post-preventive action implementation

Required Competencies

Incident management and emergency response
Root cause analysis and structured problem-solving
Proactive identification of performance bottlenecks and reliability risks
Cross-functional collaboration with DevOps, InfoSec, and Engineering teams
Strong documentation discipline and communication skills

Required Knowledge

SRE principles and best practices for multi-tenant SaaS environments
AWS services: EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, VPCs
Monitoring and observability tools: New Relic, Grafana, Loggly, PagerDuty, Site24x7, Freshservice, Kibana, Akamai
Web application, database, API, and backend job monitoring concepts
OS and application-level performance metrics analysis

Required Skills

Hands-on experience with observability, monitoring, alerting, and incident management on AWS
Debugging and troubleshooting of live production application and infrastructure issues
Programming in Python or equivalent scripting language (preferred)
Experience monitoring multi-tenant SaaS environments across web, DB, API, and batch layers
Documentation and RCA reporting

Required Abilities

Physical: Ability to support on-call rotations, including off-hours incident response
Other: Ability to function effectively in a fast-paced, rapidly changing environment; ability to work collaboratively in a diverse, team-focused setup

Work Environment Details: Fast-paced SaaS product environment; cross-functional team collaboration with DevOps, InfoSec, and Engineering; on-call incident response model