Intern - SRE
About the Role
Job Description

Group Company: LeadSquared (MarketXpander Services Private Limited)
Designation: Intern - Site Reliability Engineer (SRE)
Office Location: Bengaluru

Position Description:
The SRE is responsible for monitoring the availability and performance of LeadSquared's 100% AWS-hosted SaaS production environment. The role combines proactive observability, capacity planning, and incident management to ensure reliability and efficiency of cloud infrastructure and services.

Primary Responsibilities:
Monitor availability and performance of production SaaS infrastructure hosted on AWS; drive capacity planning and reliability improvements
Own end-to-end incident management including emergency response, timely mitigation, root cause analysis (RCA), and preventive action documentation
Build and contribute to platforms and processes for full observability and automated incident response across systems, applications, and infrastructure
Collaborate with DevOps, InfoSec, and Engineering teams to improve performance, reliability, and operability of applications and services
Gather and analyse performance metrics from OS and application layers to identify bottlenecks and areas for improvement
Occasionally engage with customers to address infrastructure availability and performance concerns

Additional Responsibilities:
Track and document all incidents with structured RCA reports and preventive actions
Operate and optimise monitoring tools including NewRelic, Grafana, Loggly, PagerDuty, Site24x7, FreshService, Kibana, and AKAMAI
Manage and monitor AWS services including EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, and VPCs
Improve observability posture beyond baseline monitoring; implement alerting and automated response mechanisms

Reporting Team
Reporting Department: SRE

Educational Qualifications
Preferred Category: Full-time
Field Specialization: Computer Science, Information Technology, or related engineering discipline
Degree: Bachelor's (B.Tech / B.E. / B.Sc.)
Required Certification/s: AWS Certification (preferred); ITIL Certification (preferred)

Required Work Experience
Industry: SaaS / Cloud / Technology
Role: Site Reliability Engineer / DevOps Engineer / Cloud Infrastructure Engineer
Years of Experience: 0.5–1 year in an SRE role on cloud-based applications (preferably AWS)

Key Performance Indicators:
Production environment uptime and availability SLAs
Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents
RCA completion rate and quality of preventive actions
Observability coverage across services and infrastructure
Incident recurrence rate post-preventive action implementation

Required Competencies:
Incident management and emergency response
Root cause analysis and structured problem-solving
Proactive identification of performance bottlenecks and reliability risks
Cross-functional collaboration with DevOps, InfoSec, and Engineering teams
Strong documentation discipline and communication skills

Required Knowledge:
SRE principles and best practices for multi-tenant SaaS environments
AWS services: EC2, RDS, Elasticsearch, ECS, Redis, SQS, Lambda, API Gateway, VPCs
Monitoring and observability tools: NewRelic, Grafana, Loggly, PagerDuty, Site24x7, FreshService, Kibana, AKAMAI
Web application, database, API, and backend job monitoring concepts
OS and application-level performance metrics analysis

Required Skills:
Hands-on experience with observability, monitoring, alerting, and incident management on AWS
Debugging and troubleshooting of live production application and infrastructure issues
Programming in Python or equivalent scripting language (preferred)
Experience monitoring multi-tenant SaaS environments across web, DB, API, and batch layers
Documentation and RCA reporting

Required Abilities
Physical: Ability to support on-call rotations including off-hours incident response
Other: Ability to function effectively in a fast-paced, rapidly changing environment; ability to work collaboratively in a diverse, team-focused setup

Work Environment Details: Fast-paced SaaS product environment; cross-functional team collaboration with DevOps, InfoSec, and Engineering; on-call incident response model
Time Constraints: On-call availability required for production incident response