We are looking for a Lead Site Reliability Engineer with a strong retail background and hands-on experience with New Relic. Please see the job description below.

Job Description:
• As a Senior/Lead Site Reliability Engineer, you’ll take ownership of the reliability, performance, and scalability of high-traffic retail platforms.
• This role demands deep experience in cloud-native environments, a strong observability mindset (with New Relic as a must), and the ability to lead both incident response and system design discussions with client teams.
• You’ll serve as a technical leader and mentor, partnering with engineering, DevOps, and product teams to build resilient systems for real-time retail operations, including eCommerce platforms like Shopify (bonus).

Key Responsibilities:
• Lead reliability and observability strategy for large-scale retail systems.
• Architect and implement robust monitoring using New Relic: dashboards, SLOs, alerts, synthetic monitoring, etc.
• Guide incident response processes and run blameless postmortems.
• Own availability, performance, and scalability of customer-facing apps and services.
• Design infrastructure for high availability using Kubernetes, Docker, and IaC tools (Terraform, CloudFormation).
• Collaborate with client engineering teams to optimize system behavior during retail surges (e.g., Black Friday).
• Mentor junior SREs and set operational best practices.
• Partner with dev and QA to integrate performance testing and failure injection into CI/CD workflows.
• Advocate for DevOps/SRE best practices (shift-left monitoring, chaos testing, performance budgets).

Required Qualifications:
• 8+ years in Site Reliability Engineering, DevOps, or Platform Engineering.
• Expertise with New Relic; must be able to architect observability end to end.
• Proven experience supporting retail or eCommerce platforms at scale.
• Strong coding/scripting skills (Python, Bash, or Go).
• Production experience with AWS/GCP/Azure and Kubernetes.
• Deep understanding of infrastructure automation (Terraform, Ansible, or Pulumi).
• Strong communication skills, client-facing presence, and leadership ability.

Nice to Have:
• Experience with Shopify or headless commerce stacks.
• Experience leading distributed teams.
• Familiarity with traffic-heavy retail events and strategies (caching, autoscaling, edge optimization).
• Experience integrating monitoring into microservices, APIs, and frontend apps.

Lead Site Reliability Engineer

BayOne Solutions