Site Reliability Engineer - AWADC5704026

Montreal | 2 days ago | Full-time | External
Negotiable
The SRE will focus on ensuring reliable, resilient systems through task automation, observability, incident response, and problem elimination, while also participating in production-side operations and on-call rotations.

KEY RESPONSIBILITIES
• Deliver improvements to maximize system availability and performance through optimized and automated operational tasks.
• Collaborate on the development of operational tools, problem management, and architecture reviews.
• Troubleshoot ServiceNow issues and occasional on-premise capabilities in a Linux environment.
• Explore and implement observability practices including metrics, logging, tracing, and alerting to measure product reliability.
• Participate in on-call rotation with global team members, ensuring responsiveness during agreed hours.
• Contribute to documentation of ServiceNow instances and related dependencies.
• Identify and prioritize technical debt impacting client satisfaction or operational efficiency.
• Provide feedback on policies and procedures to enhance SRE and operational practices, improving safety and efficiency.

REQUIRED QUALIFICATIONS
• 7+ years of experience in software development, infrastructure, or system administration.
• Proficiency in at least one programming language (e.g., Python) or ServiceNow administration/development experience.
• Strong oral and written communication skills.
• Proven ability to establish effective relationships with colleagues and collaborate on successful delivery.
• Dependable team player with demonstrated commitment to client service.
• Ability to respond appropriately during technical emergencies such as outages.
• Willingness to participate in on-call rotation.

DESIRED QUALIFICATIONS
• ServiceNow administration or development experience (can be acquired on the job with training).
• Experience with SQL databases, APIs, and web infrastructure.
• Familiarity with chatbot technology and on-call escalation incident management.
• Strong interest in reliability, resilience principles, and SRE practices.

WORKING CONDITIONS
• Global team collaboration across multiple time zones.
• Production-side operational responsibilities with occasional on-call duties.
• Fast-paced environment requiring adaptability, problem-solving, and continuous improvement mindset.