Position: Principal Site Reliability Engineering Specialist (SRE)
Position Description:
Location:
Edmonton
Open to other locations within proximity to a CGI Office
Hybrid work model
We are hiring a Senior Site Reliability Engineer (SRE) with a strong foundation in building and operating reliable, scalable, and resilient cloud platforms. You bring a reliability and performance engineering mindset to everything you do—balancing operational stability with modernization and automation. In this role, you will apply core SRE practices—including SLIs/SLOs, observability, incident management, and operational automation—while temporarily supporting a regional support strategy engagement focused on assessing and strengthening large-scale operational environments.
You will work closely with platform, operations, and architecture teams to evaluate current-state practices, identify reliability and support gaps, and contribute to the definition of future-state operating models and implementation roadmaps. Beyond this engagement, the role is designed for ongoing, hands-on SRE delivery, where you will lead and implement monitoring, reliability engineering, automation, and tooling across cloud and hybrid environments.
You will collaborate with cross-functional teams to design, build, and continuously improve platform reliability, engineering standards, and operational excellence practices for mission-critical services. This position places you in a client-facing, high-impact environment, where your technical depth, operational judgment, and ability to translate reliability principles into practical outcomes will directly influence service stability, modernization efforts, and future cloud initiatives. If you are a proven SRE who thrives in complex environments and values both hands-on engineering and operational leadership, this role offers the opportunity to make a meaningful and lasting impact.
Your future duties and responsibilities:
Who are You?
You are a senior Site Reliability Engineer who thrives on solving complex reliability and operational challenges. You are curious, collaborative, and continuously focused on improving how platforms, infrastructure, and services are operated and supported. Your strength lies in applying sound engineering judgment to real-world operational problems, balancing reliability, performance, and maintainability. You are equally comfortable working hands-on with tools and systems and stepping back to assess how operational practices, support models, and workflows impact service reliability.
You can engage confidently in technical discussions with engineers while also communicating clearly with operational leaders and stakeholders to explain risks, trade-offs, and improvement opportunities.
With a mindset grounded in continuous improvement and learning, you champion modernization, automation, and pragmatic reliability practices. You are trusted for your ability to identify root causes rather than symptoms, to raise concerns early, and to translate reliability principles into practical, actionable outcomes. Your peers value your technical depth and calm leadership in complex environments, and teams rely on you to elevate operational maturity and execution quality.
At CGI, we recognize strong SRE practitioners and provide the environment and support for them to grow, contribute, and make a meaningful impact across engagements.
Responsibilities
• Develop, operate, and evolve monitoring, logging, and alerting capabilities across cloud and hybrid environments, while temporarily contributing SRE expertise to assess and rationalize existing operational monitoring practices as part of a regional support strategy initiative.
• Define, implement, and continuously improve SLIs, SLOs, and SLAs for platform and service reliability, applying these principles during the engagement to evaluate current-state service outcomes and inform future-state reliability targets.
• Lead and participate in incident response, problem investigation, and root cause analysis, leveraging hands-on SRE experience to identify systemic reliability issues and recurring operational failure patterns observed across regional support operations.
• Design and automate reliability and operational…