Senior SRE — AI/ML GPU HPC Infrastructure

Toronto

9 days ago

Full-time

External

Negotiable

Boson AI

A technology company in Toronto seeks a Senior Site Reliability Engineer to manage and optimize HPC cluster operations in a datacenter equipped with advanced GPUs. The ideal candidate has more than five years of experience, is proficient in Linux and Kubernetes, and is skilled with automation tools. Responsibilities include managing infrastructure, supporting ML teams, and developing automation to improve operational efficiency. The salary range is $150,000 to $250,000 annually.