Senior SRE for AI/ML HPC Infra

Toronto | 9 days ago | Full-time | External
Negotiable
A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize their HPC cluster operations. The role includes deploying infrastructure-as-code solutions and supporting research teams with cluster optimization. The ideal candidate will have over 5 years of experience in SRE or HPC operations, proficiency in Linux and Kubernetes, and expertise in Ceph storage deployments. Join us to work with cutting-edge GPU technology in a dynamic environment.