AI Platform Site Reliability Engineer

Montreal | 28 days ago | Full-time | External
Negotiable
Skills Required:
• Production experience in SRE / infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
• Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
• Solid experience in capacity planning, performance tuning, scaling, and incident response
• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
• Excellent communication, documentation, and cross-team collaboration skills
• Proven track record of reducing operational toil via automation