Research DevOps Engineer GEMINI Systems

Toronto | 10 days ago | Full-time | External
236 - 298 / hr
GEMINI (geminimedicine) is at the forefront of medical research and innovation, providing cutting-edge computational resources to researchers through our high-performance computing infrastructure. We operate a 100% Linux environment and are deeply committed to automating our infrastructure to deliver seamless, efficient services to our users.

We are seeking an experienced DevOps Engineer to join our team and champion the evolution of our HPC infrastructure. This role is pivotal in transforming our configuration management into a robust, scalable GitOps architecture. You will be responsible for designing and building the CI/CD pipelines and automation workflows that manage our entire scientific computing platform.

Our ideal candidate has a strong background in Python and tools like Ansible and Jenkins, a passion for Infrastructure as Code, and experience in implementing modern observability. Your work will directly support researchers by automating complex processes and ensuring our Slurm HPC environment is reliable, scalable, and secure.

Duties and Responsibilities:

CI/CD Pipeline Development and Infrastructure Automation (35%)
• Design, build, and maintain CI/CD pipelines using Jenkins and Ansible to automate deployment, configuration, and management of our entire HPC stack.
• Lead the transition to a GitOps methodology, ensuring all system configurations are version-controlled and deployed through automated, auditable pipelines.
• Expand and enhance existing Ansible playbooks to manage the full lifecycle of our environment, including Slurm cluster nodes, identity management, web applications, and databases.
• Develop and implement infrastructure as code (IaC) practices across all environments.
• Design and maintain disaster recovery and backup automation strategies.
• Work closely with team members to establish automation standards, coding techniques, and infrastructure best practices.
• Evaluate and integrate new tools and technologies to improve team productivity.
• Write detailed technical documentation and infrastructure automation plans.

Observability and Telemetry (25%)
• Implement and manage a comprehensive observability stack (e.g., VictoriaMetrics, Grafana, Vector) to provide deep insights into cluster health, job performance, and resource utilization.
• Design and maintain monitoring dashboards to communicate system health and performance metrics to end users.
• Monitor and analyze system logs, metrics, and telemetry data to identify performance bottlenecks and optimization opportunities.
• Provide direction and guidance on infrastructure and observability projects.
• Administer and optimize Slurm for large language model and GPU deep learning workloads.
• Troubleshoot complex technical issues across the entire infrastructure stack.

Workflow Automation and Integration (25%)
• Automate manual and recurring infrastructure-related tasks by integrating APIs from public-facing services like SmartSheet.
• Develop Python scripts for automation and system integration tasks.
• Work closely with technical team members and researchers to troubleshoot complex issues, optimize workflows, and ensure a seamless user experience.
• Build and maintain automated testing for infrastructure changes.
• Create self-service automation tools to empower users and reduce manual operational overhead.
• Integrate multiple systems and services to create streamlined automated workflows.

Security Integration and Compliance (15%)
• Embed security and compliance best practices into the CI/CD pipeline and all automation, ensuring systems adhere to healthcare data standards.
• Implement security controls and automated security scanning in deployment pipelines.
• Conduct regular security assessments and vulnerability management of our HPC infrastructure.
• Collaborate with security teams to ensure infrastructure meets organizational security requirements.
• Document all security configurations, policies, and compliance measures using version control.

Qualifications:

Experience: Minimum 3-5 years of experience in Linux systems administration, with a focus on high-performance computing (HPC) environments.
Technical Skills:
• Proficient in Python scripting for automation and system integration tasks.
• Proficiency in cluster management systems (Slurm, Kubernetes).
• Strong experience with Ansible and Jenkins.
Problem-Solving: Strong analytical and troubleshooting skills, with the ability to resolve complex technical issues.
Communication: Excellent verbal and written communication skills, with the ability to convey technical concepts to non-technical audiences.
Team Player: Ability to work collaboratively in a team environment and contribute to a culture of continuous improvement.
Bachelor's or Master's degree in Computer Science, Information Technology, or related field (or equivalent experience).

Why Join Us?

Innovative Environment: Work with cutting-edge technologies in a dynamic, research-driven setting.
Impactful Work: Contribute to critical medical research that makes a difference in people's lives.
Professional Growth: Opportunities for continuous learning and career development.
Collaborative Team: Join a team of passionate professionals committed to excellence and innovation.

Unity Health Toronto is committed to creating an accessible and inclusive organization. We strive to provide a recruitment process that is barrier-free and in compliance with the Accessibility for Ontarians with Disabilities Act (AODA) and the Ontario Human Rights Code. We understand that you may require an accommodation at any stage of the recruitment process. When you are contacted, please inform the Talent Acquisition Specialist and we will work with you to meet your accommodation needs. We want to emphasize that all accommodation requests are handled with the utmost confidentiality, respecting your privacy and dignity.