MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Senior High Performance Computing Engineer

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | Abu Dhabi | ¥60,000 / month | Posted 10 days ago | Full-time
Impact on Business

As an HPC expert, you will play a pivotal role in advancing our organization's computational capabilities and accelerating the development of cutting-edge deep learning solutions. You will be responsible for managing and optimizing the performance of GPU clusters and for implementing distributed computing strategies to train large-scale deep learning models efficiently.

Job Responsibilities

GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.

Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and shorter training times.

Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.

Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into existing deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet).

Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements.

Security and Compliance: Implement security measures to protect GPU clusters and data while adhering to industry best practices and compliance standards.

Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.

Documentation: Create and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.

Requirements

English proficiency: level 6 or above.

Bachelor's degree in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.

3+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.

Strong expertise in distributed deep learning and parallel training techniques.

Proficiency in popular deep learning frameworks such as TensorFlow, PyTorch, or MXNet.

Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).

Knowledge of performance profiling and optimization tools for HPC and deep learning.

Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).

Hiring Organization

MBZUAI, the UAE's artificial intelligence university.

How to Apply

1. Use the "Submit Resume" button in the upper right corner.

2. Send your resume to the 海拉拉 mailbox: [email protected]

