Senior High Performance Computing Engineer
Impact on Business
As an HPC Expert, you will play a pivotal role in advancing our organization's computational capabilities and accelerating the development of cutting-edge deep learning solutions. You will be responsible for managing and optimizing the performance of GPU clusters and implementing distributed computing strategies to efficiently train large-scale deep learning models.
Job Responsibilities
GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times.
Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into existing deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet).
Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements.
Security and Compliance: Implement security measures to protect GPU clusters and data while adhering to industry best practices and compliance standards.
Troubleshooting and Support: Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies, and provide efficient technical support to users.
Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.
Requirements
English proficiency: CET-6 (College English Test Band 6) or above.
Bachelor’s degree in Computer Science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
3+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
Strong expertise in distributed deep learning and parallel training techniques.
Proficiency in popular deep learning frameworks like TensorFlow, PyTorch, or MXNet.
Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
Knowledge of performance profiling and optimization tools for HPC and deep learning.
Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
Hiring Organization
AI King University, UAE (阿联酋AI国王大学)
How to Apply
1. Click the "Submit Resume" button in the upper-right corner.
2. Email your resume to Hirelala at cv@hirelala.com.