Responsibilities
HPC Network Architecture & Engineering
• Design, deploy, and maintain high-performance network architectures for HPC clusters, GPU servers, CPU nodes, and parallel storage systems.
• Configure and optimize interconnects, including InfiniBand, RoCE, and high-speed Ethernet (25/100/200GbE+), to support low-latency, high-throughput workloads.
• Design network topologies optimized for MPI traffic, NCCL collectives, and large-scale data transfers.
• Integrate networking solutions with parallel file systems such as Lustre, BeeGFS, or GPFS.
Network Operations, Monitoring & Troubleshooting
• Monitor network performance, capacity, and availability across all HPC facilities.
• Diagnose and resolve complex network issues affecting compute, storage, and distributed training workloads.
• Implement performance monitoring, alerting, and diagnostics using HPC-specific networking tools.
• Ensure maximum uptime and performance for research computing resources.
Security, Compliance & Reliability
• Implement and maintain network security controls aligned with data center and institutional standards.
• Ensure compliance with internal policies, safety requirements, and regulatory obligations.
• Develop preventive maintenance procedures and support disaster recovery and resilience planning for network infrastructure.
Upgrades, Capacity Planning & Innovation
• Plan and execute network upgrades, expansions, and technology refreshes with minimal disruption to research activities.
• Support capacity planning and forecasting for growing AI/HPC workloads.
• Evaluate emerging networking technologies relevant to AI and HPC (e.g., SmartNICs, CXL, GPUDirect RDMA).
Documentation & Collaboration
• Develop and maintain detailed network documentation, architecture diagrams, configuration records, and operational procedures.
• Collaborate with HPC system engineers, storage architects, MLOps, and research teams to ensure end-to-end system performance.
• Provide expert-level support and guidance on network-related issues to internal stakeholders.
Requirements
• Minimum 5 years of experience in network engineering, with at least 3 years in HPC or research computing environments.
• Extensive hands-on experience with high-performance networking technologies such as InfiniBand, Omni-Path, RoCE, or high-speed Ethernet.
• Proven expertise configuring and troubleshooting network infrastructure for parallel file systems (e.g., Lustre, GPFS, BeeGFS).
• Strong understanding of data-center networking concepts, including routing, switching, VLANs, RDMA, and network security.
• Experience designing networks optimized for MPI workloads and large-scale distributed AI training.
• Proficiency with network monitoring and diagnostic tools in HPC environments.
• Ability to work in a demanding, service-oriented environment with strong organization, communication, and collaboration skills.
Preferred Qualifications
• Experience with software-defined networking (SDN) in HPC contexts.
• Professional certifications such as CCNP, CCIE, or equivalent.
• Experience supporting HPC environments in academic or research institutions.
• Exposure to GPU-centric networking architectures and NVIDIA networking technologies.
About the Company
MBZUAI is seeking a highly skilled HPC Network Engineer to design, implement, and operate the high-performance networking infrastructure that underpins the university’s research computing environment. This role is critical to ensuring reliable, low-latency, and high-bandwidth connectivity across GPU and CPU clusters, parallel storage systems, and research platforms supporting large-scale AI/ML and robotics workloads.