Responsibilities

HPC Network Architecture & Engineering
• Design, deploy, and maintain high-performance network architectures for HPC clusters, GPU servers, CPU nodes, and parallel storage systems.
• Configure and optimize high-speed interconnects, including InfiniBand, RoCE, and high-speed Ethernet (25/100/200GbE+), to support low-latency and high-throughput workloads.
• Design network topologies optimized for MPI traffic, NCCL collectives, and large-scale data transfers.
• Integrate networking solutions with parallel file systems such as Lustre, BeeGFS, or GPFS.

Network Operations, Monitoring & Troubleshooting
• Monitor network performance, capacity, and availability across all HPC facilities.
• Diagnose and resolve complex network issues affecting compute, storage, and distributed training workloads.
• Implement performance monitoring, alerting, and diagnostics using HPC-specific networking tools.
• Ensure maximum uptime and performance for research computing resources.

Security, Compliance & Reliability
• Implement and maintain network security controls aligned with data center and institutional standards.
• Ensure compliance with internal policies, safety requirements, and regulatory obligations.
• Develop preventive maintenance procedures and support disaster recovery and resilience planning for network infrastructure.

Upgrades, Capacity Planning & Innovation
• Plan and execute network upgrades, expansions, and technology refreshes with minimal disruption to research activities.
• Support capacity planning and forecasting for growing AI/HPC workloads.
• Evaluate emerging networking technologies relevant to AI and HPC (e.g., SmartNICs, CXL, GPUDirect RDMA).

Documentation & Collaboration
• Develop and maintain detailed network documentation, architecture diagrams, configuration records, and operational procedures.
• Collaborate with HPC system engineers, storage architects, MLOps, and research teams to ensure end-to-end system performance.
• Provide expert-level support and guidance on network-related issues to internal stakeholders.

Requirements
• Minimum 5 years of experience in network engineering, with at least 3 years in HPC or research computing environments.
• Extensive hands-on experience with high-performance networking technologies such as InfiniBand, Omni-Path, RoCE, or high-speed Ethernet.
• Proven expertise configuring and troubleshooting network infrastructure for parallel file systems (e.g., Lustre, GPFS, BeeGFS).
• Strong understanding of data-center networking concepts, including routing, switching, VLANs, RDMA, and network security.
• Experience designing networks optimized for MPI workloads and large-scale distributed AI training.
• Proficiency with network monitoring and diagnostic tools in HPC environments.
• Ability to work in a demanding, service-oriented environment with strong organization, communication, and collaboration skills.

Preferred Qualifications
• Experience with software-defined networking (SDN) in HPC contexts.
• Professional certifications such as CCNP, CCIE, or equivalent.
• Experience supporting HPC environments in academic or research institutions.
• Exposure to GPU-centric networking architectures and NVIDIA networking technologies.

About the Company
MBZUAI is seeking a highly skilled HPC Network Engineer to design, implement, and operate the high-performance networking infrastructure that underpins the university’s research computing environment. This role is critical to ensuring reliable, low-latency, and high-bandwidth connectivity across GPU and CPU clusters, parallel storage systems, and research platforms supporting large-scale AI/ML and robotics workloads.

HPC Network Engineer

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)