Infrastructure Engineer for AI and HPC Solutions

Los Angeles | Posted 9 days ago | Full-time | External
Negotiable
We Are:
The Global Infrastructure Engineering AI & HPC team plays a crucial role in shaping the infrastructure for cutting-edge advancements in AI and High-Performance Computing (HPC). Our team blends technical expertise across cloud, on-premises, and hybrid environments to develop and maintain sophisticated infrastructure that supports high-performance workloads at scale. By delivering innovative solutions, we enable our key clients to achieve remarkable levels of performance, efficiency, and creativity. Our work spans the entire project lifecycle, from strategic planning and architecture to implementation and ongoing management, propelling modernization initiatives throughout the infrastructure framework. We collaborate with the ecosystem to leverage emerging technologies, foster growth, and transform industries. In this rapidly changing environment, our team is leading the charge, assisting enterprises in harnessing AI and HPC to drive transformative innovation and elevate infrastructure capabilities.

Key Responsibilities:
• Design and implement robust infrastructure solutions for HPC and AI, ensuring they meet specific industry performance and scalability standards.
• Deploy, configure, and oversee clusters utilizing XPU (CPU/GPU/accelerators) technologies through schedulers, VM/Kubernetes orchestration platforms, Slurm, and containerized services to provide Metal as a Service (MaaS), GPUaaS, and AIaaS.
• Enhance performance, scalability, energy efficiency, and cost-effectiveness of clusters across on-premises, cloud, and hybrid setups.
• Integrate AI and HPC platforms with existing IT systems, data pipelines, and security protocols.
• Manage, troubleshoot, and optimize infrastructure to ensure high availability, low-latency networking, and resilient workloads.
• Create and maintain detailed documentation, including architecture diagrams, configuration guidelines, and operational manuals.
• Provide technical support and guidance to users, optimizing the execution of HPC/AI tasks, large models, and simulations.

Travel may be required for this role, ranging from 25% to 100% depending on business needs and client requirements.

Required Skills and Qualifications:
• Minimum of 4 years of hands-on experience in designing, deploying, and managing HPC and AI infrastructure across on-premises, cloud, and hybrid environments in multiple sectors, including hyperscalers, neocloud, large enterprises, and Telco/Mobile, while serving critical industries like Financial Services, Life Sciences, Manufacturing, and Retail.
• At least 4 years of experience with accelerated computing architectures (GPUs, XPUs, DPUs), high-performance networking (InfiniBand, Ethernet), SONiC, and modern storage/data platforms (e.g., NVMe-oF, Lustre, GPFS, BeeGFS, VAST, DDN, Weka) for effective solution development.
• A minimum of 4 years in cluster management and orchestration (e.g., Slurm, Run:ai, Kubernetes, Docker), along with real-time performance monitoring and observability frameworks.
• At least 4 years working with cloud and virtualization platforms (e.g., AWS, Azure, GCP, VMware, Nutanix), with expertise in automation and optimization using scripting (Python, AI tools) in addition to foundational Infrastructure-as-Code tools like Terraform and Ansible.
• Minimum of 4 years of experience in implementing MLOps and DevSecOps frameworks to facilitate secure, automated, and reproducible workflows.
• A Bachelor's degree, or equivalent work experience (minimum of 12 years). Candidates with an Associate's degree must have at least 6 years of relevant work experience.

Preferred Skills and Qualifications:
• Experience managing deployments of large-scale GPU clusters (1,000+ GPUs) for HPC and AI workloads with diverse infrastructure services enabled.
• Familiarity with GPU computing libraries and accelerators (e.g., NVIDIA CUDA, Dynamo, AMD ROCm).
• Knowledge of AI and HPC networking (e.g., RoCE, InfiniBand, multi-planar/multi-rail designs, platform buffer architectures).
• Proficiency in Machine Learning and AI frameworks (e.g., TensorFlow, PyTorch, JAX), including experience with Jupyter notebooks and Google Colab environments.
• Experience with optimization techniques for managing HPC and AI workloads.
• Familiarity with DevOps practices and tools (e.g., Ansible, Terraform) for automating infrastructure processes.
• Industry certifications related to NVIDIA infrastructure, public cloud providers, or Data Science are a plus.

Please note that compensation at Accenture varies based on numerous factors, including office location, role, skill set, and experience level. We accept applications on an ongoing basis, with no fixed deadline for submission. For details on benefits and accommodation options, please refer to Accenture's resources.

Accenture is unwavering in its commitment to equal employment opportunities and values diversity within the workforce. All employment decisions are made without discrimination. We emphasize innovation, competitiveness, and creativity driven by our diverse team.