Senior Cloud Engineer

San Francisco · 1 day ago · Contractor · External
Negotiable
Title: Senior Cloud Engineer (AWS)
Location: Foster City, CA 94404
Type: Contract

About Smart IT Frame:
At Smart IT Frame, we connect top talent with leading organizations across the USA. With over a decade of staffing excellence, we specialize in IT, healthcare, and professional roles, empowering both clients and candidates to grow together.

Scope of Work

HPC Cluster Deployment
• Automate the deployment process of HPC clusters using CI/CD pipelines by utilizing GitHub pipeline and AWS Systems Manager
• Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently
• Set up and configure HPC clusters to meet specific requirements and workloads
• Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software
• Conduct regression testing to verify the functionality and performance of non-GXP HPC clusters

Workload Scheduler Management
• Install and configure workload managers and schedulers like LSF, SLURM, and PBS Pro
• Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes
• Develop and manage resource policies and rules to optimize cluster performance
• Configure and allocate resources such as CPU and memory, and profile applications for optimal performance
• Address and resolve issues related to schedulers, daemons, and license servers

Network and High-Performance Connectivity Management
• Install and configure HPC interconnect networks
• Design and configure the network topology for HPC clusters
• Ensure the maintenance and monitoring of InfiniBand connectivity
• Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet

Monitoring and Reports
• Produce daily health check reports for the HPC cluster
• Automate monitoring scripts to streamline the monitoring process
• Conduct periodic reviews of reports and audit trails

OS Administration and Management
• Install and configure operating systems for HPC clusters
• Address OS-related issues such as CPU, memory, and SWAP utilization, and perform application file system cleanup
• Ensure application service continuity by performing pre- and post-checks from both OS and application perspectives during planned and unplanned outages

Applications and Tools
• Install HPC libraries and tools such as MPI and compilers
• Install and configure HPC applications, both commercial off-the-shelf (COTS) and open source, and manage packages using Spack
• Apply patches and upgrades to HPC applications
• Resolve issues related to HPC applications

HPC Storage Management
• Administer and configure HPC storage systems
• Oversee the administration of HPC file systems
• Monitor and troubleshoot HPC storage systems
• Manage backup and tape library systems

Key Responsibilities

Cluster Management
• Install, configure, and maintain compute nodes, GPUs (NVIDIA), high-speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE)

Performance Tuning
• Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times

User Support
• Act as a technical expert for researchers, debugging jobs, resolving complex issues, and providing training on tools and best practices

Software Management
• Manage workload managers (Slurm, LSF), schedulers, software licensing (FlexLM), OpenPBS, containers (Singularity), and compilers

Infrastructure
• Administer high-speed interconnects (InfiniBand), storage (Lustre, CEPH), and potentially cloud/hybrid solutions
• Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes)

Automation
• Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks

Security & Policy
• Implement and enforce security policies, manage user access, and oversee lifecycle management

Essential Skills & Qualifications

Technical Expertise
• Strong Linux, Python, scripting (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing

HPC Domain Knowledge
• Experience with parallel file systems, workload management, and performance analysis tools

Problem Solving
• Excellent analytical and debugging skills for complex distributed systems

Communication
• Ability to explain complex technical issues to scientists and non-technical stakeholders

Experience
• Hands-on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads

Top Skills Required
• HPC – High Performance Computing
• AWS Cloud Services
• DevOps CI/CD
• Python

Apply today or share profiles at Gayathri.s@smartitframe.com