Title: Senior Cloud Engineer (AWS)
Location: Foster City, CA 94404
Type: Contract
About Smart IT Frame:
At Smart IT Frame, we connect top talent with leading organizations across the USA. With over a decade of staffing excellence, we specialize in IT, healthcare, and professional roles, empowering both clients and candidates to grow together.
Scope of Work
HPC Cluster Deployment
• Automate HPC cluster deployment through CI/CD pipelines built on GitHub pipelines and AWS Systems Manager
• Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently
• Set up and configure HPC clusters to meet specific requirements and workloads
• Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software
• Conduct regression testing to verify the functionality and performance of non-GxP HPC clusters
Workload Scheduler Management
• Install and configure workload managers and schedulers such as LSF, Slurm, and PBS Pro
• Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes
• Develop and manage resource policies and rules to optimize cluster performance
• Configure and allocate resources such as CPU and memory, and profile applications for optimal performance
• Address and resolve issues related to schedulers, daemons, and license servers
Network and High-Performance Connectivity Management
• Install and configure HPC interconnect networks
• Design and configure the network topology for HPC clusters
• Maintain and monitor InfiniBand connectivity
• Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet
Monitoring and Reports
• Produce daily health check reports for the HPC cluster
• Automate monitoring scripts to streamline the monitoring process
• Conduct periodic reviews of reports and audit trails
OS Administration and Management
• Install and configure operating systems for HPC clusters
• Address OS-related issues such as CPU, memory, and swap utilization, and perform application file system cleanup
• Ensure application service continuity by performing pre- and post-checks from both OS and application perspectives during planned and unplanned outages
Applications and Tools
• Install HPC libraries and tools such as MPI and compilers
• Install and configure HPC applications, both commercial off-the-shelf (COTS) and open source, and manage packages using Spack
• Apply patches and upgrades to HPC applications
• Resolve issues related to HPC applications
HPC Storage Management
• Administer and configure HPC storage systems
• Oversee the administration of HPC file systems
• Monitor and troubleshoot HPC storage systems
• Manage backup and tape library systems
Key Responsibilities
Cluster Management
• Install, configure, and maintain compute nodes, GPUs (NVIDIA), high-speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE)
Performance Tuning
• Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times
User Support
• Act as a technical expert for researchers: debug jobs, resolve complex issues, and provide training on tools and best practices
Software Management
• Manage workload managers and schedulers (Slurm, LSF, OpenPBS), software licensing (FlexLM), containers (Singularity), and compilers
Infrastructure
• Administer high-speed interconnects (InfiniBand), storage (Lustre, Ceph), and potentially cloud/hybrid solutions
• Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes)
Automation
• Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks
Security & Policy
• Implement and enforce security policies, manage user access, and oversee lifecycle management
Essential Skills & Qualifications
Technical Expertise
• Strong proficiency in Linux, Python, scripting and automation (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing
HPC Domain Knowledge
• Experience with parallel file systems, workload management, and performance analysis tools
Problem Solving
• Excellent analytical and debugging skills for complex distributed systems
Communication
• Ability to explain complex technical issues to scientists and non-technical stakeholders
Experience
• Hands-on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads
Top Skills Required
• HPC – High Performance Computing
• AWS Cloud Services
• DevOps CI/CD
• Python
Apply today or share profiles at Gayathri.s@smartitframe.com