We are seeking a senior distributed machine learning (ML) research developer to join our team working on a novel AI safety agenda. In this role, you will work closely with ML research scientists to solve difficult training and inference problems using very large models.
Key Responsibilities
• Collaborate with researchers to accelerate research, model training and inference, and facilitate the use of large-scale models in distributed computing environments.
• Investigate performance bottlenecks, profile research experiment code, debug reported issues, and optimize the utilization of computing resources.
• Develop tools and libraries to simplify and orchestrate the use of distributed computing resources for research experiments.
• Establish, document, and maintain best practices for large-scale, distributed ML model development workflows.
Skills and Qualifications
• A degree in a relevant field (e.g., computer science, computer engineering, software engineering) is required. An advanced degree (master's or PhD) related to machine learning or distributed ML systems is preferred, but candidates who demonstrate exceptional ability and experience will be considered without one.
• 3+ years of experience designing and implementing distributed ML training frameworks, with recent experience using tools such as Megatron, DeepSpeed, HuggingFace Accelerate, FSDP, vLLM, and/or verl.
• Ability to collaborate effectively with cross-functional teams, document best practices, and stay up to date with the latest advancements in ML and software development.
• Experience with cloud platforms (e.g., AWS, GCP, Azure) and workload managers (e.g., Ray, SLURM).
• Experience with GPU profiling tools (e.g., PyTorch profiler, PyProf, NVIDIA Nsight).
• Familiarity with containerization, orchestration, and RPC tools (e.g., Docker, Kubernetes, gRPC).
• Familiarity with data infrastructures and platforms (e.g., vector databases).
• A track record of contributing to high-quality research projects in deep learning.
The title of Engineer is used for reference purposes and may or may not be the official title of the applicant based on jurisdiction.
What we offer
• The opportunity to contribute to a unique mission with a major impact.
• Comprehensive health benefits (including mental health and wellness management account)
• 20 days of vacation per year, available from your start date
• Employer contribution of 4% to your retirement savings, with no required employee matching
• Additional compensation equal to 8% of your salary, to apply towards further retirement savings or bonuses (independent of group and individual performance)
• A team of passionate world-class experts in their field
• A collaborative and inclusive work environment in our vibrant office space in the heart of Little Italy, in the trendy Mile-Ex district, close to public transportation