Senior ML Platform Engineer

Singapore · Full-time
65.8k - 109.7k / mo
About the Team:
As part of our brand-new AI team, you'll help build cutting-edge AI and the platforms behind it, transforming how we deliver support and automation solutions to our internal teams and external consumers. We aim to unlock the knowledge that exists within Airwallex to power use cases across the organization. The team is crucial in driving innovation and setting the standard for future developments in this exciting new field.

Role & Project Scope:
We are seeking a skilled and passionate ML Platform Engineer to join our team and build the next generation of our machine learning infrastructure. You will be responsible for designing, implementing, and maintaining the core MLOps platform that empowers our Data Science and ML Engineering teams to rapidly develop, deploy, and monitor high-performance models at scale. Crucially, you will contribute to the evolution of our unified AI Platform, covering both traditional ML and our growing LLM (Large Language Model) platform.

What You'll Do:
• Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and cloud services.
• Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently.
• Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Delivery, and Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines.
• Serving Infrastructure: Build highly available, low-latency, high-throughput model serving infrastructure.
• Observability: Implement robust monitoring, alerting, and logging to track infrastructure health, model performance, and data/model drift.
• Tooling & Support: Evaluate, integrate, and support ML tools such as feature stores and distributed model training pipelines.
• Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments.
• Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models.

What You Need to Have:
• 5+ years in backend software development, including 2+ years focused on AI/ML platform or MLOps infrastructure.
• Deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management.
• Proven experience designing and implementing low-latency model serving solutions.
• Proficiency in Python and skill in writing high-quality, maintainable code.
• Experience designing and developing large-scale distributed systems with high concurrency, low-latency inference, and high availability.
• Excellent communication and mentoring abilities.
• A relevant degree in Computer Science, Mathematics, or a related field.

Preferred Qualifications:
• Familiarity with distributed compute/training frameworks (e.g., Ray, Spark).
• Experience configuring and managing ML workflows on cloud infrastructure (e.g., Kubernetes, Kubeflow).
• Working knowledge of LLM serving optimization (e.g., vLLM, TGI, Triton) and GPU resource management.