Infrastructure Site Reliability Engineering (Infra SRE) Lead

Hong Kong | 9 days ago | Full-time | External
Negotiable
The Infra SRE Lead is a senior technical and people leader responsible for the design, reliability, availability, security, and scalability of all infrastructure supporting our 24/7, regulated trading, custody, and payments platforms. This role demands expert-level knowledge of AWS to protect systems where downtime or failure has direct financial and security implications.

This role leads a cross-site Infra SRE team (HK + SZ), driving Infrastructure-as-Code, Kubernetes platforms, observability, and disaster recovery. This is a hands-on leadership role: 50% technical ownership + 50% team leadership & governance.

Key Responsibilities

1. Infrastructure Ownership & Reliability
• Own the reliability, scalability, and performance of core infrastructure across AWS.
• Architect, optimize, and manage Kubernetes platforms (EKS, multi-cluster, multi-region).
• Architect and manage secure, scalable, and cost-optimized network topologies using VPC, subnets, security groups, and PrivateLink.
• Ensure capacity planning, auto-scaling, and performance tuning across compute, storage, and networking.

2. Lead Infra SRE Team (HK + SZ)
• Manage and mentor a team of Infra SREs across two locations.
• Define team OKRs focusing on reliability, automation, and SLOs.
• Drive a strong engineering culture of documentation, runbooks, and proactive improvement.

3. Infrastructure as Code & Standardization
• Lead the design and implementation of complex, reusable Terraform modules to govern all cloud resources.
• Enforce infrastructure change governance, cost control, and compliance across multi-account AWS setups.

4. Observability, Monitoring, & Incident Response
• Own the observability stack (Prometheus / Grafana) to ensure full metrics, logging, and alerting coverage.
• Act as a technical escalation point for production incidents, leading troubleshooting and robust post-incident reviews.

5. Security, DR, and High Availability
• Implement DR and failover strategies across regions.
• Ensure high-availability design across Kubernetes, databases, and VPC networking.
• Partner with Security on IAM governance, hardening, and audit compliance.

Who We Are Looking For
• 8+ years of experience in a dedicated Site Reliability Engineering (SRE), Infrastructure, or Production Engineering role, with at least 3 years in a formal team lead or management position.
• 5+ years of hands-on, practical experience building and managing mission-critical infrastructure on AWS.
• Expert-level proficiency with Infrastructure as Code, specifically Terraform, used to manage large-scale, complex environments.
• Deep, architectural knowledge of core AWS services is required: VPC, EKS, IAM, KMS, and RDS.
• Proven experience in a high-availability (24/7) environment, preferably within financial services, trading, or a similarly regulated industry.
• Strong scripting skills (Python, Bash, Go preferred).
• Excellent leadership, communication, and cross-team collaboration skills.

Life at OSL
• Pioneer: Build the foundational technology for the future of Web3 with a listed industry leader.
• Impact: Your work directly shapes the security and scalability of our global digital asset platform.
• Talent: Work alongside and learn from the industry's best engineers and leaders.
• Growth: We invest in your career and development as much as you do.

How to Apply
If you're ready to build the future of finance with us, please apply with your resume.

About OSL Group
OSL Group (863.HK) is a leading global financial infrastructure platform bridging traditional finance and the digital asset economy through blockchain technology. The Group is dedicated to providing efficient, seamless, and regulatory-compliant financial services to individuals and businesses worldwide.
OSL delivers a comprehensive suite of regulated services through its licensed platforms: 24/7 OTC brokerage with deep liquidity, fiat gateways, and competitive pricing; omnibus brokerage solutions enabling traditional financial institutions to integrate digital assets; SOC 2 Type 2-certified custody with up to US$1 billion insurance protection; compliant retail trading channels; and wealth management solutions, including scheduled launches of tokenised treasuries and RWAs. Cross-border payment infrastructure via OSL Pay is in preparation. "Open, Secure, Licensed" are the principles OSL lives by. OSL is expanding its compliant infrastructure across Japan, Australia, and Europe, and potentially Southeast Asia, powering the next generation of global financial infrastructure.