Site Reliability Engineering (SRE) Manager

Singapore | 2 days ago | Full-time | External
Negotiable
Situated in the heart of Singapore's Central Business District, Rakuten Asia Pte. Ltd. is Rakuten's Asia Regional headquarters. Established in August 2012 as part of Rakuten's global expansion strategy, Rakuten Asia comprises various businesses that provide essential value-added services to Rakuten's global ecosystem. Through advertisement product development, product strategy, and data management, among others, Rakuten Asia is strengthening Rakuten Group's core competencies to take the lead in an increasingly digitalized world.

Rakuten Group, Inc. is a global leader in internet services that empower individuals, communities, businesses, and society. Founded in Tokyo in 1997 as an online marketplace, Rakuten has expanded to offer services in e-commerce, fintech, digital content, and communications to approximately 1.7 billion members around the world. The Rakuten Group has nearly 32,000 employees and operations in 30 countries and regions. For more information visit https://global.rakuten.com/corp/.

The Marketing Cloud Platform Department (MCPD) drives Rakuten's marketing product strategy, executes product development, and ensures successful implementation. We empower Rakuten's internal marketing teams by creating engaging, respectful, and cost-efficient marketing platforms that prioritize our customers. Leveraging the Rakuten Ecosystem, we offer comprehensive marketing solutions, including campaign management, multichannel communication, and personalization. As a team of over 150 experts across Japan, India, and Singapore, we pride ourselves on being a technology-driven organization that shares knowledge within the Rakuten Tech community.

As an SRE Manager in MCPD, you will lead a team of Site Reliability Engineers responsible for ensuring the reliability, scalability, and performance of our marketing cloud platforms. You will drive operational excellence by implementing best practices in observability, incident management, and automation. This role bridges engineering and operations, requiring both strong technical expertise and people management skills to build and maintain highly available systems that serve millions of Rakuten's customers globally.

Responsibilities:
• Lead, mentor, and grow a team of Site Reliability Engineers across multiple locations (Singapore, Japan, India), fostering a culture of collaboration, continuous learning, and operational excellence
• Define and drive SRE strategy, including SLO/SLI frameworks, error budgets, and reliability targets aligned with business objectives and customer expectations
• Establish and improve incident management processes, including on-call rotations, escalation procedures, and blameless post-mortem practices to minimize MTTR and prevent recurring issues
• Collaborate with development teams to embed reliability practices into the software development lifecycle, advocating for design reviews, chaos engineering, and production readiness reviews
• Design and implement comprehensive observability solutions (monitoring, logging, tracing, alerting) to provide actionable insights into system health and performance
• Drive automation initiatives to reduce toil, improve deployment reliability, and enable self-service capabilities for engineering teams
• Partner with Architecture and Platform teams to ensure infrastructure decisions support scalability, fault tolerance, and cost optimization goals
• Manage capacity planning and performance optimization for critical marketing platforms handling high-volume campaign executions and real-time personalization
• Report on reliability metrics, incident trends, and operational health to leadership, translating technical insights into business impact assessments

Required Qualifications:
• 8+ years of experience in software engineering, DevOps, or site reliability engineering, with at least 3 years in a people management role
• Proven track record of building and leading high-performing SRE or platform engineering teams in a distributed, multi-timezone environment
• Deep expertise in cloud platforms (GCP preferred, AWS/Azure acceptable) including compute, networking, storage, and managed services
• Strong knowledge of containerization and orchestration technologies (Kubernetes, Docker) and Infrastructure as Code (Terraform, Ansible)
• Hands-on experience with observability tools and practices (Prometheus, Grafana, Datadog, ELK Stack, or similar) and defining meaningful SLOs/SLIs
• Experience with CI/CD pipelines, deployment strategies (blue-green, canary), and release engineering best practices
• Strong programming/scripting skills in languages such as Python, Go, or Java for automation and tooling development
• Excellent communication skills with the ability to collaborate effectively across engineering, product, and business stakeholders
• Strong incident management experience with demonstrated ability to lead high-pressure situations calmly and effectively

Nice to Have:
• Experience with big data technologies (Hadoop, Spark, Kafka) and data pipeline reliability
• Familiarity with marketing technology platforms, email delivery systems, or customer data platforms
• Knowledge of database administration and optimization (PostgreSQL, MySQL, Redis, Couchbase)
• Experience with chaos engineering practices and tools (Chaos Monkey, Litmus, Gremlin)
• Certifications such as Google Cloud Professional Cloud Architect, AWS Solutions Architect, or Kubernetes Administrator (CKA)
• Japanese language proficiency is a plus for collaboration with Japan-based teams

Rakuten is an equal opportunities employer and welcomes applications regardless of sex, marital status, ethnic origin, sexual orientation, religious belief, or age.