Senior Site Reliability Engineer
Job ID: 6630
Location: SLAC - Menlo Park, CA
Full-Time
Regular
Join the Data Management (DM) team at the Vera C. Rubin Observatory, one of modern astronomy's defining missions. The Rubin Observatory is a new astronomy facility in Chile designed to create a 10-year time-lapse map of the southern sky through the Legacy Survey of Space and Time (LSST).
As part of this team, you'll design, operate, and sustain the systems that process Rubin's data in near real time. LSST will generate 15 TB of raw pixels per night with its 8-meter mirror and 3.2 gigapixel camera, creating one of the most demanding petascale data challenges in science.
The Data Management System's Prompt Processing Framework identifies and distributes Alerts for every astrophysical object that moves, changes, or appears in the sky within minutes of observation. These alerts include potentially hazardous asteroids, supernovae, and entirely new classes of transient phenomena. Your work will directly enable astrophysical discoveries by keeping Rubin's alerts flowing.
You will join a distributed team of roughly 80 scientists and engineers building and operating Rubin's petascale data management systems. Our work spans large-scale image processing, distributed databases, and production services. Python is our lingua franca, and we develop our software openly on GitHub under an open-source license.
Your role:
You will own the reliability and robustness of Rubin Observatory's Prompt Processing Framework, the system responsible for detecting and distributing near-real-time alerts for transient and moving objects in the night sky. The Prompt Processing Framework runs on Kubernetes, with event-driven scaling using Kubernetes Event-Driven Autoscaling (KEDA) integrated with Redis Streams. It interfaces with PostgreSQL databases and Kafka to ingest data and publish alerts to the global astronomy community.
Your responsibilities:
• Ensure, through both architecture and practice, the reliable operation of the near-real-time data processing pipeline and timely delivery of alerts to downstream brokers.
• Design and develop software that reduces operational risk and improves system resilience, scalability, and usability, including addressing failure modes, error handling, and contention in shared resources.
• Improve system performance and resilience by applying architectural and systems-level optimizations to increase throughput and reduce end-to-end latency.
• Operate DevOps-oriented continuous deployment of services using modern distributed systems tooling and development practices (e.g., Kubernetes, Helm, ArgoCD, Kafka, Redis).
• Develop monitoring dashboards and alerts for the prompt processing service and work with teammates to design and implement a sustainable on-call rotation that provides coverage during the start of observing hours in Chile (typically 2-5pm Pacific Time), with limited off-hours responsibility.
• Define KPIs and metrics for observability and accountability of the pipeline.
• Participate in the collective engineering activities of the team, including performing code reviews, acting as a troubleshooting buddy, participating in design discussions, and writing documentation to effectively capture and communicate architectural and implementation choices.
• Collaborate with members of the Data Management team to identify opportunities to improve tools, workflows, and operational practices.
• Share responsibility with the broader team for the overall success of the Data Management system, beyond the Prompt Processing Framework.
Tech Stack
The Prompt Processing Framework is built on a modern, cloud-native foundation. It runs on Kubernetes, with deployments managed via Helm and ArgoCD, and uses event-driven scaling through KEDA and Redis Streams. The system integrates with PostgreSQL and Kafka to ingest data and distribute alerts, with additional databases including Cassandra and InfluxDB. Our primary development language is Python, and our code is developed openly under an open-source model.
To be successful in this position you will bring:
• Bachelor's degree and eight years of relevant experience, or a combination of education and relevant experience designing and operating distributed systems at scale in production environments.
• Experience working in an SRE, DevOps, or data-intensive systems role, with responsibility for building, operating, and improving robust services.
• Experience engaging with modern production infrastructure (e.g., containerized services, messaging systems, and databases; see above for our current tech stack), with the ability to learn and apply new tools quickly in a production environment.
• Familiarity with contemporary distributed service architectures, including service-to-service communication patterns, common failure modes, and system behavior under load and scale.
• Fluency in at least one modern programming language (Python preferred) with experience working across the boundary between software engineering and operations.
• Experience working with large-scale datasets or high-throughput data processing systems, and an understanding of the operational challenges that come with data volume and velocity.
• Ability to communicate clearly with engineers and scientists from diverse backgrounds, including explaining technical concepts, participating in design discussions, and documenting systems and decisions.
• Comfort working with a high degree of autonomy, taking ownership of technical decisions and execution, while being supported by an experienced team with clear priorities and goals.
We expect candidates to bring strength in some of these areas and curiosity to grow in others.
SLAC employee competencies:
• Effective Decisions: Uses job knowledge and solid judgment to make quality decisions in a timely manner.
• Self-Development: Pursues a variety of venues and opportunities to continue learning and developing.
• Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes.
• Initiative: Pursues work and interactions proactively with optimism, positive energy, and motivation to move things forward.
• Adaptability: Flexes as needed when change occurs, maintains an open outlook while adjusting and accommodating changes.
• Communication: Ensures effective information flow to various audiences and creates and delivers clear, appropriate written, spoken, presented messages.
• Relationships: Builds relationships to foster trust, team collaboration, and a positive climate to achieve common goals.
Physical requirements and working conditions:
• Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of his or her job.
• Given the nature of this position, SLAC is open to on-site, hybrid, and remote work options.
Work standards:
• Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
• Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1, General Policy and Responsibilities: http://www-
• Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide,
Classification Title: Software Developer 3
Duration: Regular Continuing
Job code: 4823
The expected pay range for this position is $137,773 to $194,585 per annum. SLAC National Accelerator Laboratory/Stanford University provides pay ranges representing its good faith estimate of the salary the university reasonably expects to pay for a position upon hire. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs. At SLAC/Stanford, base pay represents only one aspect of the comprehensive rewards package.
SLAC National Accelerator Laboratory is an Affirmative Action / Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All staff at SLAC National Accelerator Laboratory must be able to demonstrate the legal right to work in the United States. SLAC is an E-Verify employer.