Director of Site Reliability Engineering/DevOps

San Francisco | Posted 23 months ago | Full-time | External
Negotiable
At JDA TSG, we equip many of the world’s major brands with top-tier specialized talent, business process expertise, and innovations that drive their organizations in exciting new directions. We have established a reputation for bringing exceptional focus, flexibility, and confidence to every client we serve.
We have an immediate opportunity for a motivated and energetic Director of Site Reliability Engineering/DevOps with a strong sense of ownership and technical ability. Our client has a 100% cloud-based infrastructure and is seeking a tech leader with strong experience in Infrastructure as Code, automation, CI/CD, containers, AWS, and DevOps best practices to lead their DevOps/Site Reliability Engineering team. Excellent communication skills are desired, as the TechOps team has developed a close working relationship with both development owners and product owners to define clear objectives and deliver fast, robust, and future-proof results. The ideal candidate has a very strong sense of ownership and a passion for learning.
This position will report directly to the Vice President of Technology - Operations & CyberSecurity, who will rely on the Director - DevOps & SRE to build, lead, and manage the team and to consistently track and report on DevOps/SRE progress for key stakeholders.
Primary Accountabilities:
• Manage the cloud infrastructure and the underlying ecosystem of services and all associated components, including owning and driving the Major Incident Management process
• Mentor and guide the professional and technical development of engineers on your team and build a culture of accountability while setting the strategic direction
• Work with development teams within and across Agile development processes to design, develop, test, implement, and support technical solutions across a full stack of development tools and technologies
• Lead the availability, resilience, and scalability of your solutions
• Stay on top of tech trends, experiment with and learn new technologies, participate in internal and external technology communities, and mentor members as needed
• Drive the automation of deployment, configuration management, and monitoring processes to improve efficiency and reduce manual intervention
• Review and streamline the DevOps process, tools, and platforms
• Evaluate and select third-party tools and services that align with the organization's needs
• Develop and maintain disaster recovery plans to ensure business continuity
• Partner with the Security Team to ensure that HIPAA, NIST, and CIS controls are implemented and maintained within all environments
• Perform additional tasks as assigned
The Experience you need to thrive in the role:
• Site Reliability Engineering principles, including setting and managing Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgeting
• Advanced skills with Terraform and CI/CD tools such as GitHub Actions or Jenkins
• Extensive experience with AWS managed service offerings
• Must know ECS Fargate, EC2, S3, RDS, Lambda, CloudFront, CloudWatch, X-Ray, and EventBridge event buses
• New Relic or other similar APM tools
• Software monitoring and log aggregation tools
• Strong sense of ownership and troubleshooting skills
• Advanced knowledge of Linux and Windows operating systems
• 3+ years working with DNS and network concepts, enabling efficient communication, scalability, security, and automation
• Strong working knowledge of Docker or Kubernetes
• Designing event-driven architectures and applications
• You are not afraid to question any existing processes and solutions, yet you display a keen sense of business value and focus on the right priorities
• 8+ years in a software development environment with DevOps/SRE and CI/CD engineering responsibility and experience
• 4+ years managing direct reports and a geographically dispersed team
• 5+ years working with AWS
• 3+ years applying Google’s Site Reliability Engineering (SRE) methodologies, including establishing, tracking, and reporting on daily metrics for management and instilling a “manage by metrics” framework
• 5+ years in a Software Engineering, SRE, or DevOps discipline
• 3+ years writing Terraform, preferably modules
• “Containerizing” legacy applications
• Strong communication skills and experience working with Tech Leaders and business/product owners
• Be part of the team: be fully capable of reviewing the team's work, offering solutions and suggestions, and troubleshooting and resolving issues
• Strong troubleshooting skills, with the ability to come up with “outside the box” solutions in a timely, cost-effective manner
• Demonstrable track record of dealing well with ambiguity, prioritizing needs, and delivering measurable results in an agile environment
Education Requirements:
• Bachelor’s Degree in Computer Science or related field, or equivalent college degree with 5+ years of relevant experience