At JDA TSG, we equip many of the world’s major brands with top-tier specialized talent, business process expertise, and innovations that drive their organizations in exciting new directions. We have established a reputation for bringing exceptional focus, flexibility, and confidence to every client we serve.
We have an immediate opportunity for a motivated and energetic Director of Site Reliability Engineering/DevOps with a strong sense of ownership and technical ability. Our client has a 100% cloud-based infrastructure and is seeking a tech leader with strong experience in Infrastructure as Code, automation, CI/CD, containers, AWS, and DevOps best practices to lead their DevOps/Site Reliability Engineering team.
Excellent communication skills are desired, as the TechOps team has developed a strong, close working relationship with both development owners and product owners to define clear objectives and deliver fast, robust, and future-proof results.
The ideal candidate has a very strong sense of ownership and a passion for learning. This position will report directly to the Vice President of Technology - Operations & CyberSecurity, who will rely on the Director - DevOps & SRE to build, lead, and manage the DevOps/SRE team and to consistently track and report on its progress to key stakeholders.
Primary Accountabilities:
• This DevOps/SRE Engineering leader will be responsible for managing the cloud infrastructure, the underlying ecosystem of services, and all associated components, including owning and driving the Major Incident Management process
• Mentor and guide the professional and technical development of engineers on your team and build a culture of accountability while setting the strategic direction
• Work with development teams within and across Agile development processes to design, develop, test, implement, and support technical solutions across a full stack of development tools and technologies
• Lead efforts to ensure the availability, resilience, and scalability of your solutions
• Stay on top of tech trends, experiment with and learn new technologies, participate in internal and external technology communities, and mentor members as needed
• Drive the automation of deployment, configuration management, and monitoring processes to improve efficiency and reduce manual intervention
• Review and streamline DevOps processes, tools, and platforms
• Evaluate and select third-party tools and services that align with the organization's needs
• Develop and maintain disaster recovery plans to ensure business continuity
• Partner with the Security Team to ensure that HIPAA, NIST, and CIS controls are implemented and maintained within all environments
• Perform additional tasks as assigned.
The experience you need to thrive in the role:
• Site Reliability Engineering principles, including setting and managing Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgeting
• Advanced skills with Terraform and CI/CD tools such as GitHub Actions or Jenkins
• Extensive experience with AWS managed service offerings.
• Must know ECS Fargate, EC2, S3, RDS, Lambda, CloudFront, CloudWatch, X-Ray, and EventBridge event buses
• New Relic or similar APM tools
• Software monitoring and log aggregation tools.
• Strong sense of ownership and troubleshooting skills.
• Advanced knowledge of Linux and Windows operating systems
• 3+ years working with DNS and networking concepts to enable efficient communication, scalability, security, and automation
• Strong working knowledge of Docker or Kubernetes
• Designing event-driven architectures and applications
• You are not afraid to question existing processes and solutions, yet you have a keen sense of business value and focus on the right priorities
• 8+ years in a software development environment with DevOps/SRE and CI/CD engineering responsibility and experience
• 4+ years managing direct reports and a geographically dispersed team
• 5+ years working with AWS
• 3+ years applying Google’s Site Reliability Engineering (SRE) methodologies, including establishing, tracking, and reporting daily metrics to management and instilling a “manage by metrics” framework
• 5+ years in a Software Engineering, SRE, or DevOps discipline
• 3+ years writing Terraform, preferably modules
• “Containerizing” legacy applications.
• Strong communication skills and experience working with technical leaders and business/product owners
• Be part of the team: fully capable of reviewing the team’s work, offering solutions and suggestions, and troubleshooting and resolving issues
• Strong troubleshooting skills, with the ability to come up with “outside the box” solutions in a timely, cost-effective manner
• Demonstrable track record of dealing well with ambiguity, prioritizing needs, and delivering measurable results in an agile environment
Education Requirements
• Bachelor’s degree in Computer Science or a related field, or an equivalent college degree with 5+ years of relevant experience