The Cloud Services Operations Lead will be a critical leader within our Cloud Shared Services team, responsible for the day-to-day operational excellence, stability, and continuous improvement of our multi-cloud (primarily AWS and Azure) environments. This role requires a strong blend of technical expertise in cloud operations, a deep understanding of IT service management (ITSM) best practices, and proven leadership skills to manage a team of cloud operations engineers. The successful candidate will ensure that our cloud services are delivered efficiently, securely, and in accordance with agreed service level agreements (SLAs).
Key Responsibilities:
• Operational Leadership:
• Lead, mentor, and develop a team of cloud operations engineers, fostering a culture of continuous learning, collaboration, and high performance.
• Oversee daily operations of our multi-cloud environments (AWS, Azure, and others as applicable), ensuring high availability, performance, and reliability of all cloud services.
• Implement and enforce operational best practices, standards, and procedures for cloud infrastructure and platform management.
• Manage on-call rotations and ensure effective incident response and problem resolution.
• Service Management & Performance:
• Define, monitor, and report on key performance indicators (KPIs) and service level agreements (SLAs) for all cloud services.
• Proactively identify and address potential operational issues, performance bottlenecks, and capacity constraints.
• Drive continuous improvement initiatives to optimize cloud operations, reduce manual effort, and enhance service delivery.
• Collaborate with internal customers to understand their evolving needs and ensure our cloud services meet their requirements.
• Incident, Problem, and Change Management:
• Establish and mature robust incident management processes, ensuring timely resolution and effective communication during outages.
• Implement and run a problem management process to identify the root causes of incidents and prevent recurrence.
• Oversee change management processes for cloud infrastructure and services, ensuring proper planning, testing, and execution to minimize risk.
• Conduct post-incident reviews (PIRs) and implement corrective actions.
• Monitoring, Alerting, and Automation:
• Ensure comprehensive monitoring and alerting systems are in place for all cloud resources and services.
• Drive automation initiatives using Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation, ARM templates) and scripting (e.g., Python, PowerShell) to streamline operational tasks and improve efficiency.
• Develop and maintain runbooks and operational documentation.
• Cost Optimization & Governance:
• Monitor and optimize cloud spending, identifying cost-saving opportunities without compromising performance or reliability.
• Ensure adherence to cloud governance policies, security standards, and compliance requirements (e.g., ISO 27001, SOC 2, industry-specific regulations).
• Work closely with finance and procurement teams to manage cloud expenditures.
• Collaboration & Stakeholder Management:
• Partner closely with architecture, engineering, security, and development teams to ensure seamless deployment and operation of cloud services.
• Communicate effectively with internal stakeholders, providing regular updates on operational status, incidents, and improvement initiatives.
• Act as a subject matter expert for cloud operations within the organization.
Qualifications:
• Education: Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field; or equivalent practical experience.
• Experience:
• 5+ years of progressive experience in cloud operations, with at least 3 years in a dedicated cloud operations or SRE role focusing on AWS and Azure.
• 1+ years of experience leading and managing a team of operations engineers.
• Proven experience with large-scale, highly available, and fault-tolerant cloud environments.
• Extensive experience with cloud monitoring tools (e.g., CloudWatch, Azure Monitor, Datadog, Prometheus, Grafana).
• Strong practical experience with Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation, ARM templates).
• Proficiency in scripting languages (e.g., Python, PowerShell, Bash).
• Solid understanding of networking concepts (TCP/IP, DNS, VPNs, Load Balancing, Firewalls) in a cloud context.
• Experience with containerization technologies (e.g., Docker, Kubernetes) is a strong plus.
• Familiarity with CI/CD pipelines and DevOps principles.
• Certifications (Preferred):
• AWS Certified Solutions Architect – Associate/Professional
• Microsoft Certified Azure Administrator Associate / Azure Solutions Architect Expert
• ITIL Foundation or higher certification
Please refer to U3’s Privacy Notice for Job Applicants/Seekers at https://u3infotech.com/privacy-notice-job-applicants/. When you apply, you voluntarily consent to the collection, use and disclosure of your personal data for recruitment/employment and related purposes.