Senior Site Reliability Engineer (SRE)
Location: Chicago, IL (Onsite)
Type: Contract
Role Overview:
We are seeking a Senior Site Reliability Engineer (SRE) with strong expertise in AWS infrastructure, automation, observability, and production support. The ideal candidate will bring a blend of DevOps and SRE practices, ensuring our systems remain resilient, scalable, and cost-efficient. This role requires hands-on technical depth, proactive problem-solving, and the ability to embed reliability engineering across development teams.
Key Responsibilities:
• Design, implement, and maintain secure, scalable, and highly available AWS infrastructure.
• Build and enhance CI/CD pipelines and Infrastructure as Code (IaC) solutions using Terraform and Harness.
• Implement and manage monitoring, logging, alerting, and distributed tracing using tools such as Dynatrace and Datadog.
• Troubleshoot production incidents, conduct blameless postmortems, and strengthen incident response processes.
• Optimize systems for performance, cost efficiency, and reliability.
• Drive chaos engineering and resilience testing initiatives.
• Collaborate with developers to define SLAs and SLOs and to manage error budgets.
• Mentor junior SREs and promote DevOps/SRE best practices across the organization.
Required Skills & Experience:
• 8+ years of experience in DevOps/SRE roles with a strong focus on AWS.
• Proven expertise in AWS services and infrastructure automation.
• Strong hands-on experience with Terraform, Harness, or similar IaC and CI/CD tools.
• Advanced knowledge of monitoring & observability platforms (Dynatrace, Datadog, Prometheus, Grafana, etc.).
• Deep understanding of incident response, disaster recovery, and reliability frameworks.
• Solid coding/scripting skills in Python, Bash, or similar languages.
• Experience with chaos engineering, resilience testing, and fault tolerance design.
• Strong collaboration, leadership, and mentoring capabilities.
Preferred Qualifications:
• Familiarity with Kubernetes, Docker, and container orchestration.
• Experience with FinOps practices (cloud cost optimization).
• Background in distributed systems, scalability, and fault-tolerant architectures.