What You'll Do
• Define SLIs/SLOs, maintain error budgets, and drive platform reliability.
• Implement safe CI/CD with automated tests, blue/green and canary rollouts (Argo Rollouts), and automated rollbacks.
• Harden security: image signing, SBOM, secrets management, PodSecurity, NetworkPolicies, and just-in-time access.
• Improve observability: OpenTelemetry pipelines, log/trace correlation, dashboards, and SLO reporting.
• Optimize costs: right-size resources, Karpenter provisioning, HPA/VPA tuning, and FinOps practices.
• Lead incidents and postmortems; create runbooks, templates, and training.
• Partner with Product, Backend, and Security teams on capacity, compliance, and roadmap planning.
Tech You'll Work With
AWS, EKS, Argo CD and Argo Rollouts, Terraform/Terragrunt, GitHub Actions, Prometheus/Grafana, OpenTelemetry, Elastic APM, Secrets Manager, Cilium, Aurora/DynamoDB, SQS/SNS/Kafka.