Senior Site Reliability Engineer
Marathahalli, Bangalore
6 years of experience
🔎 Role Overview
We are looking for a hands-on Senior SRE with deep expertise in Observability, Kubernetes, and Cloud Platforms. This role focuses on building and operating highly reliable, scalable, and observable systems in GCP (preferred) and AWS environments.
______________
🔹 Key Responsibilities
Reliability & Operations
• Design and operate highly available Kubernetes-based systems
• Define & manage SLOs, SLIs, and Error Budgets
• Lead incident response, RCA, and blameless postmortems
• Improve platform reliability through automation
Observability (Core Focus)
• Build centralized observability platforms (metrics, logs, traces)
• Hands-on experience with Prometheus, Alertmanager, and Grafana is a must
• Logging and tracing using ELK / OpenSearch, Loki, and OpenTelemetry
• Cloud-native monitoring (GCP Monitoring preferred)
• Define actionable, low-noise alerting standards
Cloud & Platform Engineering
• Infrastructure on GCP (GKE preferred) / AWS (EKS)
• Kubernetes cluster operations
• Helm deployments & Docker workloads
• Infra automation using Terraform / Ansible / Packer
Automation & Tooling
• Strong Python coding for reliability tooling
• Build internal tools for SLO tracking & incident workflows
• Integrate observability into CI/CD (Jenkins)
Leadership
• Mentor engineers
• Influence reliability architecture
• Collaborate with platform & cloud teams
______________
✅ Mandatory Skills
SRE | Python (Coding) | Kubernetes | ELK | Prometheus | Grafana | AWS/GCP | Docker | Helm | Terraform | Linux | Jenkins CI/CD
⭐ Nice to Have
Splunk | Datadog | Cribl | Vector | OpenTelemetry | Multi-cloud | Platform Security
______________
📅 Project Highlights
✨ Build a centralized observability platform
📉 Reduce MTTR using SLO-driven engineering
🚨 Lead production incident response
⚡ Optimize performance, scalability & cloud cost
______________