Senior Site Reliability Engineer

Marathahalli, Bangalore

6 years of experience

Site Reliability Engineering | Python course | Kubernetes | Prometheus (Software) | Alert Application | Grafana | Elastic (ELK) Stack | Opensearch | Loki | Open telemetry collector | Google Cloud Platform (GCP) | Google Kubernetes Engine (GKE) | Amazon Web Services (AWS) | AWS Elastic Kubernetes Service (EKS) | Docker Container | Helm | Terraform | Ansible | Packer | Jenkins

Job Description:

🔎 Role Overview

We are looking for a hands-on Senior SRE with deep expertise in Observability, Kubernetes, and Cloud Platforms. This role focuses on building and operating highly reliable, scalable, and observable systems in GCP (preferred) and AWS environments.

______________

🔹 Key Responsibilities

Reliability & Operations

• Design and operate highly available Kubernetes-based systems

• Define & manage SLOs, SLIs, and Error Budgets

• Lead incident response, RCA, and blameless postmortems

• Improve platform reliability through automation

Observability (Core Focus)

• Build centralized observability platforms (metrics, logs, traces)

• Hands-on experience with Prometheus, Alertmanager, and Grafana is a must

• Logging and tracing using ELK / OpenSearch, Loki, and OpenTelemetry

• Cloud-native monitoring (GCP Monitoring preferred)

• Define actionable, low-noise alerting standards

Cloud & Platform Engineering

• Infrastructure on GCP (GKE preferred) / AWS (EKS)

• Kubernetes cluster operations

• Helm deployments & Docker workloads

• Infra automation using Terraform / Ansible / Packer

Automation & Tooling

• Strong Python coding for reliability tooling

• Build internal tools for SLO tracking & incident workflows

• Integrate observability into CI/CD (Jenkins)

Leadership

• Mentor engineers

• Influence reliability architecture

• Collaborate with platform & cloud teams

______________

✅ Mandatory Skills

SRE | Python (Coding) | Kubernetes | ELK | Prometheus | Grafana | AWS/GCP | Docker | Helm | Terraform | Linux | Jenkins CI/CD

⭐ Nice to Have

Splunk | Datadog | Cribl | Vector | OpenTelemetry | Multi-cloud | Platform Security

______________

📅 Project Highlights

✨ Build a centralized observability platform

📉 Reduce MTTR using SLO-driven engineering

🚨 Lead production incident response

⚡ Optimize performance, scalability & cloud cost

______________