Alpen Labs

Senior Site Reliability Engineer

NAMER, Full-time, Global

📊 Mid

Active, posted within the last 30 days

Job Description

[AI-summarized by JobStash]

You will lead the design, scalability, and reliability of core infrastructure. You will build and operate production AWS infrastructure, manage Kubernetes clusters, and own infrastructure as code with Terraform. You will implement observability with Prometheus, Grafana, and Loki, define SLIs and SLOs, run incident response and postmortems, automate deployments and CI/CD, and mentor engineers to raise operational standards while collaborating across distributed teams.

Requirements

●10+ years of experience in SRE or DevOps
●Deep expertise in AWS, Kubernetes, and Terraform in production
●Hands-on experience with Prometheus, Grafana, Loki, and alerting systems
●Proven ability to lead infrastructure or reliability-focused projects
●Strong understanding of distributed systems, networking, and Linux systems
●Experience working in globally distributed async teams
●Based in the United States
●Experience with Layer 1 or Layer 2 blockchain infrastructure (Bitcoin, Ethereum, Polygon, ZkSync) (nice to have)
●Familiarity with zero-knowledge systems or cryptographic infrastructure (nice to have)
●Experience supporting high-throughput, low-latency distributed systems (nice to have)
●Startup or high-growth experience (nice to have)

Responsibilities

●Design, build, and maintain highly available, scalable AWS infrastructure
●Lead Kubernetes architecture and operations for production systems
●Own and evolve infrastructure as code using Terraform
●Design and maintain observability systems using Prometheus, Grafana, and Loki
●Define and implement SLIs, SLOs, and error budgets
●Lead incident response, conduct postmortems, and drive reliability improvements
●Automate deployments, scaling, and system management
●Improve CI/CD pipelines and release processes
●Champion DevOps best practices
●Lead complex infrastructure and reliability projects end-to-end
●Improve on-call experience with clear alerting, documentation, and faster incident resolution
●Encourage teams to adopt DevOps by empowering engineers to use infrastructure tools independently
●Collaborate effectively with globally distributed teams

Tech Stack

error budget, automation, Kubernetes, incident response, observability, networking, AWS, SLO, SLI, CI/CD