Alpen Labs
Senior Site Reliability Engineer
NAMER | Full-time | Global
Mid
Active | Posted within the last 30 days
Job Description
[AI-summarized by JobStash]
You will lead the design, scalability, and reliability of core infrastructure. You will build and operate production AWS infrastructure, manage Kubernetes clusters, and own infrastructure as code with Terraform. You will implement observability with Prometheus, Grafana, and Loki, define SLIs and SLOs, run incident response and postmortems, automate deployments and CI/CD, and mentor engineers to raise operational standards while collaborating across distributed teams.
Requirements
- 10+ years of experience in SRE or DevOps
- Deep expertise in AWS, Kubernetes, and Terraform in production
- Hands-on experience with Prometheus, Grafana, Loki, and alerting systems
- Proven ability to lead infrastructure or reliability-focused projects
- Strong understanding of distributed systems, networking, and Linux systems
- Experience working in globally distributed, asynchronous teams
- Based in the United States
- Experience with Layer 1 or Layer 2 blockchain infrastructure (Bitcoin, Ethereum, Polygon, ZkSync) (nice to have)
- Familiarity with zero-knowledge systems or cryptographic infrastructure (nice to have)
- Experience supporting high-throughput, low-latency distributed systems (nice to have)
- Startup or high-growth experience (nice to have)
Responsibilities
- Design, build, and maintain highly available, scalable AWS infrastructure
- Lead Kubernetes architecture and operations for production systems
- Own and evolve infrastructure as code using Terraform
- Design and maintain observability systems using Prometheus, Grafana, and Loki
- Define and implement SLIs, SLOs, and error budgets
- Lead incident response, conduct postmortems, and drive reliability improvements
- Automate deployments, scaling, and system management
- Improve CI/CD pipelines and release processes
- Champion DevOps best practices
- Lead complex infrastructure and reliability projects end-to-end
- Improve the on-call experience with clear alerting, documentation, and faster incident resolution
- Encourage teams to adopt DevOps by empowering engineers to use infrastructure tools independently
- Collaborate effectively with globally distributed teams
Tech Stack
error budget, automation, Kubernetes, incident response, observability, networking, AWS, SLO, SLI, CI/CD