Hyperbolic

Senior Site Reliability Engineer

San Francisco, CA | Full-time | Global

📊 Mid

Active | Posted within the last 30 days

Job Description

[AI-summarized by JobStash]

You will ensure that the GPU marketplace and AI infrastructure run with high reliability, performance, and security. You will define and maintain SLOs, design monitoring and alerting, build automation for capacity management and resource allocation, implement progressive rollouts and rollback mechanisms, lead incident response and post-mortems, and harden infrastructure through tenant isolation, secrets and key management, and compliance work.

Requirements

●Expertise in site reliability engineering with experience defining, monitoring, and maintaining SLOs and SLAs
●Strong background in capacity planning, forecasting, resource allocation, and cost optimization for distributed systems
●Experience in incident response, on-call rotations, and post-mortem processes with measurable MTTR improvements
●Knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
●Proficiency with observability tools and practices including metrics, logging, tracing, and alerting (Prometheus, Grafana, ELK or similar)
●Strong understanding of infrastructure security including tenant isolation, workload isolation, and network segmentation
●Experience with secrets management, key management systems, certificate management, and secure credential rotation
●Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001 or similar)
●Experience with infrastructure-as-code, configuration management, and CI/CD pipelines
●Preferred: experience operating GPU infrastructure, AI/ML platforms, and distributed systems, along with experience in container security, chaos engineering, and cost optimization strategies

Responsibilities

●Define and maintain service level objectives and service level agreements
●Build and operate incident response systems and lead on-call rotations
●Manage capacity planning and resource allocation across distributed GPU networks
●Implement progressive rollouts, canary deployments, feature flags, and automated rollbacks
●Design monitoring and alerting systems to provide deep infrastructure visibility
●Automate capacity management and resource allocation
●Lead post-mortem processes and drive improvements to reduce MTTR
●Harden infrastructure security including tenant and workload isolation
●Implement secrets and key management and certificate rotation
●Develop compliance frameworks and security best practices

Tech Stack

capacity planning, security hardening, Infrastructure-as-Code, logging, workload isolation, incident response, rollback, key management, Prometheus, Grafana