Hyperbolic
Senior Site Reliability Engineer
San Francisco, CA | Full-time | Global
Mid
Active | Posted within the last 30 days
Job Description
[AI-summarized by JobStash]
You will ensure that the GPU marketplace and AI infrastructure run with high reliability, performance, and security. You will define and maintain SLOs, design monitoring and alerting, build automation for capacity management and resource allocation, implement progressive rollouts and rollback mechanisms, lead incident response and post-mortems, and harden infrastructure through tenant isolation, secrets and key management, and compliance work.
Requirements
- Expertise in site reliability engineering with experience defining, monitoring, and maintaining SLOs and SLAs
- Strong background in capacity planning, forecasting, resource allocation, and cost optimization for distributed systems
- Experience in incident response, on-call rotations, and post-mortem processes with measurable MTTR improvements
- Knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
- Proficiency with observability tools and practices including metrics, logging, tracing, and alerting (Prometheus, Grafana, ELK or similar)
- Strong understanding of infrastructure security including tenant isolation, workload isolation, and network segmentation
- Experience with secrets management, key management systems, certificate management, and secure credential rotation
- Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001 or similar)
- Experience with infrastructure-as-code, configuration management, and CI/CD pipelines
- Preferred: experience operating GPU infrastructure, AI/ML platforms, and distributed systems, plus familiarity with container security, chaos engineering, and cost optimization strategies
Responsibilities
- Define and maintain service level objectives and service level agreements
- Build and operate incident response systems and lead on-call rotations
- Manage capacity planning and resource allocation across distributed GPU networks
- Implement progressive rollouts, canary deployments, feature flags, and automated rollbacks
- Design monitoring and alerting systems to provide deep infrastructure visibility
- Automate capacity management and resource allocation
- Lead post-mortem processes and drive improvements to reduce MTTR
- Harden infrastructure security including tenant and workload isolation
- Implement secrets and key management and certificate rotation
- Develop compliance frameworks and security best practices
Tech Stack
capacity planning, security hardening, Infrastructure-as-Code, logging, workload isolation, incident response, rollback, key management, Prometheus, Grafana