Hyperbolic

Senior AI Infrastructure Engineer

San Francisco, CA · Full-time · Global

📊 Mid · 🏠 On-site

Active · Posted within the last 30 days

Job Description

[AI-summarized by JobStash]

You will design, build, and operate the infrastructure that transforms raw GPUs into a programmable, orchestrated pool for AI workloads. You will implement bare-metal provisioning and lifecycle management, develop GPU scheduling and placement strategies, automate provisioning with infrastructure as code, integrate storage solutions for training data, design APIs and cloud-init workflows for automated configuration, optimize GPU compute (CUDA), and work directly with hardware vendors to troubleshoot and improve integrations.

Requirements

● Bare-metal provisioning and lifecycle management (IPMI, Redfish, BMC, PXE, automated OS deployment)
● GPU scheduling and orchestration that accounts for GPU type, memory, and topology
● Experience with Terraform or Pulumi and CI/CD for infrastructure
● Secrets management and configuration management experience
● Observability stack implementation experience
● Storage and data infrastructure for AI/ML, including object storage, high-IOPS block storage, and distributed file systems
● API design and cloud-init for automated provisioning
● Solid understanding of GPU architecture, CUDA, and GPU compute optimization
● Experience building and scaling cloud infrastructure or distributed systems in production
● Proven ability to work with hardware vendors and vendor engineering teams
● Strong communication skills

Responsibilities

● Build and scale a multi-tenant GPU cloud marketplace
● Design and implement multi-tenant provisioning and virtualization solutions
● Transform raw GPUs into a programmable, orchestrated resource pool
● Implement bare-metal provisioning and lifecycle management
● Develop GPU scheduling and placement strategies that minimize fragmentation
● Automate infrastructure using Terraform or Pulumi and CI/CD pipelines
● Implement secrets management, configuration management, and observability
● Design APIs and cloud-init workflows for automated provisioning
● Integrate and operate storage solutions for AI/ML workloads
● Collaborate with hardware vendors to troubleshoot and optimize integrations

Tech Stack

Ceph, block storage, Pulumi, Ansible, orchestration, bare-metal, Terraform, Redfish, CI/CD, observability