Hyperbolic
Senior AI Infrastructure Engineer
San Francisco, CA | Full-time | Global
Mid | On-site
Active | Posted within the last 30 days
Job Description
[AI-summarized by JobStash]
You will design, build, and operate the infrastructure that transforms raw GPUs into a programmable, orchestrated pool for AI workloads. You will implement bare-metal provisioning and lifecycle management, develop GPU scheduling and placement strategies, automate provisioning with infrastructure as code, integrate storage solutions for training data, design APIs and cloud-init workflows for automated configuration, optimize GPU compute (CUDA), and work directly with hardware vendors to troubleshoot and improve integrations.
Requirements
- Bare-metal provisioning and lifecycle management (IPMI, Redfish, BMC, PXE, automated OS deployment)
- GPU scheduling and orchestration with GPU type awareness, memory and topology considerations
- Experience with Terraform or Pulumi and CI/CD for infrastructure
- Secrets management and configuration management experience
- Observability stack implementation experience
- Storage and data infrastructure for AI/ML including object storage, high-IOPS block storage, and distributed file systems
- API design and cloud-init for automated provisioning
- Solid understanding of GPU architecture, CUDA, and GPU compute optimization
- Experience building and scaling cloud infrastructure or distributed systems in production
- Proven ability to work with hardware vendors and vendor engineering teams
- Strong communication skills
Responsibilities
- Build and scale a multi-tenant GPU cloud marketplace
- Design and implement multi-tenancy provisioning and virtualization solutions
- Transform raw GPUs into a programmable, orchestrated resource pool
- Implement bare-metal provisioning and lifecycle management
- Develop GPU scheduling, placement strategies, and fragmentation minimization
- Automate infrastructure using Terraform or Pulumi and CI/CD pipelines
- Implement secrets management, configuration management, and observability
- Design APIs and cloud-init workflows for automated provisioning
- Integrate and operate storage solutions for AI/ML workloads
- Collaborate with hardware vendors to troubleshoot and optimize integrations
Tech Stack
Ceph, block storage, Pulumi, Ansible, orchestration, bare-metal, Terraform, Redfish, CI/CD, observability