Hyperbolic

Senior AI Infrastructure Engineer

San Francisco, CA | Full-time | Global
Mid | On-site
Active | Posted within the last 30 days

Job Description

[AI-summarized by JobStash]

You will design, build, and operate the infrastructure that transforms raw GPUs into a programmable, orchestrated pool for AI workloads. You will implement bare-metal provisioning and lifecycle management, develop GPU scheduling and placement strategies, automate provisioning with infrastructure as code, integrate storage solutions for training data, design APIs and cloud-init workflows for automated configuration, optimize GPU compute (CUDA), and work directly with hardware vendors to troubleshoot and improve integrations.

Requirements

  • Bare-metal provisioning and lifecycle management (IPMI, Redfish, BMC, PXE, automated OS deployment)
  • GPU scheduling and orchestration with GPU type awareness, memory and topology considerations
  • Experience with Terraform or Pulumi and CI/CD for infrastructure
  • Secrets management and configuration management experience
  • Observability stack implementation experience
  • Storage and data infrastructure for AI/ML including object storage, high-IOPS block storage, and distributed file systems
  • API design and cloud-init for automated provisioning
  • Solid understanding of GPU architecture, CUDA, and GPU compute optimization
  • Experience building and scaling cloud infrastructure or distributed systems in production
  • Proven ability to work with hardware vendors and vendor engineering teams
  • Strong communication skills

Responsibilities

  • Build and scale a multi-tenant GPU cloud marketplace
  • Design and implement multi-tenancy provisioning and virtualization solutions
  • Transform raw GPUs into a programmable, orchestrated resource pool
  • Implement bare-metal provisioning and lifecycle management
  • Develop GPU scheduling, placement strategies, and fragmentation minimization
  • Automate infrastructure using Terraform or Pulumi and CI/CD pipelines
  • Implement secrets management, configuration management, and observability
  • Design APIs and cloud-init workflows for automated provisioning
  • Integrate and operate storage solutions for AI/ML workloads
  • Collaborate with hardware vendors to troubleshoot and optimize integrations

Tech Stack

Ceph, block storage, Pulumi, Ansible, orchestration, bare-metal, Terraform, Redfish, CI/CD, observability