TRM Labs

Machine Learning Infrastructure Engineer

San Francisco, CA | Full-time | Global

📊 Mid | 🏠 Hybrid

Job Description

[AI-summarized by JobStash]

You will design, build, and operate GPU-backed infrastructure to run production ML and LLM workloads. You will optimize inference systems for throughput and cost, implement model optimization and compilation workflows, and support distributed inference patterns such as model and tensor parallelism. You will schedule heterogeneous workloads across accelerators, instrument systems for GPU load, memory, batching, and token throughput, and work with engineering and ML teams to transition models from experimentation to reliable production services.

Requirements

● Bachelor's degree or equivalent in Computer Science or a related field
● 5+ years of experience building and operating distributed systems or infrastructure in production
● Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP)
● Deep understanding of high-throughput inference systems, including batching strategies and token throughput optimization
● Experience with ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum
● Experience optimizing GPU load and memory efficiency and resolving production performance bottlenecks
● Familiarity with distributed inference strategies, including model parallelism and tensor parallelism
● Experience working with Kubernetes or equivalent orchestration systems
● Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus
● CUDA familiarity and experience debugging GPU-related issues are a plus
● Adaptable and autonomous, with excellent communication and collaboration skills

Responsibilities

● Design and operate GPU cluster infrastructure
● Optimize high-throughput inference
● Enable distributed inference strategies
● Implement model optimization and compilation workflows
● Schedule heterogeneous workloads
● Build observability into ML infrastructure
● Partner across engineering teams to transition models to production

Tech Stack

autoscaling, CUDA, parallel computing, AWS, observability, Inference, FlashAttention, tensor parallelism, token throughput, performance engineering