TRM Labs
Machine Learning Infrastructure Engineer
San Francisco, CA | Full-time | Global
Mid | Hybrid
Job Description
[AI-summarized by JobStash]
You will design, build, and operate GPU-backed infrastructure to run production ML and LLM workloads. You will optimize inference systems for throughput and cost, implement model optimization and compilation workflows, and support distributed inference patterns such as model and tensor parallelism. You will schedule heterogeneous workloads across accelerators, instrument systems for GPU load, memory, batching, and token throughput, and work with engineering and ML teams to transition models from experimentation to reliable production services.
Requirements
- Bachelor's degree or equivalent in Computer Science or related field
- 5+ years of experience building and operating distributed systems or infrastructure in production
- Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP)
- Deep understanding of high-throughput inference systems including batching strategies and token throughput optimization
- Experience with ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum
- Experience optimizing GPU load, memory efficiency, and production performance bottlenecks
- Familiarity with distributed inference strategies including model parallelism and tensor parallelism
- Experience working with Kubernetes or equivalent orchestration systems
- Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus
- CUDA familiarity and experience debugging GPU-related issues are a plus
- Adaptable and autonomous with excellent communication and collaboration skills
Responsibilities
- Design and operate GPU cluster infrastructure
- Optimize high-throughput inference
- Enable distributed inference strategies
- Implement model optimization and compilation workflows
- Schedule heterogeneous workloads
- Build observability into ML infrastructure
- Partner across engineering teams to transition models to production
Tech Stack
autoscaling, CUDA, parallel computing, AWS, observability, Inference, FlashAttention, tensor parallelism, token throughput, performance engineering