TRM Labs

Senior or Staff AI Infrastructure Engineer

North America | Full-time | Global

💰 USD 200,000 - 275,000/yr

📊 Senior | 🏠 Remote

Job Description

[AI-summarized by JobStash]

You will design, build, and operate the infrastructure that supports large-scale AI and agent systems. You will create reusable CI/CD workflows for training, evaluation, and deployment; automate model versioning, approvals, and compliance checks; and assemble modular stacks including vector databases, feature stores, and model registries. You will integrate and evaluate cutting-edge LLM tools, instrument observability and monitoring, and deploy online and offline evaluation pipelines with regression testing, cost monitoring, and human-in-the-loop workflows. You will collaborate with engineers and data scientists to embed models and agents into real-time applications, provide sandboxes and reproducible environments for researchers, and continuously improve model performance, reliability, and governance.

Requirements

●Write high-quality, maintainable software, primarily in Python
●Strong background in scalable infrastructure, including Docker and Kubernetes
●Experience with infrastructure as code and deployment tools such as Terraform and CI/CD pipelines
●Familiarity with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
●Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
●Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
●Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
●Experience capturing traces to analyze, debug, and optimize prompt-response flows with real-time data access
●Strong ownership, pragmatism, and the ability to balance infrastructure design with iterative delivery

Responsibilities

●Build reusable CI/CD workflows for model training, evaluation, and deployment
●Automate model versioning, approval workflows, and compliance checks
●Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
●Partner with engineering and data science to embed AI models and agents into real-time applications and workflows
●Continuously evaluate and integrate state-of-the-art AI tools and frameworks
●Drive AI reliability and governance to ensure compliance, security, and uptime
●Ensure data accuracy, consistency, and reliability for model training and inference
●Deploy infrastructure to support offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
●Enable researchers with sandboxes, dashboards, and reproducible environments
●Improve AI and ML model performance

Benefits & Perks

●Remote work
●Eligibility to participate in TRM's equity plan

Tech Stack

Vector database, feature store, experiment tracking, model training, Python, CI/CD, observability, model deployment, GitHub Actions, OpenTelemetry