TRM Labs

Senior or Staff AI Infrastructure Engineer

North America · Full-time · Global

💰 USD 200,000 - 275,000/yr

📊 Senior · 🏠 Remote

Job Description

[AI-summarized by JobStash]

You will design, build, and operate the infrastructure that supports large-scale AI and agent systems. You will create reusable CI/CD workflows for training, evaluation, and deployment; automate model versioning, approvals, and compliance checks; and assemble modular stacks including vector databases, feature stores, and model registries. You will integrate and evaluate cutting-edge LLM tools, instrument observability and monitoring, and deploy online and offline evaluation pipelines with regression testing, cost monitoring, and human-in-the-loop workflows. You will collaborate with engineers and data scientists to embed models and agents into real-time applications, provide sandboxes and reproducible environments for researchers, and continuously improve model performance, reliability, and governance.

Requirements

  • Write high-quality, maintainable software, primarily in Python
  • Strong background in scalable infrastructure, including Docker and Kubernetes
  • Experience with infrastructure as code and deployment tools such as Terraform and CI/CD pipelines
  • Familiarity with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
  • Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
  • Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
  • Experience capturing traces for analysis, debugging, and optimization of prompt-response flows with real-time data access
  • Strong ownership, pragmatism, and ability to balance infrastructure design with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
  • Partner with engineering and data science to embed AI models and agents into real-time applications and workflows
  • Continuously evaluate and integrate state-of-the-art AI tools and frameworks
  • Drive AI reliability and governance to ensure compliance, security, and uptime
  • Ensure data accuracy, consistency, and reliability for model training and inference
  • Deploy infrastructure to support offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Enable researchers with sandboxes, dashboards, and reproducible environments
  • Improve AI and ML model performance

Benefits & Perks

  • Remote work
  • Eligibility to participate in TRM's equity plan

Tech Stack

Vector database, feature store, experiment tracking, model training, Python, CI/CD, observability, model deployment, GitHub Actions, OpenTelemetry
Expired