TRM Labs

Senior MLOps Engineer, LLMOps

North America · Full-time · Global

💰 USD 200,000 - 275,000/yr

📊 Mid · 🏠 Remote

Job Description

[AI-summarized by JobStash]

You will build and maintain the infrastructure and pipelines that enable production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency and performance, and run offline and online evaluations including human-in-the-loop processes. You will provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.

Requirements

●Write high-quality, maintainable software, primarily in Python
●Experience with containerization and orchestration, such as Docker and Kubernetes
●Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
●Experience with monitoring and logging frameworks, such as Datadog, Prometheus, and OpenTelemetry
●Implement MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
●Experience with scalable model and agent serving infrastructure, such as vLLM, Triton, and BentoML
●Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance and capturing traces
●Strong ownership, pragmatism, and ability to balance infrastructure elegance with iterative delivery

Responsibilities

●Build reusable CI/CD workflows for model training, evaluation, and deployment
●Automate model versioning, approval workflows, and compliance checks
●Build modular and scalable AI infrastructure, including a vector database, feature store, model registry, and observability tooling
●Embed AI models and agents into real-time applications and workflows
●Continuously evaluate and integrate state-of-the-art AI tools
●Drive AI reliability, governance, and uptime
●Ensure data accuracy, consistency, and reliability for training and inference
●Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
●Provide sandboxes, dashboards, and reproducible environments for researchers

Benefits & Perks

●Equity plan eligibility

Tech Stack

Vector database, feature store, Python, CI/CD, observability, OpenTelemetry, LLM, agent, MLOps, Prometheus