TRM Labs
Senior MLOps Engineer, LLMOps
North America | Full-time | Global
USD 200,000 - 275,000/yr
Mid | Remote
Job Description
[AI-summarized by JobStash]
You will build and maintain the infrastructure and pipelines that enable production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency and performance, and run offline and online evaluations including human-in-the-loop processes. You will provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.
Requirements
- Write high-quality, maintainable software, primarily in Python
- Experience with containerization and orchestration, such as Docker and Kubernetes
- Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
- Experience with monitoring and logging frameworks, such as Datadog, Prometheus, and OpenTelemetry
- Implement MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
- Experience with scalable model and agent serving infrastructure, such as vLLM, Triton, and BentoML
- Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance, and capturing traces
- Strong ownership, pragmatism, and the ability to balance infrastructure elegance with iterative delivery
Responsibilities
- Build reusable CI/CD workflows for model training, evaluation, and deployment
- Automate model versioning, approval workflows, and compliance checks
- Build modular and scalable AI infrastructure, including a vector database, feature store, model registry, and observability tooling
- Embed AI models and agents into real-time applications and workflows
- Continuously evaluate and integrate state-of-the-art AI tools
- Drive AI reliability, governance, and uptime
- Ensure data accuracy, consistency, and reliability for training and inference
- Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
- Provide sandboxes, dashboards, and reproducible environments for researchers
Benefits & Perks
- Equity plan eligibility
Tech Stack
Vector database, feature store, Python, CI/CD, observability, OpenTelemetry, LLM, agent, MLOps, Prometheus