TRM Labs

Senior MLOps Engineer, LLMOps

North America | Full-time | Global

💰 USD 200,000 - 275,000/yr

📊 Mid | 🏠 Remote

Job Description

[AI-summarized by JobStash]

You will build and maintain the infrastructure and pipelines that enable production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency and performance, and run offline and online evaluations including human-in-the-loop processes. You will provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.

Requirements

  • Write high-quality, maintainable software, primarily in Python
  • Experience with containerization and orchestration, such as Docker and Kubernetes
  • Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
  • Experience with monitoring and logging frameworks, such as Datadog, Prometheus, and OpenTelemetry
  • Implement MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model and agent serving infrastructure, such as vLLM, Triton, and BentoML
  • Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance, and capturing traces
  • Strong ownership, pragmatism, and ability to balance infrastructure elegance with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Build modular and scalable AI infrastructure, including a vector database, feature store, model registry, and observability tooling
  • Embed AI models and agents into real-time applications and workflows
  • Continuously evaluate and integrate state-of-the-art AI tools
  • Drive AI reliability, governance, and uptime
  • Ensure data accuracy, consistency, and reliability for training and inference
  • Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments for researchers

Benefits & Perks

  • Equity plan eligibility

Tech Stack

Vector database, feature store, Python, CI/CD, observability, OpenTelemetry, LLM, agent, MLOps, Prometheus
Expired