Metagov
Founding Infrastructure Engineer
USD 50,000/yr
Job Description
[AI-summarized by JobStash]
You will be the technical owner of the platform's operational backbone. You will harden the platform for major launches, perform load testing, and build fallback routing and per-agent monitoring. You will implement end-to-end observability and integrated trace analysis across heterogeneous infrastructure, ship downtime warnings and fallback behavior, and implement routing transparency and endpoint provenance so users can verify which backend served their inference. You will improve performance of public endpoints, integrate programmatic infrastructure interfaces such as an MCP server, and make the utility more transparent and contributable. You will set priorities autonomously, operate production inference and ML serving infrastructure, and coordinate with cloud providers, HPC centers, and other infrastructure partners. Occasional travel for team workshops may be required.
Requirements
- Significant experience operating production inference or ML serving infrastructure (vLLM, model routing, multi-region deployments, GPU-backed services)
- Strong distributed systems and SRE instincts, including observability, incident response, fallback design, and capacity planning
- Comfort working across heterogeneous infrastructure partners, including cloud providers and HPC centers
- Experience orchestrating many stacks and integrating open-source projects
- Maintainer and integrator experience with pride in operational excellence
- Ability to work autonomously in a small team and travel occasionally for workshops
Responsibilities
- Harden platform for launches
- Perform load testing
- Build fallback routing
- Set up per-agent monitoring
- Build end-to-end observability across stacks
- Ship downtime warnings and fallback behavior
- Implement routing transparency and endpoint provenance
- Improve production service performance
- Integrate MCP server or programmatic infrastructure interfaces
- Make infrastructure transparent and contributable
- Operate and maintain production inference and ML serving infrastructure
- Coordinate with heterogeneous infrastructure partners
- Orchestrate and integrate multiple open-source stacks