Yellow Card
Senior AI Platform Engineer
Remote, Full-time, Global
Mid, Remote
Job Description
[AI-summarized by JobStash]
You will design, deploy, and operate internal AI platforms across Kubernetes (EKS), AWS Serverless, and local developer environments. You will implement observability and FinOps for LLM usage, build human-in-the-loop automation to reduce operational toil, create custom tooling and API integrations, design cost and usage dashboards, ensure reliability and security, and produce documentation and golden paths to enable product teams.
Requirements
- 7+ years of hands-on technical experience with large-scale production environments and infrastructure
- In-depth knowledge of AWS architecture, including Serverless, Lambda, and EKS, and the ability to manage diverse environments, including local developer setups
- Strong grasp of Kubernetes, microservice architecture, and CI/CD principles (GitHub Actions)
- Practical experience setting up infrastructure to run, monitor, and scale AI-driven applications or internal developer tooling
- Proven ability to learn and master new technologies quickly
- Solid understanding of performance monitoring tools and of troubleshooting complex production environments
- Proactive approach to upskilling teams and integrating cutting-edge solutions
Responsibilities
- Architect, deploy, and manage internal automation platforms and AI orchestration tools across Kubernetes (EKS), AWS Serverless, and local deployment configurations
- Implement scalable logging and monitoring for AI model usage to provide visibility into LLM expenditures and token budgets
- Build human-in-the-loop processes that use AI and automation to streamline operational workflows, including infrastructure patching and maintenance
- Leverage interface protocols such as MCP to build custom internal tools and API integrations
- Design and maintain dashboards that track operational costs and provide data-driven insights to leadership
- Execute service capacity planning and system tuning for internal AI tools to ensure high availability
- Ensure internal AI tools adhere to security standards and maintain a minimal vulnerability window in partnership with SecOps
- Create golden paths and technical documentation to democratize access to AI tooling and upskill product engineering teams
Benefits & Perks
- Flexible, fully remote work
- On-call rotations designed to maintain a healthy work-life balance
Tech Stack
Human-in-the-loop, EKS, AI, LLM, FinOps, MCP, automation, dashboard, capacity planning, Kubernetes