Solidus Labs
DevOps Site Reliability Engineer
Job Description
[AI-summarized by JobStash]
You will own the reliability, stability, and operational support of production systems. You will lead incident response end-to-end, troubleshoot and mitigate outages, and perform deep-dive root cause analysis to drive corrective actions. You will operate production Kubernetes (EKS), manage scaling and capacity using KEDA, Karpenter, and HPA, and evolve infrastructure as code with Terraform and Helm. You will support GitLab CI/CD pipelines, design observability systems with Prometheus, Grafana and EFK, and resolve networking issues involving TLS, load balancing, VPCs, NAT and VPN. You will respond to security-related incidents, support compliance initiatives, leverage AI-powered tools for automation, and participate in on-call rotations to provide operational coverage.
Requirements
- 3+ years of hands-on DevOps / SRE experience
- Strong production experience with Docker and Kubernetes
- Solid knowledge of AWS (EKS, EC2, IAM, RDS, S3, CloudWatch, Lambda)
- Experience with monitoring, logging and alerting systems
- Proficiency with Terraform, Helm and GitLab CI
- Strong troubleshooting skills across infrastructure, CI/CD and networking
- Scripting experience with Bash and Python
- Fluent English and willingness to participate in on-call rotations
- Familiarity with pub/sub systems such as SQS, RabbitMQ or Kafka
- Nice to have: Experience with Redis, Airflow, Databricks, Spark/EMR
- Nice to have: GitOps workflows and advanced Git usage
- Nice to have: Experience supporting Postgres, Snowflake or ClickHouse
- Nice to have: Proficiency in Mandarin
Responsibilities
- Own reliability, availability and performance of production environments
- Lead incident response end-to-end including troubleshooting, mitigation and resolution
- Perform deep-dive root cause analysis and drive long-term corrective actions
- Operate production Kubernetes (EKS) including cluster upgrades and Helm deployments
- Manage scaling and capacity using KEDA, Karpenter and HPA
- Evolve infrastructure as code using Terraform and Helm following security best practices
- Support GitLab CI/CD pipelines and resolve deployment issues
- Design and maintain observability using Prometheus, Grafana and EFK
- Troubleshoot networking issues involving TLS, load balancing, VPCs, NAT and VPN
- Support compliance initiatives and respond to security-related incidents
- Leverage AI-powered tools to automate tasks and improve productivity
- Participate in on-call rotations to provide consistent operational coverage