Own the hardware and DevOps side of our inference stack. You'll set infrastructure strategy, run GPU / ASIC load balancing and model placement, and keep end-to-end latency as low as physics allows.
Responsibilities
Lead infrastructure strategy for our inference fleet, from rack layout and power to the load balancer that fronts it.
Own GPU / ASIC load balancing and model placement across racks, driven by live utilization and tail latency.
Drive end-to-end inference latency down across the full client-to-token path.
Own the physical layer: rack density, power, cooling, cabling, and top-of-rack fabric.
Lead DevOps and SRE: observability, deployment, on-call, and incident response for the production fleet.
Partner with ASIC vendors and firmware teams on bring-up, drivers, and hardware qualification.
Hire and grow the infrastructure team.
What we're looking for
7+ years in infrastructure, SRE, or platform engineering at scale.
Deep experience operating large GPU or accelerator fleets in production.
Hands-on expertise with load balancing and scheduling that target utilization and tail latency, not just request rate.
Strong grasp of rack-level topology (fabric, PCIe, NUMA, top-of-rack networking) and how it shows up in latency.
Comfortable at the hardware boundary: firmware, drivers, thermals, and power distribution.
Track record of leading engineering teams and owning production on-call.
Nice to have
Experience with custom inference ASICs, TPUs, or non-NVIDIA accelerators.
Background in large-scale model serving (vLLM, TGI, TensorRT-LLM, or custom runtimes).
Network fabric design at data center scale (RoCE, InfiniBand).