Inference Stack

Software Engineer, Inference

San Francisco, CA · On-site · Full-time

Build the inference runtime on top of our ASIC hardware: batching, KV cache, scheduling, and the OpenAI-compatible API surface.

Responsibilities

  • Build and improve the inference runtime that serves our ASIC hardware.
  • Own pieces of the serving stack: scheduling, continuous batching, KV cache, and prefill/decode separation.
  • Ship optimizations that move tokens/sec, time to first token (TTFT), p99 latency, and cost per token.
  • Work directly with the hardware and compiler teams when kernels or fused ops need to change.
  • Maintain the OpenAI-compatible API surface: chat completions, streaming, tool use.
  • Write the benchmarks and regression harnesses that catch latency and correctness drift.
  • Be on the on-call rotation for what you ship.

What we're looking for

  • 3+ years writing production systems code in Rust, C++, Go, or performance-oriented Python.
  • Solid fundamentals in concurrency, memory management, and tail latency.
  • Familiarity with modern LLM inference: transformers, attention, KV cache, batching, speculative decoding, quantization.
  • Comfortable optimizing with a profiler in hand.

Nice to have

  • Experience with vLLM, TGI, TensorRT-LLM, SGLang, or llama.cpp.
  • CUDA / Triton / kernel-level work, or experience with non-NVIDIA accelerators.
  • Built or operated an OpenAI-compatible API at scale.