Inference Stack

Software Engineer, Inference

San Francisco, CA · On-site · Full-time

Build the inference runtime on top of our ASIC hardware: batching, KV cache, scheduling, and the OpenAI-compatible API surface.

Responsibilities

  • Build and improve the inference runtime that serves our ASIC hardware.
  • Own pieces of the serving stack: scheduling, continuous batching, KV cache, and prefill/decode separation.
  • Ship optimizations that move tokens/sec, time to first token (TTFT), p99 latency, and cost per token.
  • Work directly with the hardware and compiler teams when kernels or fused ops need to change.
  • Maintain the OpenAI-compatible API surface: chat completions, streaming, tool use.
  • Write the benchmarks and regression harnesses that catch latency and correctness drift.
  • Be on the on-call rotation for what you ship.

What we're looking for

  • 3+ years writing production systems code in Rust, C++, Go, or performance-oriented Python.
  • Solid fundamentals in concurrency, memory management, and tail latency.
  • Familiarity with modern LLM inference: transformers, attention, KV cache, batching, speculative decoding, quantization.
  • Comfortable optimizing with a profiler in hand.

Nice to have

  • Experience with vLLM, TGI, TensorRT-LLM, SGLang, or llama.cpp.
  • CUDA / Triton / kernel-level work, or experience with non-NVIDIA accelerators.
  • Built or operated an OpenAI-compatible API at scale.