
Software Engineer, Inference Platform

New York, NY · On-site · Full-time

Build the platform layer of our inference cloud — the OpenAI-compatible API surface, the request path, the streaming infrastructure, and the benchmarking harnesses that keep us honest. The runtime on our ASIC sits below you and is owned by our hardware partner today; your job is everything between an OpenRouter request landing on our edge and a token streaming back to the user.

This is a senior IC role on a small team. You'll own surfaces, not tickets. As we take more of the runtime in-house over the next 6-8 months, the work moves closer to the metal, but the first year is platform and serving, not kernels.

Responsibilities

  • Own the OpenAI-compatible API surface: chat completions, streaming, tool use, error semantics, and the long tail of compatibility edge cases that matter to real customers.
  • Build the request path that fronts our ASIC fleet — routing, admission control, queueing, retries, and graceful degradation.
  • Own integration with OpenRouter and other distribution partners. Their request shapes, their billing hooks, their failure modes are your problem.
  • Build the benchmarking and regression harness that catches latency and correctness drift before customers do — TTFT, inter-token latency distributions, tokens/sec under load.
  • Ship optimizations that move real metrics: TTFT, p99 latency, throughput per dollar.
  • Work directly with our hardware partner's runtime team when the bottleneck is below your layer. As we take more of the stack in-house, your scope grows to include batching, scheduling, and KV-cache work.
  • Be on the on-call rotation for what you ship.

What we're looking for

  • 5+ years writing production systems code.
  • Have built and operated a high-throughput API service before — ideally one where tail latency mattered as much as throughput.
  • Strong fundamentals in concurrency, memory, queueing, and where latency actually comes from in a distributed system.
  • Working knowledge of modern LLM inference: transformers, attention, KV cache, batching, speculative decoding, quantization. You don't need to have written a kernel, but you should know why batching changes everything about a serving stack.
  • Comfortable with a profiler. You reach for measurement before intuition.
  • Self-directed. We don't have the bandwidth to assign you tickets — you'll find the work.

Nice to have

  • Have built or operated an OpenAI-compatible API at production scale.
  • Familiar with vLLM, TGI, TensorRT-LLM, SGLang, or llama.cpp internals.
  • Kernel-level work in CUDA, Triton, or on non-NVIDIA accelerators — relevant as the role grows down-stack.
  • Have shipped streaming infrastructure (SSE, gRPC streaming, WebSockets) under real load.