GPU Cluster for LLM Inference: Build vs Buy Analysis for ML Teams
At some point, every team running LLMs in production ends up asking the same question: would we save money by owning hardware? The question sounds straightforward, but the answer depends heavily on utilization patterns, team size, and how you account for costs that don't show up on a GPU invoice.
This post walks through the full total cost of ownership (TCO) picture on both sides, provides a break-even framework, and discusses hybrid approaches that work well for teams that don't fit cleanly into either camp.
Sizing Your Infrastructure First
Before comparing costs, you need a concrete usage estimate. The relevant number is tokens per day, broken into:
- Prefill tokens: the input context (prompt + documents + chat history)
- Decode tokens: the output the model generates
Most applications are decode-bound -- decode is the expensive phase because it's sequential and cannot be easily parallelized. A typical chat application might have a 4:1 or 8:1 input-to-output ratio; a summarization pipeline might be 50:1. Throughput requirements differ substantially across these use cases.
For this analysis, we'll use a concrete example: a team running Llama 3.1 70B for an internal coding assistant, targeting 100 concurrent users, with an average of 2,000 input tokens and 500 output tokens per request, at peak load.
At 100 concurrent users and 30 seconds average response time, you're looking at roughly 3-4 requests completing per second at peak. At 2,500 tokens per request, that's about 8,300 tokens per second, or roughly 240 million tokens per day if the system runs under sustained load for 8 hours.
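The sizing arithmetic can be sketched in a few lines. This is a rough estimate, assuming every user keeps exactly one request in flight, so throughput is concurrency divided by response time. The function name and parameters are illustrative and not taken from any library.

```python
def peak_token_volume(concurrent_users, response_seconds,
                      input_tokens, output_tokens, peak_hours):
    """Estimate request rate and daily token volume at sustained peak load.

    Assumes each user always has exactly one request in flight
    (Little's law: throughput = concurrency / latency).
    """
    requests_per_second = concurrent_users / response_seconds
    tokens_per_request = input_tokens + output_tokens
    tokens_per_second = requests_per_second * tokens_per_request
    tokens_per_day = tokens_per_second * peak_hours * 3600
    return requests_per_second, tokens_per_second, tokens_per_day


# Example profile from the text: 100 users, 30 s responses,
# 2,000 input + 500 output tokens, 8 hours of sustained load.
rps, tps, per_day = peak_token_volume(100, 30, 2_000, 500, peak_hours=8)
print(f"{rps:.1f} req/s, {tps:,.0f} tokens/s, {per_day / 1e6:,.0f}M tokens/day")
```

For the example profile this works out to about 3.3 requests per second and roughly 240 million tokens per day. Treat it as an upper bound, since real users rarely keep a request in flight continuously.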
The Cost of Building
Hardware
H100 SXM5 80GB GPUs, as of 2026, cost between $28,000 and $35,000 per card when purchased through standard channels (less on the secondary market, more when supply is constrained). Running Llama 3.1 70B at reasonable throughput requires the model weights to fit in GPU memory. At FP16, 70B parameters use about 140 GB -- so you need at least two H100s (160 GB combined) per replica, which leaves only about 20 GB for the KV cache and activations.
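A minimal sketch of the memory arithmetic, assuming 2 bytes per parameter at FP16 and 80 GB per GPU. It counts weights only, so the result is a floor and not a recommended configuration. The helper names are hypothetical.

```python
import math


def weights_gb(params_billions, bytes_per_param):
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def min_gpus(params_billions, bytes_per_param, gpu_memory_gb=80):
    """Smallest GPU count whose combined memory holds the weights.

    Ignores KV cache and activations, so treat the result as a floor.
    """
    needed = weights_gb(params_billions, bytes_per_param)
    return math.ceil(needed / gpu_memory_gb)


# FP16 stores each parameter in 2 bytes.
print(weights_gb(70, 2), "GB of weights ->", min_gpus(70, 2), "GPUs minimum")
```

In practice the KV cache grows with batch size and context length, so a replica sized at exactly this floor has little room for concurrent requests.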
A realistic production setup with redundancy and reasonable throughput might look like:
- 2 nodes of 8x H100 SXM5 each (16 GPUs total)
- ~$480,000 in GPU hardware alone
That's just the GPUs. A complete node requires:
- Host servers (dual-socket Xeon or EPYC, 1-2TB RAM): $25,000-$40,000 per node
- InfiniBand networking (200Gb HDR or 400Gb NDR for tight tensor parallelism across nodes): $15,000-$30,000 per node
- NVMe storage (model checkpoints, logs, datasets): $5,000-$10,000 per node
Total hardware cost for 2 nodes: roughly $570,000-$640,000 ($480,000 in GPUs plus $45,000-$80,000 per node for servers, networking, and storage) before you've touched facilities.
Facilities and Power
Data center costs vary enormously by region, but you should budget:
- Colocation space: $1,500-$3,000 per rack per month. Two dense GPU nodes fill roughly one full cabinet.
- Power: H100 SXM5 draws up to 700W each, so 16 cards = 11.2 kW for GPUs alone. A full node (with CPUs, storage, networking, and fans) runs roughly 8-10 kW under load. Two nodes: ~16-20 kW total. At $0.10-$0.12/kWh, that's about $1,200-$1,750/month in power costs for the hardware.
- Cooling overhead: facilities typically apply a PUE (Power Usage Effectiveness) multiplier of 1.3-1.6 to cover cooling and other facility overhead. Effective power cost: $1,500-$2,800/month.
- Networking egress: if you're serving external users, bandwidth costs add up. Budget $500-$2,000/month depending on traffic.
Monthly facilities and power: roughly $4,000-$7,000/month.
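The power line item can be checked with a small helper. This sketch assumes a constant load and 730 hours per month. The example uses only the GPU draw stated above (16 cards at 700 W), so it is a floor on the power bill; CPUs, networking, and storage add to it. The function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def monthly_power_cost(load_kw, usd_per_kwh, pue=1.0):
    """Monthly electricity cost for a constant load, scaled by the PUE multiplier."""
    return load_kw * HOURS_PER_MONTH * usd_per_kwh * pue


gpu_kw = 16 * 0.700  # GPUs only: 16 cards at 700 W each

# Cheapest rate with the best PUE vs. the priciest rate with the worst PUE.
low = monthly_power_cost(gpu_kw, 0.10, pue=1.3)
high = monthly_power_cost(gpu_kw, 0.12, pue=1.6)
print(f"GPU-only power with cooling overhead: ${low:,.0f} - ${high:,.0f} per month")
```

Passing the measured draw of the whole rack instead of `gpu_kw` gives the figure to budget against.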
Staff
This is where build costs get underestimated most often. Running GPU infrastructure requires:
- At minimum, one ML infrastructure engineer who can handle CUDA driver updates, NVLink diagnostics, failed GPU replacement, and vLLM/Triton configuration.
- For anything production-critical, you want a 24/7 on-call rotation, which implies at least 2 engineers.
Mid-level ML infra engineers in the US cost $180,000-$250,000 fully loaded (salary + benefits + equity + overhead). Two engineers: $360,000-$500,000/year, or $30,000-$42,000/month.
You can reduce this by using managed colocation services that handle physical operations, but someone still needs to manage the software stack, respond to incidents, and plan capacity.
Software
The open-source inference stack (vLLM, TGI, Triton Inference Server) is free, but you may need:
- Monitoring and observability tooling: $500-$2,000/month
- Model storage and versioning: $200-$500/month
- Secrets management, networking tools, CI/CD for model deployments: $500-$1,000/month
Total software overhead: roughly $1,500-$3,500/month.
Total Build Cost Summary
| Category | One-Time | Monthly |
|----------|----------|---------|
| GPU hardware | $480,000 | -- |
| Server infrastructure | $150,000 | -- |
| Facilities setup | $20,000 | $4,000-$7,000 |
| Engineering staff | -- | $30,000-$42,000 |
| Software/tools | -- | $1,500-$3,500 |
| Total | $650,000 | $35,500-$52,500 |
Annualized (year 1): approximately $1.1M-$1.3M.
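The summary totals can be reproduced from the individual rows. The sketch below copies the figures from the table; the dictionary layout and function name are illustrative choices.

```python
# (one_time, monthly_low, monthly_high) in USD, copied from the summary table.
BUILD_COSTS = {
    "GPU hardware":          (480_000, 0, 0),
    "Server infrastructure": (150_000, 0, 0),
    "Facilities setup":      (20_000, 4_000, 7_000),
    "Engineering staff":     (0, 30_000, 42_000),
    "Software/tools":        (0, 1_500, 3_500),
}


def year_one_cost(costs, months=12):
    """Return (low, high) first-year cost: one-time spend plus recurring costs."""
    one_time = sum(row[0] for row in costs.values())
    monthly_low = sum(row[1] for row in costs.values())
    monthly_high = sum(row[2] for row in costs.values())
    return one_time + months * monthly_low, one_time + months * monthly_high


low, high = year_one_cost(BUILD_COSTS)
print(f"Year 1: ${low:,} - ${high:,}")
```

Keeping the rows in one structure makes it easy to substitute your own quotes and see how the first-year total moves.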
The Cost of Buying
Managed inference APIs charge per token. Current rates for Llama 3.1 70B-class models typically run $0.35-$0.90 per million input tokens and $0.45-$1.20 per million output tokens, depending on the provider and volume tier.
For the usage profile above (about 3.3 requests per second at 2,000 input and 500 output tokens each, sustained for 8 hours), daily volume is roughly 240 million tokens: 192M input + 48M output.
- Daily: 192M input tokens * $0.50/M + 48M output tokens * $0.80/M = $96.00 + $38.40 = $134.40/day
- Monthly (30 days): ~$4,000
- Annual: ~$49,000
That is a ceiling, not a typical bill. Let's check it against a more realistic production scenario.
240 million tokens per day assumes all 100 users keep a request in flight continuously for 8 hours. Real users don't behave that way: they read, edit, and think between messages. If average session length is 2 hours and users send 10 messages per session, the token volume is much lower.
A more conservative estimate for a 100-user coding assistant at typical utilization: 500 million to 2 billion tokens per month (not per day). This is what most teams actually see outside of batch workloads.
At 1 billion tokens/month (60/40 input/output split):
- Input: 600M tokens at $0.50/M = $300
- Output: 400M tokens at $0.80/M = $320
- Monthly total: ~$620
- Annual: ~$7,400
At 10 billion tokens/month (a very high-usage team or batch processing):
- Monthly: ~$6,200
- Annual: ~$74,400
At 50 billion tokens/month (a well-established product with significant traffic):
- Monthly: ~$31,000
- Annual: ~$372,000
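The three scenarios above follow the same formula, sketched here with the example rates of $0.50 per million input tokens and $0.80 per million output tokens and a 60/40 input/output split. The function name and defaults are illustrative; substitute your provider's actual rates.

```python
def monthly_api_cost(total_tokens, input_share,
                     input_rate=0.50, output_rate=0.80):
    """Monthly managed-API cost in USD; rates are USD per million tokens."""
    millions = total_tokens / 1e6
    blended_rate = input_share * input_rate + (1 - input_share) * output_rate
    return millions * blended_rate


for volume in (1e9, 10e9, 50e9):
    monthly = monthly_api_cost(volume, input_share=0.6)
    print(f"{volume / 1e9:>4.0f}B tokens/month: "
          f"${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
```

Because output tokens cost more than input tokens, the split matters: shifting the same volume toward output raises the bill.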
Break-Even Analysis
To break even on owning hardware, your monthly API spend needs to exceed your monthly amortized ownership cost. With $650,000 in upfront hardware amortized over 3 years, plus $35,500-$52,500 in monthly operating costs (about $43,000 is used below as a representative figure):
Monthly ownership cost = ($650,000 / 36) + $43,000 = $18,000 + $43,000 = ~$61,000/month
At $0.65/M tokens (blended input/output rate), break-even requires:
$61,000 / $0.65 per million tokens = ~94 billion tokens per month
That's roughly 3 billion tokens per day, sustained, at this hardware scale. Most teams aren't there. This is the equivalent of running a mid-size commercial LLM product with meaningful traffic -- not an internal tool or early-stage application.
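The break-even calculation is easy to rerun with your own numbers. A minimal sketch, assuming straight-line amortization and a single blended per-token rate; the function and parameter names are hypothetical.

```python
def break_even(capex, amortization_months, monthly_opex, blended_rate_per_million):
    """Return (monthly ownership cost in USD, break-even tokens per month).

    Uses straight-line amortization of the upfront hardware cost.
    """
    monthly_cost = capex / amortization_months + monthly_opex
    tokens_per_month = monthly_cost / blended_rate_per_million * 1e6
    return monthly_cost, tokens_per_month


cost, tokens = break_even(650_000, 36, 43_000, 0.65)
print(f"Ownership: ${cost:,.0f}/month")
print(f"Break-even: {tokens / 1e9:.0f}B tokens/month, "
      f"{tokens / 30 / 1e9:.1f}B tokens/day")
```

The result is sensitive to the blended rate: if a provider discounts volume pricing, the break-even volume rises in proportion.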
The math shifts when you:
- Achieve high GPU utilization (>70%). If your GPUs sit idle 50% of the time, the effective per-token cost doubles.
- Need larger models or batch workloads. If you're running 100B+ parameter models or processing large document batches overnight, the economics improve.
- Require on-premises for compliance. If your data cannot leave a specific environment (HIPAA, FedRAMP, air-gapped), the calculation is moot -- you have to own hardware regardless of cost.
- Run at sustained scale for 3+ years. Break-even improves once the hardware is fully amortized. With a 3-year schedule, monthly cost drops significantly from year 4 onward, when only operating costs remain.
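The utilization effect in the first point can be made concrete. In this sketch the cluster's peak capacity of 100 billion tokens per month is a made-up placeholder, not a measured figure; only the ratio between the two results matters.

```python
def cost_per_million_tokens(monthly_cost, peak_tokens_per_month, utilization):
    """Effective USD per million tokens actually served on owned hardware."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_millions = peak_tokens_per_month * utilization / 1e6
    return monthly_cost / served_millions


MONTHLY_COST = 61_000   # ownership cost from the break-even estimate
PEAK_CAPACITY = 100e9   # hypothetical tokens/month at 100% utilization

for utilization in (1.0, 0.7, 0.5):
    rate = cost_per_million_tokens(MONTHLY_COST, PEAK_CAPACITY, utilization)
    print(f"{utilization:.0%} utilization: ${rate:.2f} per million tokens")
```

Fixed costs stay the same whatever the traffic, so halving utilization doubles the effective per-token price.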
A rough rule of thumb: if your monthly managed inference bill is consistently above $30,000-$40,000 and you expect that to continue for at least two years, a detailed build analysis is worth doing. Below that threshold, managed APIs almost certainly win on total cost when you include engineering time.
Hidden Costs of Building That Usually Get Underestimated
Procurement timelines
H100 and H200 GPUs still have multi-month lead times in most configurations. Ordering today might mean waiting 4-9 months for delivery. Meanwhile, your team either pays for managed inference anyway, delays the product, or makes architecture decisions around constrained capacity.
Utilization is harder than it looks
Running GPUs at 70%+ utilization is genuinely difficult. Training workloads are bursty. Inference traffic follows diurnal patterns. A cluster sized for peak demand sits idle during off-peak hours. Idle GPUs don't reduce your hardware amortization, staff costs, or colocation bill.
The software maintenance burden
vLLM, TGI, and similar frameworks release updates frequently. CUDA driver compatibility issues appear regularly, especially when you add new model architectures. Keeping the inference stack healthy and up-to-date is ongoing work that doesn't show up in initial estimates.
Incident response
When a GPU fails (and they do fail), you lose inference capacity until the card is replaced. Lead times for replacement hardware can be days or weeks. Managed APIs handle this transparently.
Hybrid Approaches
Many teams land on a hybrid model that captures the advantages of each approach:
Steady-state traffic on owned hardware, burst on managed APIs. If you have predictable baseline traffic that justifies cluster ownership, run that on-prem or in a dedicated cloud instance. Handle traffic spikes using managed APIs. This requires your inference client to support routing across providers but keeps utilization high on owned hardware.
Development and staging on managed APIs, production on owned hardware. Avoids provisioning dedicated GPUs for non-production environments, which are often underutilized.
Batch jobs on spot/preemptible cloud GPUs, real-time serving on managed APIs. Embedding generation, document indexing, and offline evaluations can tolerate interruptions. Real-time user-facing inference cannot. Use spot pricing for interruption-tolerant workloads and reliable managed APIs for user traffic.
Reserved capacity from a provider rather than owning hardware. Several inference providers offer reserved instance pricing with significant discounts over on-demand rates. This gives you predictable costs and capacity guarantees without the procurement, facilities, and staffing burden of self-hosting. For teams in the $10,000-$50,000/month range, this often provides the best economics.
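The first approach above (steady-state traffic on owned hardware, burst on managed APIs) needs a routing decision in the inference client. The sketch below shows one simple policy: fill owned capacity first and overflow the rest. The class and backend names are hypothetical, it is not thread-safe, and a real client would also need health checks and retries.

```python
from dataclasses import dataclass


@dataclass
class HybridRouter:
    """Send requests to owned hardware until it is full, then overflow."""
    owned_capacity: int   # max concurrent requests the owned cluster should take
    in_flight: int = 0    # requests currently running on owned hardware

    def acquire(self) -> str:
        """Pick a backend for a new request and reserve a slot if it is owned."""
        if self.in_flight < self.owned_capacity:
            self.in_flight += 1
            return "owned"
        return "managed"

    def release(self, backend: str) -> None:
        """Free the owned slot when a request that used it finishes."""
        if backend == "owned" and self.in_flight > 0:
            self.in_flight -= 1


router = HybridRouter(owned_capacity=2)
backends = [router.acquire() for _ in range(3)]
print(backends)  # the third request overflows to the managed API
```

Filling owned hardware first keeps its utilization high, which is the main lever on its per-token cost.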
When Building Actually Makes Sense
The cases where owning hardware comes out ahead:
- Compliance-driven on-premises requirements (healthcare, defense, financial services with strict data residency policies)
- Very large-scale batch workloads where you can sustain >80% GPU utilization across the full cluster
- Custom ASIC or specialized hardware that isn't available through any managed provider
- Academic or research institutions that receive hardware grants or have existing data center infrastructure with favorable power rates
- Products with >$50,000/month in managed inference spend and a team capable of running infrastructure reliably
A Decision Framework
Answer these questions in order:
- Do you have a compliance reason to self-host? If yes, the analysis ends here. Build (or use a compliant private cloud).
- What is your current monthly managed inference spend? If under $15,000, managed APIs almost certainly win. If over $50,000 sustained, do the full TCO calculation.
- Do you have an ML infrastructure team already? If not, add the hiring cost and ramp time to the build column.
- What is your expected GPU utilization? Below 60%, managed APIs are very hard to beat on cost.
- How long is your planning horizon? Hardware investments need at least 2-3 years to pay off. If you're uncertain about scale or model architecture, managed APIs preserve optionality.
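The questions above can be encoded as a small function. This is one possible reading of the framework, not a definitive rule: the text gives no verdict for spend between $15,000 and $50,000, so that branch is an assumption, as is treating a horizon under 2 years as too short. All names are illustrative.

```python
def recommend(compliance_required, monthly_spend_usd, has_infra_team,
              expected_utilization, horizon_years):
    """Walk the decision questions in order; return (recommendation, notes)."""
    notes = []
    if compliance_required:
        return "build or use a compliant private cloud", notes
    if monthly_spend_usd < 15_000:
        return "managed APIs", notes
    if not has_infra_team:
        notes.append("add hiring cost and ramp time to the build column")
    if expected_utilization < 0.60:
        return "managed APIs", notes
    if horizon_years < 2:  # assumption: under 2 years is too short to pay off
        return "managed APIs", notes
    if monthly_spend_usd > 50_000:
        return "run the full TCO calculation", notes
    # Assumption: no verdict is given for this band in the framework.
    return "managed or reserved capacity; revisit as spend grows", notes


print(recommend(False, 80_000, False, 0.8, 3))
```

The order matters: compliance short-circuits everything else, and low spend rules out building before utilization or planning horizon are considered.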
Conclusion
For most teams, managed inference APIs provide better economics than owning GPU clusters until usage reaches a scale that very few internal applications hit. The capital requirements, procurement delays, staffing overhead, and utilization challenges combine to make self-hosting more expensive than it appears in a back-of-envelope GPU price comparison.
The teams that genuinely benefit from owning hardware are running at significant scale, have dedicated ML infrastructure engineers, can sustain high utilization, and have either compliance requirements or long enough planning horizons to absorb the upfront costs.
If you're evaluating inference providers and want to understand whether your usage profile might eventually justify dedicated capacity, General Compute offers reserved capacity tiers alongside standard on-demand pricing -- you can start on-demand and move to reserved as your traffic patterns stabilize. The API is OpenAI-compatible, so switching costs are low if you want to experiment.