The mechanics of inference cost
How compute utilization and data movement determine inference costs.
Inference cost comes down to how well compute is utilized and how much data has to move. Whether you're analyzing prompt economics or the Levelized Cost of Inference (LCOI), the same two rules govern the outcome: wasted compute is still paid for, and data movement sets the ceiling on performance.
Prompt economics
Tool calls function as a cost multiplier. Each tool call triggers another model request that re-sends the context window: the prompt, system instructions, and full conversation history. Because that history is rarely cached, each re-send is billed as fresh, uncached input. A session with 3–4 tool calls per turn can consume 3–5× the input tokens of a standard chat. Reducing tool calls is the most direct lever on per-turn cost; batching or merging tools saves more than the merged tool's own tokens because each merge eliminates a full context re-send, and that re-send grows as the conversation gets longer.
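The re-send arithmetic can be sketched in a few lines. This is a minimal model under stated assumptions, not a billing formula for any provider: nothing is cached, each tool call adds exactly one follow-up request, and the function name, the 10,000-token context, and the 500-token tool results are all hypothetical.

```python
def input_tokens_per_turn(context_tokens: int, tool_calls: int,
                          tool_result_tokens: int = 0) -> int:
    """Input tokens billed for one turn, assuming nothing is cached.

    The turn makes one initial request plus one follow-up request per tool
    call. Every request re-sends the context, and each follow-up also carries
    the tool results accumulated so far.
    """
    total = 0
    for request_index in range(tool_calls + 1):
        total += context_tokens + request_index * tool_result_tokens
    return total


# Hypothetical numbers: a 10,000-token context and 500-token tool results.
plain_chat = input_tokens_per_turn(10_000, tool_calls=0)
three_tools = input_tokens_per_turn(10_000, tool_calls=3, tool_result_tokens=500)
two_tools = input_tokens_per_turn(10_000, tool_calls=2, tool_result_tokens=500)

print(three_tools / plain_chat)  # 4.3
print(three_tools - two_tools)   # 11500
```

Under these assumptions, merging two of the three tools into one removes 11,500 input tokens from the turn, which is more than the entire context of the plain chat request.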
Caching is an amortization problem. Sending large system prompts or tool schemas incurs a high prefix cost on the first turn, one that pays off only if the session runs long enough to recover it through cheap cache reads. For sessions of one or two turns, the upfront cache write often costs more than a fresh uncached request. The crossover point depends on prefix size, session length, and how your provider prices cache writes and reads. There is no universal answer; model it for your workflow.
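One way to model the crossover is to express cache writes and cache reads as multiples of the normal input price. The sketch below assumes the prefix is written once and read on every later request, and that the cache never expires between requests. The multipliers are hypothetical placeholders; substitute your provider's actual rates.

```python
import math


def cached_cost(prefix_tokens: int, requests: int, price: float,
                write_mult: float, read_mult: float) -> float:
    """Prefix cost with caching: one cache write, then cache reads."""
    if requests <= 0:
        return 0.0
    return prefix_tokens * price * (write_mult + (requests - 1) * read_mult)


def uncached_cost(prefix_tokens: int, requests: int, price: float) -> float:
    """Prefix cost when every request pays the full input price."""
    return prefix_tokens * price * requests


def breakeven_requests(write_mult: float, read_mult: float) -> int:
    """Smallest request count at which caching is strictly cheaper.

    Solves write_mult + (n - 1) * read_mult < n for n, which requires
    read_mult < 1.
    """
    if read_mult >= 1:
        raise ValueError("cache reads must be cheaper than uncached input")
    threshold = (write_mult - read_mult) / (1 - read_mult)
    return max(1, math.floor(threshold) + 1)


# Hypothetical multipliers, not any provider's published pricing.
print(breakeven_requests(write_mult=2.0, read_mult=0.1))   # 3
print(breakeven_requests(write_mult=1.25, read_mult=0.1))  # 2
```

In this simplified model the prefix size cancels out of the break-even count and only scales the money at stake. Real deployments add effects the sketch ignores, such as cache expiry between turns and minimum cacheable prefix lengths.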
Thinking tokens are a luxury. They are billed at output rates and are not cached. Enabling extended thinking means paying for fresh, high-compute generation on every turn, a cost that caching cannot offset and that can quickly dominate total session cost. Reserve thinking tokens for tasks that genuinely require multi-step reasoning that zero-shot prompting cannot handle.
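A quick way to see how fast thinking tokens can dominate is to compute their share of a single turn's bill. The sketch below assumes thinking tokens are billed at the same rate as visible output, as described above; the token counts and the 5:1 output-to-input price ratio are hypothetical.

```python
def turn_cost(input_tokens: int, output_tokens: int, thinking_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost of one turn, with thinking tokens billed at the output rate."""
    return (input_tokens * input_price
            + (output_tokens + thinking_tokens) * output_price)


def thinking_share(input_tokens: int, output_tokens: int, thinking_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Fraction of the turn's cost attributable to thinking tokens."""
    total = turn_cost(input_tokens, output_tokens, thinking_tokens,
                      input_price, output_price)
    return thinking_tokens * output_price / total if total else 0.0


# Hypothetical turn: 2,000 input tokens, 300 visible output tokens,
# 2,000 thinking tokens, output priced at 5x input (arbitrary price units).
share = thinking_share(2_000, 300, 2_000, input_price=1.0, output_price=5.0)
print(f"{share:.0%}")  # 74%
```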
The Levelized Cost of Inference (LCOI)
Utilization is the dominant lever. A cluster at 40% utilization runs at roughly double the per-token cost of the same cluster at 80%, because the same fixed hardware and facility costs are spread over half as many tokens. Before upgrading hardware or renegotiating electricity tariffs, close the utilization gap through traffic shaping, batch consolidation, or sharing clusters across workloads.
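The doubling follows directly from dividing a fixed hourly cost by the tokens actually served. A minimal sketch, assuming the cluster's hourly cost does not change with load (it ignores the modest drop in power draw at low utilization); the cost and throughput figures are hypothetical.

```python
def cost_per_million_tokens(hourly_cost: float, peak_tokens_per_second: float,
                            utilization: float) -> float:
    """Per-token cost when a fixed hourly cost is spread over served tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical cluster: 100 currency units per hour, 50,000 tokens/s at peak.
at_40 = cost_per_million_tokens(100.0, 50_000, utilization=0.4)
at_80 = cost_per_million_tokens(100.0, 50_000, utilization=0.8)
print(round(at_40 / at_80, 6))  # 2.0
```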
Prefill and decode are distinct compute regimes. Prefill processes the prompt in a single parallel forward pass: fast, compute-bound. Decode generates output one token at a time: slow, memory-bandwidth-bound. Output tokens cost more to produce than input tokens because of this asymmetry. If your application is input-heavy, its cost structure differs materially from that of a decode-heavy one, and hardware selection should follow. On decode-heavy chat workloads, a GPU with higher memory bandwidth will typically outperform one with more raw FLOPS.
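A back-of-envelope estimate makes the asymmetry concrete. The sketch below uses two common rules of thumb as assumptions: a forward pass costs about 2 FLOPs per parameter per token, and at batch size 1 each decoded token requires reading every weight from memory once. It ignores attention cost, KV cache traffic, and batching, and the model and accelerator figures are hypothetical.

```python
def prefill_seconds(params: float, prompt_tokens: int,
                    flops_per_second: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / flops_per_second


def decode_seconds_per_token(params: float, bytes_per_param: float,
                             bytes_per_second: float) -> float:
    """Bandwidth-bound estimate: read all weights once per generated token."""
    return params * bytes_per_param / bytes_per_second


# Hypothetical setup: 7e9 parameters in 16-bit precision on an accelerator
# with 4e14 FLOP/s of compute and 2e12 bytes/s of memory bandwidth.
prefill = prefill_seconds(7e9, prompt_tokens=1_000, flops_per_second=4e14)
decode = decode_seconds_per_token(7e9, bytes_per_param=2, bytes_per_second=2e12)

print(f"prefill, 1000 tokens: {prefill * 1e3:.1f} ms")  # 35.0 ms
print(f"decode, per token:    {decode * 1e3:.1f} ms")   # 7.0 ms
```

Under these assumptions a prompt token costs about 35 microseconds of accelerator time while an output token costs about 7 milliseconds. Batching narrows that gap in practice by amortizing each weight read across many sequences.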
Geography affects cost through more than electricity price. Regions with high energy tariffs often have cool climates that allow a low power usage effectiveness (PUE) via free-air cooling. Regions with cheap electricity can require expensive active cooling, eroding the tariff advantage. The effective energy cost per kWh of IT load is the product of power tariff and facility PUE; optimizing one without the other often produces a worse outcome than a competitor paying a higher rate in a more efficient building.
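The tariff and PUE interaction is a one-line product, but comparing two sites side by side shows how easily the ranking flips. The regions and figures below are hypothetical.

```python
def effective_energy_price(tariff_per_kwh: float, pue: float) -> float:
    """Price paid per kWh delivered to IT equipment.

    PUE is total facility energy divided by IT energy, so it is never
    below 1.0.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return tariff_per_kwh * pue


# Hypothetical sites: a cool region with a higher tariff and free-air
# cooling, and a hot region with a lower tariff and active cooling.
cool_region = effective_energy_price(tariff_per_kwh=0.12, pue=1.1)
hot_region = effective_energy_price(tariff_per_kwh=0.10, pue=1.5)

print(cool_region < hot_region)  # True
```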
Amortization assumptions can overturn raw performance advantages. An H100 offers superior throughput, but if your workload doesn't demand it, the higher CapEx can produce a worse LCOI than an older A100 would. Over a three-year depreciation cycle, utilization rate and residual resale value often matter more than the headline tokens-per-second figure.
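A simplified LCOI calculation shows how this plays out. The sketch below treats LCOI as net lifetime cost divided by tokens served over the depreciation period, with no discounting. Every figure is hypothetical and chosen only to illustrate a workload that needs 900 tokens per second; none is a real H100 or A100 price or benchmark.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def lcoi_per_million_tokens(capex: float, residual_value: float,
                            opex_per_year: float, peak_tokens_per_second: float,
                            utilization: float, years: int = 3) -> float:
    """Net lifetime cost divided by lifetime tokens served (undiscounted)."""
    net_cost = capex - residual_value + opex_per_year * years
    tokens = peak_tokens_per_second * utilization * SECONDS_PER_YEAR * years
    return net_cost / tokens * 1_000_000


# Both hypothetical accelerators serve the same 900 tokens/s of demand.
newer = lcoi_per_million_tokens(capex=30_000, residual_value=9_000,
                                opex_per_year=3_000,
                                peak_tokens_per_second=3_000, utilization=0.3)
older = lcoi_per_million_tokens(capex=10_000, residual_value=2_000,
                                opex_per_year=2_500,
                                peak_tokens_per_second=1_500, utilization=0.6)

print(older < newer)  # True
```

If demand grew enough to fill the newer accelerator, the comparison could reverse, which is why utilization and residual value belong in the model alongside throughput.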