Ask where a GPU spends its energy and the answer is counterintuitive: almost none of it on arithmetic. A fused multiply-accumulate — the fundamental operation of every neural network — costs a fraction of a picojoule. Fetching the weight that the operation needs costs several times more, because that weight lives in an HBM stack, and reaching it means a round trip out of the DRAM array, across an interposer, through a memory controller, and back. The math is nearly free. The commute is everything.
At large batch sizes you can hide the commute by amortizing each weight fetch across many operations. At batch one, the regime that governs interactive inference (the chatbot working on your next token), you cannot. Every generated token must stream the entire model out of memory once. That is why an HBM-bound flagship decodes an 80-billion-parameter model at roughly 110 tokens per second: not because it lacks compute, but because it is waiting on a few centimeters of package, over and over, billions of times a second.
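The arithmetic behind that figure can be sketched in a few lines. This is a rough model, not a description of any specific part: it assumes FP8 weights at one byte per parameter (the text does not state the weight format for the HBM-bound part), and it ignores KV-cache traffic and any compute ceiling. The function names are illustrative.

```python
# Back-of-the-envelope model of bandwidth-bound decode.
# Assumption (not stated in the text): FP8 weights, 1 byte per parameter.
# KV-cache traffic and compute limits are deliberately ignored.

def implied_bandwidth(tokens_per_s, params, bytes_per_param=1.0):
    """Bytes per second needed to stream every weight once per token."""
    return tokens_per_s * params * bytes_per_param

def decode_rate(bandwidth_bytes_per_s, params, bytes_per_param=1.0, batch=1):
    """Upper bound on tokens per second when weight streaming is the only limit.

    One pass over the weights serves `batch` tokens, which is the
    amortization that large batches rely on.
    """
    return batch * bandwidth_bytes_per_s / (params * bytes_per_param)

PARAMS = 80e9          # 80-billion-parameter model, from the text
TOKENS_PER_S = 110     # batch-one decode rate, from the text

bandwidth = implied_bandwidth(TOKENS_PER_S, PARAMS)
print(f"implied memory bandwidth: {bandwidth / 1e12:.1f} TB/s")
print(f"batch 1 ceiling: {decode_rate(bandwidth, PARAMS):.0f} tokens/s")
print(f"batch 8 ceiling: {decode_rate(bandwidth, PARAMS, batch=8):.0f} tokens/s")
```

Under these assumptions, 110 tokens per second on an 80B model corresponds to streaming about 8.8 TB of weights every second, which is the kind of figure a memory system has to sustain before compute enters the picture at all.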
Shorten the wire, don't widen the pipe
Every generation of HBM answers this the same way — wider pipe: more stacks, more channels, more bandwidth per pin. Sophon answers it by deleting the distance. When the weight is stored in a 2T0C cell a few hundred nanometers above the MAC that consumes it, the fetch is a single vertical hop through a monolithic inter-tier via. The whole journey is shorter than a wavelength of visible light. A read costs femtojoules; a gradient write lands at 20 fJ per bit; an FP8 multiply-accumulate completes at 0.310 picojoules all-in. Bandwidth stops being a budgeted, fought-over resource and becomes a simple property of where the data sits: 4.2 PB/s in-tile, with nothing ever crossing a package.
The memory wall was never a memory problem. It was always a distance problem — and distance is exactly what monolithic 3D removes.
PhantaField Architecture Team
The clearest place to see the difference is energy per token. Sophon spends 16.3 millijoules to decode a token of that 80B model; an HBM-bound part spends on the order of 6.4 joules at low batch — a roughly 390× gap. No incremental HBM generation closes a gap like that, because the gap is not about how fast memory runs. It is about how far the data has to travel to reach the math. We built the entire architecture around making that distance as close to zero as physics allows.
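The gap quoted above is straightforward to check from the two per-token figures. The short sketch below does so, and also derives the power draw those figures imply for the HBM-bound part. That second step rests on an assumption the text does not confirm: that the 6.4 J per token and the roughly 110 tokens per second describe the same part under the same conditions.

```python
# Energy-per-token comparison using the figures quoted in the text.
SOPHON_J_PER_TOKEN = 16.3e-3   # 16.3 mJ per decoded token
HBM_J_PER_TOKEN = 6.4          # "on the order of" 6.4 J at low batch

ratio = HBM_J_PER_TOKEN / SOPHON_J_PER_TOKEN
print(f"energy gap: {ratio:.0f}x")

# Assumption: the 110 tokens/s decode rate and the 6.4 J/token figure
# belong to the same HBM-bound part. Power is energy per token times rate.
HBM_TOKENS_PER_S = 110
hbm_power_watts = HBM_J_PER_TOKEN * HBM_TOKENS_PER_S
print(f"implied HBM-bound power while decoding: {hbm_power_watts:.0f} W")

# Energy to decode one million tokens on each part, in kilojoules.
TOKENS = 1_000_000
print(f"Sophon: {SOPHON_J_PER_TOKEN * TOKENS / 1e3:.1f} kJ")
print(f"HBM-bound: {HBM_J_PER_TOKEN * TOKENS / 1e3:.0f} kJ")
```

The ratio comes out just under 393, consistent with the rounded 390× in the text.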