The Memory Wall Is a Distance Problem

Ask where a GPU spends its energy and the answer is counterintuitive: almost none of it on arithmetic. A fused multiply-accumulate, the fundamental operation of every neural network, costs a fraction of a picojoule. Fetching the weight that the operation needs costs orders of magnitude more, because that weight lives in an HBM stack, and reaching it means a round trip out of the DRAM array, across an interposer, through a memory controller, and back. The math is nearly free. The commute is everything.

At large batch sizes you can hide the commute by amortizing each weight fetch across many operations. At batch one — the regime that governs interactive inference, the chatbot waiting on your next token — you cannot. Every generated token must stream the entire model out of memory exactly once. That is why a 2026 HBM4 flagship decodes an 80-billion-parameter model at roughly 300 tokens per second: not because it lacks compute, but because it is waiting on a few centimeters of package, over and over, billions of times an hour.
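The arithmetic behind that decode rate is short enough to sketch. The snippet below is a rough model, not a measurement: it assumes a dense model whose weights are stored in FP8 (one byte per parameter, an assumption the text does not state) and ignores KV-cache and activation traffic. The function names are illustrative only.

```python
# Back-of-envelope model of bandwidth-bound decode at batch size one.
# Assumption (not stated in the text): weights are FP8, one byte each,
# and weight streaming is the only memory traffic that matters.

def bytes_per_token(params: float, bytes_per_param: float) -> float:
    """Bytes of weights streamed to generate one token of a dense model."""
    return params * bytes_per_param


def implied_bandwidth(params: float, bytes_per_param: float,
                      tokens_per_s: float) -> float:
    """Sustained memory bandwidth (bytes/s) needed to hit a decode rate."""
    return bytes_per_token(params, bytes_per_param) * tokens_per_s


def decode_rate(params: float, bytes_per_param: float,
                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when decode is limited by weight streaming."""
    return bandwidth_bytes_per_s / bytes_per_token(params, bytes_per_param)


PARAMS = 80e9           # 80-billion-parameter model, from the text
BYTES_PER_PARAM = 1.0   # FP8 storage, assumed
TOKENS_PER_S = 300.0    # approximate decode rate quoted in the text

bw = implied_bandwidth(PARAMS, BYTES_PER_PARAM, TOKENS_PER_S)
print(f"implied sustained bandwidth: {bw / 1e12:.1f} TB/s")
```

Under these assumptions the quoted rate corresponds to about 24 TB/s of sustained weight traffic. In this model the decode rate scales linearly with bandwidth and not at all with compute, which is the point of the paragraph above.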

Sophon moves weights vertically through the die in microns, while die- and package-scale fabrics push the same data millimeters to centimeters.

Shorten the wire, don't widen the pipe

Every generation of HBM answers the memory wall the same way, with a wider pipe: more stacks, more channels, more bandwidth per pin. Sophon answers it by deleting the distance. When the weight is stored in a 2T0C cell a few hundred nanometers above the MAC that consumes it, the fetch is a single vertical hop through a monolithic inter-tier via. The whole journey is shorter than a wavelength of visible light. A read costs femtojoules; a gradient write lands at 20 fJ per bit; an FP8 multiply-accumulate completes at 0.310 picojoules all-in. Bandwidth stops being a budgeted, fought-over resource and becomes a simple property of where the data sits: 2.1 PB/s in-tile, with nothing ever crossing a package.

The memory wall was never a memory problem. It was always a distance problem — and distance is exactly what monolithic 3D removes.
PhantaField Architecture Team

The clearest place to see the difference is energy per token. Sophon spends 25.8 millijoules to decode a token of that 80B model; an HBM4 part spends on the order of 4.5 joules at low batch — a roughly 174× gap. No incremental HBM generation closes a gap like that, because the gap is not about how fast memory runs. It is about how far the data has to travel to reach the math. We built the entire architecture around making that distance as close to zero as physics allows.
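The per-token figures can be cross-checked against the per-operation figure quoted earlier. The sketch below uses only numbers from the text, plus one assumption of ours: that decoding a token of a dense model takes one multiply-accumulate per parameter. The variable names are illustrative.

```python
# Cross-check of the energy-per-token figures quoted in the text.
# Assumption (ours): one FP8 MAC per parameter per generated token.

PARAMS = 80e9                  # 80B-parameter model
SOPHON_J_PER_TOKEN = 25.8e-3   # 25.8 mJ per token, from the text
HBM4_J_PER_TOKEN = 4.5         # "on the order of 4.5 joules", from the text
FP8_MAC_J = 0.310e-12          # 0.310 pJ per FP8 MAC all-in, from the text

# Ratio of the two per-token energies.
gap = HBM4_J_PER_TOKEN / SOPHON_J_PER_TOKEN

# Energy spent on MACs alone for one token, under the assumption above.
mac_j_per_token = PARAMS * FP8_MAC_J

# Share of the Sophon per-token budget accounted for by those MACs.
mac_share = mac_j_per_token / SOPHON_J_PER_TOKEN

print(f"energy gap: {gap:.1f}x")
print(f"MAC energy per token: {mac_j_per_token * 1e3:.1f} mJ")
print(f"MAC share of Sophon token energy: {mac_share:.1%}")
```

On these inputs the ratio comes out near 174, matching the figure above, and the per-MAC energy alone accounts for roughly 24.8 mJ of the 25.8 mJ per token. That is consistent with the claim that almost all of the remaining energy goes to arithmetic and very little to moving data, though the one-MAC-per-parameter accounting is a simplification.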