Feed - arxlens

Your paper timeline

Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.


1 paper in cs.AR

Trending mixes fresh papers with community signal.

PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

physics.optics cs.AI cs.AR Hyoseok Park, Yeonsang Park · Mar 23, 2026

Long-context LLM inference hits a memory wall: each decode step requires scanning the entire KV cache, incurring an $O(n)$ memory-bandwidth cost that faster arithmetic cannot reduce. PRISM proposes a thin-film lithium niobate photonic accelerator that performs the block-selection similarity search in $O(1)$ optical latency using a broadcast-and-weight architecture, eliminating the $O(n)$ scan entirely. By matching photonic hardware capabilities (passive query broadcast, quasi-static microring weights, and low-precision rank output) to the selection task, the work claims $16\times$–$32\times$ traffic reduction at 64K–128K tokens and a four-order-of-magnitude energy advantage over GPU baselines.
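As a rough consistency check on the quoted traffic figures, suppose each decode step fetches only $k$ selected blocks of $B$ tokens instead of all $n$ cached tokens. The block size $B = 128$ below is an assumption chosen for illustration (this listing does not state it), $k = 32$ is the value used in the abstract's evaluation, and the cost of reading block signatures is ignored:

$$
\text{reduction} = \frac{n}{kB}, \qquad \frac{65{,}536}{32 \cdot 128} = 16, \qquad \frac{131{,}072}{32 \cdot 128} = 32 .
$$

Under these assumptions the ratio grows linearly with $n$, which is in line with the $16\times$ and $32\times$ figures quoted for 64K and 128K tokens; the paper's actual block size may differ.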

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step, a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm: the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as n increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
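The block-selection step described in the abstract can be illustrated with a small electronic stand-in. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the shared-full-scale quantizer (a software proxy for low-precision weights), and the 128-token block size in the usage note are all hypothetical. It scores each block signature against the query by inner product, keeps the top $k$ blocks by rank, and computes the resulting traffic ratio.

```python
def quantize_signatures(signatures, bits=4):
    """Quantize all signatures to signed integers with one shared full-scale
    range (an assumed stand-in for low-precision stored weights)."""
    levels = 2 ** (bits - 1) - 1
    full_scale = max(abs(x) for sig in signatures for x in sig) or 1.0
    return [[round(x / full_scale * levels) for x in sig] for sig in signatures]


def select_blocks(query, signatures, k):
    """Return the indices of the k blocks whose signatures have the largest
    inner product with the query. Only rank order is used, not score values."""
    scores = [sum(q * s for q, s in zip(query, sig)) for sig in signatures]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def traffic_reduction(n_tokens, block_size, k):
    """Ratio of tokens scanned by a full pass to tokens fetched after
    selection, ignoring the cost of reading the signatures themselves."""
    return n_tokens / (k * block_size)


# Toy example: four 2-dimensional block signatures, one query, keep 2 blocks.
toy_signatures = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]
selected = select_blocks(toy_query, quantize_signatures(toy_signatures), k=2)
```

In the toy example the two blocks most aligned with the query (indices 0 and 3) survive 4-bit quantization of the signatures, which is the sense in which only rank order needs to be preserved. Calling `traffic_reduction(65536, 128, 32)` with the assumed 128-token block size gives 16.0. Note that this software scoring loop still touches every signature, so it is linear in the number of blocks; the paper's claim is that the photonic engine evaluates all candidates in parallel instead.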

