DeepSeek DualPath and the Memory-Fabric Bottleneck in Agentic AI Inference
1. Executive Overview
Bottom Line. DeepSeek DualPath reframes agentic LLM inference as a memory-fabric, storage-I/O, and data-movement problem rather than a pure accelerator FLOPS problem. The key production trace is simple but powerful: DeepSeek reports agentic workloads averaging 157 rounds, 32.7K context tokens, only 429 appended tokens, a 98.7% KV-cache hit rate, and roughly 22 GB/PFLOP of cache-compute pressure for DeepSeek-V3.2. That workload shape makes historical context retrieval, not just new-token compute, the limiting path. The central investment conclusion is that agentic inference scales only when the cluster can keep GPUs fed with the right KV blocks at the right time, across HBM, host memory, SSD-backed storage, and the network fabric that connects them.
The practical implication is a broader AI infrastructure stack and a different way to underwrite GPU ROI. HBM remains essential for active execution, DRAM becomes a staging and metadata tier, enterprise SSD/NAND becomes a hot/warm persistent KV-cache tier, HDD stays mostly cold-tier, and RDMA/NIXL/GPUDirect/QoS-capable networking becomes the fabric that determines whether expensive accelerators are productive or waiting on data. The thesis is not that GPUs matter less; it is that agentic AI makes memory hierarchy, storage bandwidth, tail latency, and data movement first-order constraints on inference economics. Primary technical reference: DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference.
DualPath is a system-level paper with direct investment relevance: realized GPU ROI increasingly depends on whether the cluster can retrieve, stage, and move persistent KV cache fast enough to keep accelerators busy. The DeepSeek production trace is the key evidence: agentic trajectories average 157 rounds, average context reaches 32.7K tokens, average append length is only 429 tokens, and KV-cache hit rate is 98.7%. That profile is structurally different from single-turn chatbot inference. Each new agent turn reuses a large accumulated context and adds a small delta, so the marginal inference step is dominated by retrieving old KV cache rather than recomputing it. DualPath’s reported outcome is material: offline throughput improves by up to 1.87x and online serving throughput improves by an average of 1.96x without violating the paper’s SLOs. The investment implication is that agentic AI raises the strategic value of the entire memory hierarchy: HBM remains essential for active compute, host DRAM remains important for buffering and hot-cache management, NAND/eSSD becomes a latency-sensitive extension of the inference memory system, HDD remains mostly a cold data tier, and networking becomes a memory-fabric enabler rather than a peripheral infrastructure layer.
Memory Hierarchy and Beneficiary Map. The table below clarifies why DeepSeek DualPath is not a simple “more storage” thesis. Each tier has a different function, different latency tolerance, and different public-market exposure.
| Tier | Role in Agentic Inference | Likely Beneficiaries | Directness |
|---|---|---|---|
| HBM | Active KV, attention, decode, model execution, and hot working set that must remain closest to the GPU. | HBM suppliers and GPU platforms: SK hynix, Micron, Samsung, NVIDIA ecosystem. | HIGH |
| DRAM | L2 staging, metadata, prefetched cache, scheduler state, CPU-side orchestration, and host buffers. | Server DRAM suppliers, SOCAMM/CXL/platform vendors: Micron, Samsung, SK hynix, broader memory-expansion ecosystem. | MED |
| Enterprise SSD / NAND | Persistent hot/warm KV cache, large shared capacity, high read bandwidth, IOPS, low tail latency, and GPU-adjacent data movement. | NAND/eSSD suppliers, SSD controllers, and storage vendors: Micron, SK hynix/Solidigm, Samsung, Kioxia/SanDisk, Dell and peers. | HIGH |
| HDD | Cold logs, training data, generated artifacts, archives, checkpoints, corpora, and low-temperature tiers. | HDD suppliers benefit from AI data growth, but not the hot KV-cache serving path: Seagate and Western Digital. | LOW |
The paper’s most important market implication is not that storage replaces memory, or that NAND replaces HBM. The more precise conclusion is that agentic inference expands the addressable performance-sensitive storage tier between DRAM and HDD, and it does so at data-center scale. HBM is still the highest-value memory in the system because model weights, active KV blocks, attention compute, and decode execution remain HBM-bound. However, the paper demonstrates that persistent KV working sets can exceed economic HBM and host DRAM capacity, forcing a larger role for SSD-backed distributed storage. This is directly aligned with recent supplier positioning: Micron is describing NAND demand acceleration from vector databases and KV-cache offload, Samsung is explicitly discussing PCIe Gen6 eSSD products focused on KV-cache storage demand, and Kioxia is developing GPU-accessible “Super High IOPS” SSDs for NVIDIA Storage-Next architectures.
The first-order beneficiaries are suppliers that combine high-bandwidth memory, server DRAM, enterprise SSDs, high-density NAND, and AI networking exposure. SK hynix has the cleanest HBM-plus-server-DRAM-plus-Solidigm-QLC-eSSD narrative. Micron has the cleanest U.S.-listed broad memory exposure across HBM4, SOCAMM2, PCIe Gen6 SSDs, DRAM, and NAND. Samsung has the broadest vertical portfolio across HBM, DRAM, NAND, eSSD, logic base dies, foundry, and packaging, with execution risk still central to investor debate. SanDisk is the most levered NAND/eSSD pure play after the Western Digital flash separation and shows unusually high operating leverage to current AI storage pricing. Western Digital and Seagate are indirect beneficiaries through nearline HDD data growth, but the paper’s live KV-cache tier is structurally flash-centric rather than disk-centric. Networking beneficiaries include NVIDIA, Broadcom, Marvell, Astera Labs, Arista, Credo, Coherent, Lumentum, and the broader optical and high-speed interconnect chain, because DualPath depends on RDMA, QoS, low tail latency, GPU-direct data movement, and high-radix fabrics.
2. DualPath Architecture and Performance Evidence
The paper studies agentic LLM inference where a model participates in many tool-using, reasoning, or environment-interaction rounds. In each round, the request context contains a growing accumulated context, while the newly appended user, tool, or environment text is relatively short. The prefill phase must ingest the full request context, while the decode phase generates the next output tokens. In conventional serving, the KV cache generated from earlier context can be reused instead of recomputed, but the cache must be stored somewhere. For short-context inference, active KV can often stay inside GPU HBM or host DRAM. For long, multi-turn agentic trajectories, the KV cache becomes too large to keep fully resident in GPU memory, and large-scale DRAM-only cache pools become expensive, capacity-constrained, and operationally complex. The paper therefore assumes external SSD-backed storage for persistent KV cache and analyzes the bandwidth imbalance that emerges when prefill engines repeatedly read large KV blocks from storage.
The paper’s most important contribution is showing that agentic inference changes the critical resource. In a single-turn or low-reuse workload, prefill compute and decode execution dominate the serving profile. In a multi-turn agentic trajectory, the model repeatedly receives a large accumulated context plus a small appended delta. The old context becomes a KV-cache retrieval problem, while the new tokens represent only a small amount of incremental computation. At the paper’s reported 64K trace point, average context is 32,721 tokens, average append length is 429 tokens, and the implied KV-cache hit rate is 98.7%. That shifts the bottleneck toward reading historical KV blocks from external storage. In a PD-disaggregated cluster, this storage read pressure is concentrated on prefill engines, even though decode engines retain idle storage-NIC bandwidth. DualPath exploits that idle bandwidth by letting decode-side storage NICs ingest KV blocks and move them to prefill engines across the compute fabric.
Agentic Workload Trace: Why KV-Cache I/O Dominates. The compact trace table below is the cleanest way to show why this is an I/O thesis rather than a GPU-only thesis.
| Metric | DeepSeek Trace Evidence | Why It Matters | Signal |
|---|---|---|---|
| Mean rounds | 157 rounds per agentic trajectory. | Long-running agents repeatedly revisit accumulated context rather than serving isolated prompts. | HIGH |
| Average context | 32.7K tokens. | The historical prefix becomes large enough to make KV-cache retrieval a first-order serving cost. | HIGH |
| Mean append length | 429 tokens. | Each turn adds a small delta, so marginal compute is small relative to cached state that must be loaded. | HIGH |
| KV-cache hit rate | 98.7%. | High reuse is the paradox: recomputation falls, but external KV-cache read pressure rises. | HIGH |
| Cache-compute ratio | About 22 GB/PFLOP for DeepSeek-V3.2. | The bottleneck shifts from accelerator FLOPS toward SSD, NIC, RDMA, and scheduling bandwidth. | HIGH |
The critical insight is that a high KV-cache hit rate is not purely good news. It reduces recomputation, but it also forces the serving system to retrieve a very large historical prefix from a lower-cost tier on nearly every turn. That is why SSD bandwidth, RDMA, NIXL-style transfer abstraction, QoS, and scheduler placement become first-order determinants of inference economics.
Why DualPath Works: Bottleneck Mechanics. The table below distills the paper’s technical proof into the investable sequence: workload shape, cache reuse, prefill-side storage saturation, stranded decode-side bandwidth, and dual-path loading.
| Mechanism | Paper Evidence | System Implication | Investment Read-Through | Signal |
|---|---|---|---|---|
| Long-context, short-append trajectory | 64K trace averages 157 turns, 32,721 context tokens, 429 appended tokens, and 176 generated tokens. | Most tokens are historical context, not fresh compute. | Persistent KV-cache storage becomes part of inference economics. | HIGH |
| Very high KV-cache hit rate | 98.7% hit rate in the representative DeepSeek trace; the paper notes agentic workloads typically exceed 95% reuse. | Serving shifts from recomputation to cache retrieval. | SSD read bandwidth, IOPS, and tail latency matter more in agentic serving than in conventional chat. | HIGH |
| Prefill-side storage NIC saturation | A standard node has 8x400Gbps compute NICs but only 1x400Gbps storage NIC; baseline concentrates KV reads on prefill-side SNICs. | GPUs idle while waiting for KV blocks despite available accelerator compute. | Balanced AI infrastructure matters more than GPU count alone. | HIGH |
| Decode-side storage NIC underutilization | Existing PD designs leave decode-side storage NIC bandwidth largely idle while prefill-side SNICs saturate. | Aggregate cluster storage bandwidth is stranded. | Scheduling and fabric design can unlock latent throughput without one-for-one hardware additions. | HIGH |
| Dual-path KV loading | DE-read path loads KV through decode storage NIC, then transfers to prefill over RDMA compute network. | Storage ingress becomes globally pooled and schedulable. | Positive for RDMA fabrics, high-radix switches, NICs, DPU/SmartNIC software, SSD QoS, and inference runtimes. | HIGH |
The production trace distribution is worth showing because it demonstrates that the DualPath result is not driven by a single 64K extreme. Across 32K, 48K, and 64K maximum context datasets, the agentic pattern remains consistent: turn count increases materially, context grows rapidly, appended tokens remain in the hundreds, and generated output remains modest. That is the signature of cache-reuse-heavy inference. The more production agents move toward long-running coding, research, browser, tool-use, and enterprise automation sessions, the more relevant this workload shape becomes.
Production Agent Trace Shape. The table below shows the workload distribution the paper uses to motivate SSD-backed KV-cache serving across 32K, 48K, and 64K context settings.
| Max Context | Avg Turns | Avg Append Tokens | Avg Generated Tokens | Avg Total Tokens | Avg Context Tokens | Investment Relevance |
|---|---|---|---|---|---|---|
| 32K | 60 | 608 | 148 | 28,639 | 17,183 | Even shorter agent traces show high context accumulation versus small appends. |
| 48K | 106 | 474 | 172 | 42,607 | 25,120 | Context load grows faster than appended-token compute. |
| 64K | 157 | 429 | 176 | 55,958 | 32,721 | The cleanest paper datapoint for SSD-backed KV pressure. |
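As a rough cross-check on the trace table, the implied KV-cache hit rate can be approximated from the average context and append lengths. The sketch below assumes the hit rate is the share of context tokens that were not newly appended in the current turn (1 - append/context). That definition is an illustrative assumption rather than the paper’s stated formula, although it reproduces the reported 98.7% for the 64K trace.

```python
# Average trace statistics from the table above:
# max-context setting -> (avg context tokens, avg append tokens).
TRACES = {
    "32K": (17_183, 608),
    "48K": (25_120, 474),
    "64K": (32_721, 429),
}


def implied_hit_rate(context_tokens: int, append_tokens: int) -> float:
    """Approximate share of context served from cache (assumed definition)."""
    return 1.0 - append_tokens / context_tokens


hit_rates = {name: implied_hit_rate(ctx, app) for name, (ctx, app) in TRACES.items()}

for name, rate in hit_rates.items():
    print(f"{name}: implied hit rate {rate:.1%}")
```

Under this assumption the implied hit rate rises with maximum context, which is consistent with the table’s pattern of growing context and shrinking appends.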
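Key Reported Metrics. The table below summarizes the paper’s headline results and their investment relevance.
| Metric | Reported Detail | Investment Relevance | Priority |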
|---|---|---|---|
| Trajectory profile | DeepSeek traces average 157 rounds, 32.7K-token context, 429-token append length, and 98.7% KV-cache hit rate. | Agentic inference becomes a repeated memory-retrieval workload rather than a single-turn compute-only workload. | HIGH |
| Throughput uplift | Offline throughput improves up to 1.87x; online serving throughput improves 1.96x on average while preserving TTFT and TPOT service objectives. | Data-path optimization can materially raise realized GPU utilization and lower effective inference cost. | HIGH |
| Node-level asymmetry | An 8-GPU Hopper node has 8x400Gbps compute NIC bandwidth but only 1x400Gbps storage NIC bandwidth. | Storage ingress, not accelerator FLOPS, can become the marginal bottleneck. | HIGH |
| Working-set scale | DeepSeek 660B KV working set ranges from 69GB at 0.1 agents/s to 681GB at 0.45 agents/s in the reported setup. | Persistent KV state can exceed economical HBM and host DRAM residency, pulling SSDs into the serving path. | HIGH |
| Ablation evidence | Layerwise prefill reduces completion time 17.21%; dual-path loading reduces it 38.19%; scheduling reduces it 45.62%. | The largest gains come from data-path utilization and scheduling, not only faster accelerators. | MED |
The paper’s architectural baseline is important because it matches the topology of many modern AI clusters. Each node has 8 Hopper GPUs, each GPU is paired with a 400Gbps compute NIC, and the node also has a storage NIC with up to 400Gbps bandwidth. The compute and storage networks are isolated. This creates a highly asymmetric I/O structure: a node has 8x400Gbps of compute-network bandwidth but only 1x400Gbps of storage-network bandwidth. In the baseline design, prefill engines read KV cache directly from the storage system through their own storage NICs. When agentic workloads produce large KV-cache reads, the prefill storage NIC saturates and GPUs on prefill engines wait for data, while the storage NICs on decode engines sit largely idle because they are not serving equivalent read pressure. The paper’s Figure 1 captures the imbalance: the existing path shows 100% storage utilization and roughly 40% GPU utilization, while DualPath uses additional paths to push GPU utilization toward roughly 80% under the illustrated setup.
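The node-level asymmetry can be made concrete with simple unit conversion. The sketch below converts the NIC line rates quoted above into GB/s and computes the compute-to-storage bandwidth ratio. It ignores protocol overhead, so the figures are nominal ceilings rather than measured throughput.

```python
BITS_PER_BYTE = 8


def nic_gbytes_per_s(count: int, gbps_each: float) -> float:
    """Nominal aggregate bandwidth in GB/s for `count` NICs at `gbps_each` Gbps."""
    return count * gbps_each / BITS_PER_BYTE


compute_bw = nic_gbytes_per_s(8, 400)  # eight 400Gbps compute NICs per node
storage_bw = nic_gbytes_per_s(1, 400)  # one 400Gbps storage NIC per node
asymmetry = compute_bw / storage_bw

print(f"compute fabric: {compute_bw:.0f} GB/s per node")
print(f"storage ingress: {storage_bw:.0f} GB/s per node")
print(f"compute-to-storage ratio: {asymmetry:.0f}x")
```

The nominal 50 GB/s storage figure is the per-node ceiling that a saturated prefill-side storage NIC cannot exceed in the baseline design.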
DualPath’s core mechanism is conceptually simple but operationally nontrivial. KV blocks can be loaded through two paths. The PE-read path loads KV cache from storage directly into the prefill engine. The DE-read path loads KV cache through a decode engine’s storage NIC and then transfers it to the target prefill engine over the compute network using RDMA. This design exploits the otherwise idle storage bandwidth attached to decode nodes and the much larger aggregate bandwidth of the compute fabric. The system also uses layerwise prefill, where only the KV cache for one layer is held in GPU memory at a time, increasing effective batch size by approximately the number of layers and reducing active HBM residency. The traffic manager handles host-to-device, device-to-host, prefill-to-decode, decode-to-prefill, and storage flows; the scheduler assigns prefill and decode work while balancing storage queues and attention execution time.
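The two loading paths can be illustrated with a toy path selector. The sketch below is a greedy least-loaded heuristic written for illustration only. It is not the paper’s scheduler, and the `StorageNic` class, node names, and byte counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageNic:
    node: str              # hypothetical node name
    queued_bytes: int = 0  # bytes of pending storage reads on this NIC


def choose_read_path(target: StorageNic, decode_nics: list, kv_bytes: int) -> tuple:
    """Route a KV read through the least-loaded storage NIC.

    Returns (path, node). "PE-read" means the prefill node reads directly;
    "DE-read" means a decode node reads from storage and would then forward
    the blocks to the prefill node over the compute network.
    """
    candidates = [target] + decode_nics
    # min() returns the first minimum, so ties favor the direct PE-read path.
    best = min(candidates, key=lambda nic: nic.queued_bytes)
    best.queued_bytes += kv_bytes
    path = "PE-read" if best is target else "DE-read"
    return path, best.node


prefill = StorageNic("p0")
decodes = [StorageNic("d0"), StorageNic("d1")]
paths = [choose_read_path(prefill, decodes, kv_bytes=1_000) for _ in range(4)]
print(paths)
```

With equal-sized reads, the selector rotates across the prefill NIC and both decode NICs, which is the pooling effect the DE-read path is meant to unlock. A production scheduler would also need to account for compute-fabric contention and per-layer deadlines.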
The key technical insight is that storage bandwidth does not need to be attached only to the node performing prefill. For KV-cache reads, decode nodes can act as distributed I/O ingress points. The prefill engine needs KV data in GPU memory at the time each layer is executed, but the data can be pulled through another node’s storage NIC and moved over the compute fabric, provided the RDMA path is fast, predictable, and isolated from latency-sensitive collectives. This converts the idle decode storage NIC into a useful bandwidth source and uses the compute network as a temporary memory-transfer fabric. Under the paper’s bottleneck-free analysis, for 8 GPUs per node, 1 storage NIC per node, memory bandwidth of roughly 500GB/s, and storage bandwidth of roughly 50GB/s, the prefill-to-decode engine ratio can be bottleneck-free across a wide range from 1/7 to 7/2. That range matters because it implies the architecture can tolerate meaningful scheduling variability without requiring a perfectly fixed prefill/decode mix.
The paper’s performance results are meaningful but should be interpreted as system-specific rather than universal. Evaluation uses an in-house LLM inference stack, 3FS storage with no DRAM cache inside the storage layer, an io_uring-like interface, and Hopper GPUs. The models are DeepSeek 660B (DS660), DeepSeek 27B (DS27), and Qwen 32B, and the datasets contain 500 trajectories with maximum contexts of 32K, 48K, and 64K tokens. Default configurations include 2P4D for DeepSeek 660B, 1P2D for Qwen 32B, and 1P1D for DeepSeek 27B. Online serving uses TTFT of no more than 4s and TPOT of no more than 50ms as SLO criteria. Offline job completion improves by up to 1.87x for DS660 and up to 1.78x for DS27 versus the basic baseline. Across prefill/decode ratios, DualPath averages 1.64x and reaches up to 2.46x in the reported DS27 experiments. Online serving capacity improves 1.67x for DS27 and 2.25x for DS660, an average of 1.96x.
The evaluation is production-relevant but not universal. DualPath is implemented with roughly 5K lines of changes on a DeepSeek inference stack using FlashMLA, DeepGEMM, DeepEP, 3FS, and an io_uring-like storage interface. The testbed uses 8 Hopper GPUs per node, 8 400Gbps RDMA compute NICs, and one storage NIC connected to 3FS. The 3FS cluster has no DRAM cache inside the storage layer and can saturate the 400Gbps storage NIC. The models are DeepSeek-V3.2 660B, a 27B downscaled DeepSeek model, and Qwen2.5-32B. SGL(MC) uses SGLang with HiCache, Mooncake Store, 3FS, and Mooncake Transfer Engine, but the paper explicitly warns that comparing DualPath and SGL(MC) is not apples-to-apples because implementation differences are material. The cleanest comparison is therefore DualPath versus the Basic DeepSeek framework, with Oracle serving as a zero-I/O upper bound.
The ablation results are critical because they isolate where the gains come from. Layerwise prefill alone reduces job completion time by an average of 17.21%. Dual-path loading reduces job completion time by an average of 38.19%. Adding scheduling reduces job completion time by an average of 45.62%. This means the largest source of improvement is not a model-level change, a memory device change, or a raw GPU upgrade. The largest source is data-path utilization: using idle storage NIC bandwidth and the compute network to feed prefill engines. The scheduler further improves performance by reducing imbalance: storage NIC load balance improves from a max/average ratio of 1.53 to 1.18, while attention execution imbalance reaches as low as 1.06 in the first 5% of the reported trajectory distribution.
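The ablation percentages are reductions in job completion time, which can be translated into approximate speedup factors for comparison with the headline throughput numbers. The sketch below assumes each reduction is measured against the same baseline and uses speedup = 1 / (1 - reduction). Because the paper reports averages, the converted factors are indicative rather than exact.

```python
# Average job-completion-time reductions from the ablation, as fractions.
REDUCTIONS = {
    "layerwise prefill": 0.1721,
    "dual-path loading": 0.3819,
    "dual-path loading + scheduling": 0.4562,
}


def speedup(reduction: float) -> float:
    """Convert a fractional completion-time reduction into a speedup factor."""
    return 1.0 / (1.0 - reduction)


speedups = {step: speedup(r) for step, r in REDUCTIONS.items()}

for step, factor in speedups.items():
    print(f"{step}: about {factor:.2f}x")
```

On this conversion, the full configuration lands near 1.84x, in the same range as the reported offline uplift of up to 1.87x.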
The large-scale experiment is also strategically important. The paper reports near-linear scaling from 2P4D with 2K agents and 3,167s offline job completion time to 48P96D with 48K agents and 3,201s job completion time. Online serving scales from 2P4D at 0.4 agents per second to 44P88D at 8.8 agents per second while keeping TTFT, TTST, and TPOT broadly stable. Scheduler CPU usage remains below 10 CPU cores in those experiments. These results support the claim that the architecture is not only a single-node trick; it can scale across a large disaggregated inference cluster when network QoS, storage bandwidth, and scheduler design are properly engineered.
The large-scale result should also be framed carefully. The largest offline configuration, 48P96D, spans 1,152 GPUs, and its 3,201s completion time for 48K agents against 3,167s for 2K agents on 2P4D indicates near-linear throughput scaling. However, the authors note that the large-scale setup does not prove additional JCT or serving-capacity gains versus multiple smaller units of equivalent cost because P/D ratios and parallelism were not exhaustively tuned. The strategic value of scale is therefore flexibility: larger deployments reduce fragmentation, provide more options for P/D-ratio tuning, and create more scheduling opportunities under bursty online arrivals.
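The near-linear scaling claim can be checked with the reported offline figures. The sketch below computes scaling efficiency as the throughput gain divided by the node-count gain. It treats "2K" and "48K" agents as 2,000 and 48,000, which is an assumption about the paper’s notation, and it assumes each P or D unit is one 8-GPU node.

```python
GPUS_PER_NODE = 8


def throughput(agents: int, seconds: float) -> float:
    """Agents completed per second for an offline job."""
    return agents / seconds


small_nodes = 2 + 4    # 2P4D
large_nodes = 48 + 96  # 48P96D

small_tp = throughput(2_000, 3_167)
large_tp = throughput(48_000, 3_201)

node_scale = large_nodes / small_nodes
efficiency = (large_tp / small_tp) / node_scale
total_gpus = large_nodes * GPUS_PER_NODE

print(f"node scale-up: {node_scale:.0f}x, GPUs: {total_gpus}")
print(f"scaling efficiency: {efficiency:.1%}")
```

An efficiency just under 99% on a 24x scale-up is what "near-linear" means here. It does not by itself show that one large cluster outperforms many small ones of equal cost.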
The paper’s working-set analysis is particularly relevant to the memory/storage value chain. For the DeepSeek 660B setting, DualPath’s KV working set ranges from 69GB at 0.1 agents per second to 681GB at 0.45 agents per second, even within the paper’s controlled setup. The authors further state that production gaps between turns can expand working set by r² and cost by r³ under their experiment framing. This reinforces the central hardware conclusion: even with efficient KV reuse and scheduling, the persistent KV cache can quickly exceed the economically practical size of GPU HBM and can pressure host DRAM. The natural storage tier for the cold-to-warm but still performance-sensitive portion of KV cache is therefore SSD-backed distributed storage, not HDD and not HBM alone.
3. Ecosystem Validation: KV-Cache Systems Converge
DeepSeek DualPath should not be read as a one-off academic result. The broader inference ecosystem is moving in the same direction: hierarchical KV-cache retention, storage-aware routing, RDMA/GDS/NIXL data movement, and software schedulers that treat memory and storage as part of the serving fabric. NVIDIA Dynamo/NIXL, SGLang HiCache, LMCache, DeepSeek 3FS, Dell Storage Engines, and WEKA each validate a different layer of the same architecture shift.
KV-Cache System Evidence Matrix. The table below summarizes the highest-signal external evidence and keeps the vendor benchmarks compact. Exact speedups vary by model, input length, cache hit rate, interconnect, benchmark design, and SLO target, so the right interpretation is ecosystem convergence rather than a single universal multiplier.
| System | Key Architecture | Reported Evidence | Investment Read-Through |
|---|---|---|---|
| DeepSeek DualPath | Storage-to-prefill plus storage-to-decode-to-prefill via RDMA; scheduler balances prefill/decode bandwidth. | Up to 1.87x offline throughput and 1.96x average online throughput without violating SLO. | Primary proof that agentic serving can become storage-NIC and memory-fabric bound. |
| NVIDIA Dynamo / NIXL | Hierarchical KV manager, smart router, and NIXL movement across HBM, CPU, SSD, object/file/block storage, and fabrics. | NVIDIA cites up to 30x request serving on DeepSeek-R1 on Blackwell and more than 2x Llama 70B throughput on Hopper. | NVIDIA is productizing KV-cache routing/offload as a platform function, not a research curiosity. |
| SGLang HiCache | GPU/CPU/storage hierarchical cache with 3FS, Mooncake, NIXL, and local-file backends. | Reported up to 6x throughput and up to 80% TTFT reduction; 3FS integration doubled throughput and lifted hit rate from 40% to 80%. | Open-source serving stacks are converging on multi-tier KV-cache retention. |
| LMCache | CPU and local-disk KV offload, LRU eviction, async put, blocking get, and disk-to-CPU prefetch. | Documented disk-offload example: cold TTFT 6.314s vs warm TTFT 0.148s. | Developer-level evidence that warm KV reuse can collapse TTFT when recomputation is avoided. |
| DeepSeek 3FS | RDMA + SSD distributed filesystem for training and inference, with KVCache listed as an inference use case. | README benchmarks cite 6.6 TiB/s aggregate read throughput and up to 40 GiB/s KVCache client read throughput. | Makes the NAND/eSSD thesis concrete: persistent KV storage needs high-throughput distributed NVMe. |
| Dell Storage Engines | PowerScale/ObjectScale integrated with vLLM, LMCache, and NVIDIA NIXL for KV-cache offload. | Dell reports roughly 1-second TTFT at 131K context versus more than 17 seconds for standard vLLM. | Direct public-company validation that enterprise storage can sit in the hot inference path. |
The investment consequence is that inference frameworks are turning KV cache into a managed, movable asset. Once cache location, transfer path, and reuse probability become scheduler inputs, value accrues to platforms that can combine GPUs, HBM, DRAM, SSDs, NICs, switches, and software control planes into one low-latency serving system.
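The TTFT figures in the evidence matrix can be expressed as ratios to make the vendor and open-source claims comparable. The sketch below uses only the numbers quoted in the table. The Dell ratio is a lower bound because the source states "more than 17 seconds" against "roughly 1 second".

```python
def ttft_ratio(cold_seconds: float, warm_seconds: float) -> float:
    """How many times faster the warm (cached) TTFT is than the cold TTFT."""
    return cold_seconds / warm_seconds


lmcache_ratio = ttft_ratio(6.314, 0.148)  # LMCache disk-offload example
dell_ratio_floor = ttft_ratio(17.0, 1.0)  # Dell at 131K context, approximate floor

print(f"LMCache warm vs cold TTFT: about {lmcache_ratio:.1f}x")
print(f"Dell offload vs standard vLLM TTFT: at least about {dell_ratio_floor:.0f}x")
```

These ratios come from different models, context lengths, and test setups, so they indicate direction rather than a common multiplier.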
4. AI Infrastructure Capex Implications
DualPath implies that agentic inference CAPEX cannot be analyzed by counting GPUs alone. A cluster with more GPU FLOPS can still produce poor realized throughput if prefill engines cannot ingest KV cache fast enough. The paper shows a hardware trend that worsens this issue: from Ampere to Blackwell, GPU compute improves by 28.8x, PCIe bandwidth improves by only 2.0x, GPU memory capacity improves by 2.4x, and the I/O-to-compute ratio deteriorates by 14.4x. This is the same economic pattern seen in other accelerator bottlenecks: as compute scales faster than memory capacity and off-package bandwidth, infrastructure value migrates toward the components that keep accelerators utilized. In this case, that includes HBM, SSDs, storage NICs, compute NICs, switches, PCIe/CXL fabrics, DMA engines, filesystem software, and network QoS.
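The deterioration figure follows directly from the two growth factors quoted above. The sketch below reproduces the 14.4x I/O-to-compute deterioration and, as a derived figure not stated in the source, the corresponding decline in memory capacity per unit of compute.

```python
# Ampere-to-Blackwell improvement factors quoted from the paper.
COMPUTE_GAIN = 28.8
PCIE_BANDWIDTH_GAIN = 2.0
MEMORY_CAPACITY_GAIN = 2.4

# How much faster compute grew than each supporting resource.
io_to_compute_deterioration = COMPUTE_GAIN / PCIE_BANDWIDTH_GAIN
capacity_to_compute_deterioration = COMPUTE_GAIN / MEMORY_CAPACITY_GAIN  # derived, not in the source

print(f"I/O-to-compute ratio deteriorates by {io_to_compute_deterioration:.1f}x")
print(f"capacity-to-compute ratio deteriorates by {capacity_to_compute_deterioration:.1f}x")
```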
The practical capex implication is a stack-level purchasing problem. GPUs execute active prefill and decode, HBM holds active weights and KV blocks, host DRAM/SOCAMM/CXL supports metadata and staging, enterprise SSDs hold persistent read-heavy KV, and the compute fabric redistributes KV from decode-side storage ingress to prefill engines. Weakness in any tier can strand accelerator spend. The paper therefore supports a procurement frame based on realized throughput per balanced rack, not only peak FLOPS per accelerator.
Vendor and open-source benchmarks now validate that this is becoming product infrastructure rather than paper-only architecture. NVIDIA Dynamo/NIXL provides the data-movement abstraction, LMCache shows practical CPU/disk offload and prefetch, SGLang HiCache shows GPU/CPU/storage hierarchical caching, DeepSeek 3FS shows distributed RDMA/NVMe storage for KVCache, and Dell’s PowerScale/ObjectScale validation shows that public-company storage systems can participate directly in the hot inference path. Dell reported roughly 1-second TTFT at 131K context versus more than 17 seconds for standard vLLM in its vLLM + LMCache + NVIDIA NIXL testing, which is meaningful public-company evidence even after applying the appropriate vendor-benchmark haircut.
The paper also suggests that inference infrastructure will become more heterogeneous by workload type. Training clusters optimize for all-reduce, model-parallel bandwidth, checkpointing, and dataset streaming. Conventional inference clusters optimize for latency, batching, KV cache residency, and model serving economics. Agentic inference adds repeated storage-to-GPU KV-cache movement, long-context request reuse, fine-grained state persistence, and bursty multi-agent concurrency. This workload mix creates a new performance-sensitive storage tier whose value is based less on $/TB alone and more on a composite of $/TB, GB/s, IOPS, latency distribution, power per TB, endurance, RDMA integration, and GPU-accessible memory semantics. That changes the relative strategic importance of enterprise SSDs versus consumer SSDs, nearline HDDs, and commodity client NAND.
The economic sign of DualPath is nuanced. On one side, the technique increases throughput per installed GPU and therefore could reduce the number of GPUs required for a given agentic inference workload. On the other side, higher GPU utilization improves the return on accelerator CAPEX and can stimulate additional workload deployment by lowering effective inference cost. In most AI infrastructure cycles, utilization-improving innovations have not reduced absolute compute demand; they have increased the set of economically viable applications. The more direct implication is that future clusters will be bought and benchmarked as balanced systems. Accelerator buyers will increasingly evaluate whether storage and fabric can preserve TTFT and TPOT under long-context, high-hit-rate agentic workloads rather than evaluating only peak TFLOPS or HBM capacity.
The investment committee relevance is that the value chain shifts from “GPU scarcity only” to “balanced AI factory scarcity.” A fully utilized agentic inference cluster requires HBM for active compute, DRAM for staging and metadata, NAND/eSSD for persistent hot/warm KV, HDD for colder durable data, and high-bandwidth networks to make these tiers behave like a coherent serving substrate. The paper therefore supports a broad memory and interconnect basket, but with differentiated sensitivity. HBM suppliers benefit from continued accelerator growth. NAND and enterprise SSD suppliers receive the most differentiated incremental signal because KV-cache offload turns flash into part of the inference execution path. HDD suppliers benefit indirectly through persistent AI data growth but are not well suited for live KV-cache serving. Networking suppliers benefit because RDMA, QoS, congestion control, and high-radix switching become determinants of inference throughput and tail latency.
5. HBM Implications
HBM remains central, but DualPath changes the interpretation of HBM scarcity. The paper does not argue that HBM demand weakens because KV can be stored on SSD. It argues that HBM is too valuable and too capacity-constrained to serve as the persistent repository for large multi-turn KV working sets. HBM should be reserved for active model weights, active KV blocks, attention computation, and decode-state execution. Persistent KV must be tiered out to host DRAM and SSD when context length, agent count, and inter-turn gaps grow. This distinction is important for stock selection: HBM demand remains driven by accelerator platforms and model sizes, while the incremental agentic inference bottleneck creates a parallel opportunity in SSD and fabric rather than a substitution away from HBM.
Layerwise prefill is the key HBM optimization in the paper. By holding only one layer’s KV cache in GPU memory at a time during prefill, the system reduces active HBM residency and increases effective batch size. This is a genuine mitigation to HBM capacity pressure, but it does not eliminate HBM importance. Instead, it changes the HBM job from “store all historical KV” to “consume streamed KV at high utilization.” HBM bandwidth and capacity still determine how quickly the GPU can consume the KV data once it arrives, how large the active batch can be, and how much decode state can remain resident. HBM upgrades are therefore complementary to DualPath: more HBM improves active execution capacity, while DualPath improves the ability to feed that execution capacity from external storage.
The cache-compute ratios in the paper show why HBM alone cannot absorb the problem. For 429-token appends and context lengths from 16K to 64K tokens, the paper reports cache-compute ratios ranging from 117GB/PFLOP to 267GB/PFLOP for Qwen2.5-32B FP16, 47GB/PFLOP to 95GB/PFLOP for GPT-OSS-120B, 39GB/PFLOP to 60GB/PFLOP for Qwen3-235B-A22B, 13GB/PFLOP to 36GB/PFLOP for DeepSeek-V3.2 660B, and 4.8GB/PFLOP to 5.8GB/PFLOP for DeepSeek-V3 660B. The wide range by architecture matters: KV footprint optimization can materially reduce storage pressure, but the common direction is still that long-context agentic inference produces large reusable KV loads relative to incremental compute. HBM capacity additions help, but the persistent KV footprint scales beyond local accelerator memory in the workloads the paper targets.
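A cache-compute ratio becomes a bandwidth requirement once it is multiplied by a sustained compute rate. The sketch below reads GB/PFLOP as gigabytes of KV cache loaded per PFLOP of work performed, which is an interpretation of the metric for illustration, and asks how much sustained compute a single nominal 50 GB/s storage NIC can feed. The sustained-compute inputs are hypothetical.

```python
STORAGE_NIC_GB_PER_S = 50.0  # nominal ceiling of one 400Gbps storage NIC


def required_read_bandwidth(ratio_gb_per_pflop: float, sustained_pflops: float) -> float:
    """KV read bandwidth (GB/s) needed to sustain a given compute rate (PFLOP/s)."""
    return ratio_gb_per_pflop * sustained_pflops


def compute_ceiling(ratio_gb_per_pflop: float, read_gb_per_s: float) -> float:
    """Maximum sustained compute (PFLOP/s) that a given read bandwidth can feed."""
    return read_gb_per_s / ratio_gb_per_pflop


# Ratios quoted in the text: about 22 GB/PFLOP for DeepSeek-V3.2 at the trace
# average, and 267 GB/PFLOP as the upper end for Qwen2.5-32B FP16.
for label, ratio in [("DeepSeek-V3.2 (~22 GB/PFLOP)", 22.0), ("Qwen2.5-32B FP16 (267 GB/PFLOP)", 267.0)]:
    ceiling = compute_ceiling(ratio, STORAGE_NIC_GB_PER_S)
    print(f"{label}: one storage NIC feeds about {ceiling:.2f} PFLOP/s")

# Hypothetical example: sustaining 4 PFLOP/s at 22 GB/PFLOP.
needed = required_read_bandwidth(22.0, 4.0)
print(f"4 PFLOP/s at 22 GB/PFLOP needs about {needed:.0f} GB/s of KV reads")
```

Under these assumptions, a single storage NIC caps useful compute at a low single-digit PFLOP/s figure for the most cache-efficient model and well under 1 PFLOP/s for the least efficient one, which is why storage ingress rather than accelerator FLOPS can become the binding constraint.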
HBM demand should therefore be viewed as structurally positive but not the only memory lever. Micron announced HBM4 36GB 12H volume shipment in 1Q26 for NVIDIA Vera Rubin, with greater than 2.8TB/s bandwidth, over 11Gb/s pin speed, and more than 20% power-efficiency improvement versus its HBM3E; it also sampled 48GB 16H HBM4, increasing capacity per HBM placement by 33% versus 36GB 12H. Samsung announced HBM4 mass production with 11.7Gbps transfer speed, up to 13Gbps capability, 3.3TB/s maximum single-stack bandwidth, and future 16-layer capacity up to 48GB. SK hynix reported record 1Q26 results driven by high-value-added HBM, high-capacity server DRAM modules, and eSSDs, and stated that agentic AI and real-time inference expand demand across DRAM and NAND. These disclosures indicate that all 3 major DRAM vendors are positioning next-generation AI platforms as memory-rich systems rather than simple compute accelerators.
The key bear-case caveat for HBM is algorithmic compression. Multi-query attention, grouped-query attention, multi-head latent attention, lower-precision KV, paging, prefix sharing, and speculative techniques can reduce KV bytes per token or the effective KV footprint per request. The paper itself shows large differences in cache-compute ratio across model architectures, which implies that model design can partially offset infrastructure demand. However, the likely secular offset is incomplete. Agent count, context length, tool-use depth, persistent memory, multimodal data, and automated workloads can grow faster than bytes-per-token optimizations. The net effect is likely a larger, more tiered memory hierarchy rather than a collapse in HBM intensity.
6. DRAM and Host Memory Implications
The paper is more mixed for host DRAM than for NAND or HBM. On the bullish side, agentic inference increases the need for host-side buffers, scheduler metadata, prefix tries, request state, queueing, spillover cache, CPU-side orchestration, and data staging. On the bearish side, the paper explicitly positions SSD-backed KV storage as necessary because DRAM pools are costly and insufficient for the largest working sets. The appendix is revealing: DualPath uses 80GB DRAM per node for DeepSeek models, while a multi-copy SGL-style DRAM baseline uses 1.5TB DRAM per node; for Qwen32, the appendix lists 320GB DRAM per node. If generalized, this is negative for architectures that assume brute-force DRAM pools for all reusable KV, but positive for architectures that use DRAM as a disciplined hot-cache and staging tier above SSD.
DRAM demand will still rise because agentic AI broadens server memory intensity beyond accelerator HBM. Micron stated that data-center DRAM plus NAND bit TAM will exceed 50% of industry TAM in calendar 2026, that traditional server demand is robust due to agentic AI workloads and server refresh, and that both AI and traditional server demand are constrained by DRAM and NAND supply. SK hynix reported record 1Q26 operating margin of 72% and highlighted demand for high-capacity server DRAM modules alongside HBM and eSSDs. Samsung’s 1Q26 release highlighted all-time-high Memory revenue and profit, high-value AI demand, and SOCAMM2 for NVIDIA Vera Rubin. These statements are consistent with DualPath’s view that host memory remains a critical system tier even when persistent KV cache sits on SSD.
SOCAMM and LPDDR-style server memory become especially relevant in this architecture. The paper’s cluster uses CPU-side traffic management, DRAM buffers, and storage read/write orchestration; large-scale agentic inference requires low-power, high-capacity memory near CPUs and DPUs for state management. Micron’s SOCAMM2 disclosure is directly aligned with this direction: it announced 192GB SOCAMM2 in high-volume production for NVIDIA Vera Rubin and a portfolio spanning 48GB to 256GB capacities. Micron’s prepared remarks also noted sampling of an industry-first 256GB LP SOCAMM2 on the 1γ node, enabling 2TB per CPU. These products are not substitutes for HBM; they are intermediate memory layers that support higher node-level state density and lower power per bit than conventional server DIMM approaches in tightly integrated AI systems.
CXL memory pooling is likely to be complementary but not a complete answer. DualPath shows that persistent KV working sets can become very large and that the access pattern requires high aggregate read bandwidth, low tail latency, and careful QoS. CXL-attached memory can help with hot state, host-side cache expansion, and disaggregated memory pools, but the economics of storing multi-agent, long-context KV entirely in DRAM remain challenging. The more realistic architecture is tiered: HBM for active compute, local DRAM and SOCAMM for staging and hot metadata, CXL DRAM for expandable hot/warm memory when latency requirements justify it, SSD for persistent high-capacity KV, and HDD/object storage for colder datasets and generated artifacts. Marvell’s discussion of scale-up PCIe fabrics, Astera’s focus on rack-scale AI connectivity, and Marvell’s Celestial AI acquisition thesis around optical scale-up interconnects all fit the broader trend toward memory and storage disaggregation.
DualPath should be positioned as complementary to, not a clean replacement for, DRAM KV-cache systems. Mooncake-style distributed DRAM pools and TokenLake-style prefix cache pools target the same broad problem: avoiding recomputation in long-context serving. The difference is economic and architectural. DRAM pools are attractive for hot, latency-critical cache segments, but become expensive and capacity-constrained when working sets expand across long-running, high-concurrency agents. DualPath targets the storage backend directly and uses scheduling to balance traffic across all storage NICs, materially reducing DRAM dependence. The paper states DualPath can be combined with a middle DRAM cache, but the incremental performance gain is marginal in its setup. The bull case for eSSD is not that DRAM disappears, but that DRAM alone is not an economical persistent KV tier at scale.
7. NAND and Enterprise SSD Implications
NAND/eSSD is the most differentiated hardware implication of the paper. DualPath assumes external SSD-based distributed storage for KV cache because the persistent cache is too large for HBM and too costly for pure DRAM pools. That means enterprise SSDs move from a support role into the inference data path. The storage system must deliver high read bandwidth, high IOPS, low tail latency, fine-grained block access, high queue depth, strong QoS, efficient write persistence, predictable garbage collection, and integration with GPU-direct or RDMA-based data movement. A commodity client SSD optimized for burst bandwidth and consumer cost is not sufficient. The paper’s design points toward AI-optimized eSSDs with PCIe Gen5/Gen6, large capacity, QLC density for read-heavy data, TLC or storage-class memory for latency-sensitive tiers, and controllers tuned for many small concurrent reads.
The paper separates two AI storage theses that are often conflated. The first is generic AI data growth: training corpora, checkpoints, logs, retrieval data, synthetic data, and generated artifacts. That thesis supports HDD, object storage, and cold/warm capacity. The second is inference-critical KV-cache storage: persistent, read-heavy, latency-sensitive state that must feed GPUs under SLO. DualPath is much more directly tied to the second thesis. HDD vendors can benefit from the absolute expansion of AI data, but NAND/eSSD suppliers, SSD controllers, NIC vendors, and fabric vendors have cleaner exposure to the live inference bottleneck that DualPath identifies.
The KV-cache write path also matters. The paper persists generated KV when a full block is available, using a block size example of 64 tokens. This creates a write stream that is smaller than the repeated read stream but still operationally important. Flash endurance, write amplification, block layout, garbage collection, and FTL behavior can influence tail latency. QLC can be attractive because KV cache is largely read-reused after being written, and because capacity density lowers cost and power per TB. However, QLC endurance and write amplification must be managed carefully if workloads create high churn, frequent invalidation, or many partial-block writes. This points to a bifurcated SSD opportunity: high-density QLC eSSDs for large read-mostly KV and AI datasets, and specialized high-IOPS/low-latency SSDs for finer-grained GPU-accessible expansion tiers.
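The write-path arithmetic can be sketched as follows. The 64-token block size and the 429-token and 32.7K-token averages come from the figures cited in this report. The carry-over of a partial block between rounds and the function names are assumptions made for illustration.

```python
BLOCK_TOKENS = 64  # block size example cited from the paper


def blocks_to_persist(pending_tokens: int, new_tokens: int,
                      block_tokens: int = BLOCK_TOKENS) -> tuple:
    """Return (full blocks ready to write, tokens left waiting for the next round)."""
    total = pending_tokens + new_tokens
    return total // block_tokens, total % block_tokens


def read_write_token_ratio(context_tokens: int, append_tokens: int) -> float:
    """Tokens of KV read per token of KV newly written in one round."""
    return context_tokens / append_tokens


# Round 1: a 429-token append with nothing pending.
written_1, pending_1 = blocks_to_persist(0, 429)
# Round 2: the leftover partial block is completed by the next append (assumed behavior).
written_2, pending_2 = blocks_to_persist(pending_1, 429)

print(written_1, pending_1)   # full blocks written, tokens carried over
print(written_2, pending_2)
print(f"read/write token ratio: {read_write_token_ratio(32_700, 429):.0f}x")
```

The ratio illustrates why the write stream is small relative to reads, while the leftover tokens show where partial-block handling could create the extra write churn that matters for QLC endurance.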
The block-layout design strengthens the argument that AI SSDs will be differentiated by firmware, queueing, and layout behavior. Layerwise prefill shrinks each transferred KV block to roughly a single layer’s share of the original but multiplies the number of blocks by roughly the number of layers, creating many fine-grained transfers. DualPath uses Full Blocks for storage and Layer Blocks for layerwise movement into HBM. A Full Block stores all layers for a token block, while a Layer Block stores a single layer. KV cache is organized in a trie where each node corresponds to a Full Block. This design avoids manual KV memory-layout conversion and supports prefix reuse, but it also implies that storage systems must handle large numbers of fine-grained, latency-sensitive reads. The implication is positive for enterprise SSDs and controllers optimized for high queue depth, small-block consistency, low tail latency, telemetry, and workload-specific firmware.
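A minimal sketch of this layout, assuming a trie keyed by 64-token blocks, is shown below. The class and function names are hypothetical and the structure is a simplification for illustration, not the paper's implementation. The layer-major read order is likewise an assumption chosen so that the first layer's data for every block arrives first.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BLOCK_TOKENS = 64  # tokens per Full Block (example size cited from the paper)


@dataclass
class FullBlockNode:
    """One trie node = one Full Block (all layers of KV for one token block)."""
    block_id: Optional[int] = None
    children: Dict[Tuple[int, ...], "FullBlockNode"] = field(default_factory=dict)


class PrefixTrie:
    def __init__(self) -> None:
        self.root = FullBlockNode()
        self._next_id = 0

    def _full_block_keys(self, tokens: List[int]):
        # Only complete blocks are addressable; a trailing partial block is skipped.
        usable = len(tokens) - len(tokens) % BLOCK_TOKENS
        for i in range(0, usable, BLOCK_TOKENS):
            yield tuple(tokens[i:i + BLOCK_TOKENS])

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for key in self._full_block_keys(tokens):
            if key not in node.children:
                node.children[key] = FullBlockNode(block_id=self._next_id)
                self._next_id += 1
            node = node.children[key]

    def match(self, tokens: List[int]) -> List[int]:
        """Return IDs of cached Full Blocks covering the longest shared prefix."""
        node, hits = self.root, []
        for key in self._full_block_keys(tokens):
            child = node.children.get(key)
            if child is None:
                break
            hits.append(child.block_id)
            node = child
        return hits


def layer_block_reads(full_block_ids: List[int], n_layers: int) -> List[Tuple[int, int]]:
    """Expand Full Blocks into (layer, block) reads, layer by layer."""
    return [(layer, block) for layer in range(n_layers) for block in full_block_ids]


trie = PrefixTrie()
trie.insert(list(range(200)))                       # caches 3 full blocks (192 tokens)
hits = trie.match(list(range(200)) + [999] * 100)   # same prefix, new suffix
reads = layer_block_reads(hits, n_layers=4)
print(hits, len(reads))
```

The read count is the number of matched blocks times the number of layers, which is the fan-out that the storage tier has to absorb as small concurrent reads.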
Current supplier roadmaps increasingly validate this interpretation. Micron stated that NAND bit demand is accelerating from vector databases and KV-cache offload, that data-center SSDs are moving from performance to capacity, that it is in high-volume production of G9 NAND-based PCIe Gen6 high-performance data-center SSDs, and that its 122TB high-capacity SSD delivers 16x sequential read throughput per watt versus capacity-matched HDD. Micron also announced the 9650 PCIe Gen6 data-center SSD in high-volume production, delivering up to 2x the read performance of Gen5 at 100% higher performance per watt and optimized for agentic AI workloads on NVIDIA BlueField-4 STX architecture. Samsung stated that it is developing PCIe Gen6 SSDs and products focused on KV-cache storage demand. Kioxia announced a Super High IOPS SSD enabling GPUs to directly access high-speed flash as an expansion to HBM, with 512-byte fine-grained access and evaluation samples expected by the end of 2026.
The market backdrop is already tight. TrendForce reported 4Q25 NAND supplier revenue of $21.17B, up 23.8% QoQ, with enterprise SSD demand driven by AI server deployment, HDD shortages, and longer HDD lead times accelerating the shift to NAND. TrendForce projected 1Q26 NAND prices up 85-90% QoQ. The same report put 4Q25 NAND revenue share at 28% for Samsung and 22.1% for SK Group, with Micron and SanDisk each at approximately $3.03B of revenue (roughly 14% of the $21.17B total). TrendForce also noted supplier targeting of high-capacity QLC enterprise SSDs such as 122TB and 245TB for generative AI workloads. This pricing environment magnifies operating leverage for NAND suppliers but also raises cyclicality risk if customer inventory builds ahead of sustainable end demand.
Solidigm’s role is important because it gives SK hynix a differentiated high-density QLC enterprise SSD asset. Solidigm’s D5-P5336 offers capacities up to 122.88TB and targets data-intensive, read-intensive workloads. Solidigm also argues that high-density QLC can reduce storage racks by up to 9:1 versus hybrid HDD/TLC arrays under its modeled comparison, and its 100MW AI data-center study claims QLC SSDs can be up to 19.5% more power efficient than TLC SSDs and up to 79.5% more power efficient than hybrid TLC/HDD storage when isolating storage power. These claims are vendor-produced and should be discounted for commercial bias, but the underlying direction is consistent with hyperscale constraints: power, floor space, and rack density are now binding variables, not just storage device ASP.
SanDisk has become a much cleaner NAND/eSSD equity after Western Digital completed the separation of the flash business on February 24, 2025. SanDisk’s recent results show extreme operating leverage to this cycle: Q3 FY26 revenue was $5.95B, up 97% QoQ and 251% YoY, with Datacenter revenue up 233% QoQ and 645% YoY, gross margin of 78.4%, operating income of $4.111B, and Q4 FY26 guidance of $7.75B to $8.25B revenue and non-GAAP EPS of $30 to $33. The company also stated that it signed 3 NBM agreements and expected 2 more in fiscal Q4. This is the most direct public evidence that AI-driven NAND demand is being contracted differently than prior commodity cycles, although the durability of margins at this level is inherently uncertain.
The NAND risk is that the same DualPath-style optimization that increases the strategic importance of SSDs can reduce brute-force capacity needs per unit of inference. Efficient scheduling, layerwise prefill, KV compression, better prefetching, and model architectures with lower KV bytes per token can reduce the amount of flash required per agent. However, that risk is likely more than offset if agentic workloads proliferate across coding, enterprise automation, research agents, autonomous workloads, and multi-agent simulations. The more important risk is supply discipline. NAND historically suffers from boom-bust behavior because bit supply additions can overshoot demand. Long-term supply agreements, AI-specific eSSD qualification, controller complexity, power constraints, and high-capacity product differentiation may reduce but not eliminate that cyclicality.
8. HDD Implications
HDDs are not the correct medium for the paper’s live KV-cache path. The paper’s system needs storage reads that can saturate a 400Gbps storage NIC, interact with RDMA-based GPU data paths, and feed prefill engines under tight latency and throughput constraints. HDDs cannot deliver the required random-read latency, IOPS density, or fine-grained access behavior for that tier. Therefore, DualPath is not directly bullish for HDD as a live inference memory-extension medium. It is more likely to shift hot/warm inference state toward eSSD and away from any disk-based active tier.
The indirect HDD implication is still positive because agentic AI creates durable data. Tool traces, logs, embeddings, retrieval corpora, checkpoints, synthetic data, generated code, enterprise documents, observability records, multimodal artifacts, audit trails, and cold KV snapshots all require persistent storage. For colder and larger data, nearline HDD remains advantaged on $/TB and often on deployed supply-chain familiarity. Western Digital’s current HDD-focused narrative is aligned with this indirect demand: after completing the flash separation, the company now emphasizes HDD capacity, TCO, and durable AI workload data, and management stated that virtually every AI workload, from training to inference to agentic AI and physical AI, creates data stored persistently and cost-efficiently on HDDs.
Seagate is similarly exposed to the bulk-data side rather than the hot-KV side. Seagate’s FQ3 2026 revenue was $3.112B versus $2.160B in the prior year, with GAAP gross margin of 46.5% and non-GAAP gross margin of 47.0%. Management described a structural growth era as AI amplifies data creation and sustained storage demand, and emphasized areal density as the path to higher-capacity, energy- and capital-efficient storage. This is directionally positive for nearline HDD demand if hyperscalers continue to build large cold and warm storage repositories, but it is a different thesis from the DualPath SSD thesis.
The key competitive dynamic is tiering. HDD remains best suited to large, colder, sequentially accessed data pools. QLC SSDs increasingly attack workloads where power, rack density, bandwidth, and latency justify a higher $/TB. TrendForce explicitly cited HDD shortages and longer lead times as factors accelerating the shift to NAND, while Solidigm’s vendor study argues dense QLC can improve storage power efficiency versus hybrid HDD configurations at high capacities. This does not imply broad HDD obsolescence; the absolute amount of AI data can grow enough for both HDD and NAND to expand. It does imply that HDD vendors benefit most from cold-data proliferation, while NAND vendors benefit more from performance-sensitive AI storage tiers.
9. Networking and Data Transfer Implications
DualPath turns the network into an extension of the memory hierarchy. The paper depends on GPUDirect RDMA through paired compute NICs, InfiniBand virtual lanes, QoS prioritization, congestion isolation, and CNIC-centric traffic management. The authors explicitly avoid relying on GPU copy-engine paths for all I/O because GPUDirect Storage and CUDA copy-engine activity can interfere with latency-sensitive collectives. They also cite a per-operation submission overhead of roughly 1µs for RDMA Write versus 5-7µs for cudaMemcpyAsync. This is a crucial point for the networking value chain: under agentic inference, the difference between high raw bandwidth and deterministic, low-jitter, QoS-managed bandwidth becomes economically visible in GPU utilization.
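The per-operation overhead matters because layerwise movement multiplies the number of transfers. The sketch below assumes one submission per layer block, a 64-token block, a hypothetical 61-layer model, and no batching or overlap. Those are simplifications for illustration; doorbell batching would lower the RDMA figure further.

```python
import math


def submission_overhead_ms(context_tokens: int, block_tokens: int,
                           n_layers: int, per_op_us: float) -> tuple:
    """Return (number of layer-block submissions, total submission overhead in ms)."""
    n_ops = math.ceil(context_tokens / block_tokens) * n_layers
    return n_ops, n_ops * per_op_us / 1000.0


CONTEXT_TOKENS = 32_700   # average context in the reported trace
BLOCK_TOKENS = 64         # block size example cited from the paper
N_LAYERS = 61             # assumed layer count, for illustration only

n_ops, rdma_ms = submission_overhead_ms(CONTEXT_TOKENS, BLOCK_TOKENS, N_LAYERS, per_op_us=1.0)
_, memcpy_low_ms = submission_overhead_ms(CONTEXT_TOKENS, BLOCK_TOKENS, N_LAYERS, per_op_us=5.0)
_, memcpy_high_ms = submission_overhead_ms(CONTEXT_TOKENS, BLOCK_TOKENS, N_LAYERS, per_op_us=7.0)

print(f"{n_ops} submissions per request")
print(f"RDMA Write path:      ~{rdma_ms:.0f} ms of submission overhead")
print(f"cudaMemcpyAsync path: ~{memcpy_low_ms:.0f}-{memcpy_high_ms:.0f} ms of submission overhead")
```

Even as a serial worst case, the gap shows how a few microseconds per operation can become a material share of time-to-first-token when a single request fans out into tens of thousands of small transfers.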
The networking implication is less about peak bandwidth and more about traffic-class control. DualPath deliberately routes H2D/D2H KV-cache traffic through the paired compute NIC so the NIC and fabric can enforce QoS. For InfiniBand, model-execution collectives are placed on a high-priority virtual lane, while KV-cache transfers are placed on a lower-priority virtual lane. The paper reserves roughly 99% of bandwidth for high-priority inference communication and leaves the lower-priority lane to opportunistically use otherwise idle bandwidth. The same concept can extend to RoCE using DSCP markings, traffic classes, PFC, and hardware queues. This is a direct argument for NICs, switches, and fabric software that can provide deterministic traffic isolation, not just headline port speed.
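The bandwidth-sharing behavior described above can be approximated with a simple fluid model. This is a sketch of weighted, work-conserving sharing between two traffic classes, not a model of actual InfiniBand virtual-lane arbitration; the function name and the 400-unit link capacity are illustrative assumptions.

```python
def share_link(capacity: float, high_demand: float, low_demand: float,
               high_weight_pct: int = 99) -> tuple:
    """Split link capacity between high-priority (collectives) and low-priority (KV) traffic."""
    high_guaranteed = capacity * high_weight_pct / 100
    low_guaranteed = capacity - high_guaranteed

    # Each class first receives up to its guaranteed share.
    high = min(high_demand, high_guaranteed)
    low = min(low_demand, low_guaranteed)

    # Work conservation: leftover capacity goes to unmet demand, high priority first.
    spare = capacity - high - low
    extra_high = min(high_demand - high, spare)
    high += extra_high
    spare -= extra_high
    low += min(low_demand - low, spare)
    return high, low


# Collectives are mostly idle: KV traffic soaks up the spare bandwidth.
idle_case = share_link(400, high_demand=100, low_demand=1000)
# Collectives saturate the link: KV traffic is held to its small residual share.
busy_case = share_link(400, high_demand=1000, low_demand=1000)
print(idle_case, busy_case)
```

The two cases capture the economic point: KV-cache movement gets most of the link when model traffic is quiet and almost none when it is not, so realized storage throughput depends on how bursty the collectives are.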
Data-Movement Stack: Why Networking and Software Matter. The table below translates the networking thesis into the concrete layers required for agentic KV-cache serving.
| Layer | Function | Why It Matters for KV-Cache Serving | Signal |
|---|---|---|---|
| RDMA / InfiniBand / RoCE | Low-latency remote memory movement across nodes and fabrics. | Enables decode-side or remote storage bandwidth to be used without turning every KV load into a CPU-mediated copy path. | HIGH |
| GPUDirect Storage | Bypasses CPU bounce buffers for storage-to-GPU or storage-adjacent data paths. | Reduces latency, host overhead, and jitter when KV blocks move between storage and accelerator memory. | HIGH |
| NIXL | Abstraction layer for moving inference data across HBM, CPU memory, SSD, object/file/block storage, and heterogeneous fabrics. | Turns KV-cache movement into a framework-level primitive that can span vLLM, SGLang, Dynamo, and storage plugins. | HIGH |
| QoS / virtual lanes | Traffic-class isolation for KV-cache movement versus latency-critical collectives and model execution traffic. | Prevents storage offload from improving average TTFT while damaging P99 ITL or collective latency. | HIGH |
| NVLink / NVSwitch / Spectrum / Quantum | Intra-node and cluster-scale GPU/network fabric for high-bandwidth, low-latency data movement. | Makes the rack or cluster the real inference system, not an isolated GPU server. | HIGH |
This is why the networking read-through should be framed as deterministic data movement rather than generic bandwidth. Average throughput is necessary, but agentic serving also needs low tail latency, traffic-class isolation, and enough software visibility to avoid turning cache offload into SLO variance.
Networking Features That Matter for Agentic KV-Cache Serving. The table below summarizes why DualPath is a QoS, NIC, and fabric-control thesis, not only a peak-bandwidth thesis.
| Feature | Paper Detail | Why It Matters | Investment Read-Through | Priority |
|---|---|---|---|---|
| InfiniBand virtual lanes | Model inference traffic is assigned to a high-priority VL; KV-cache traffic is assigned to a lower-priority VL. | Prevents KV movement from degrading latency-sensitive collectives. | Supports NVIDIA InfiniBand/Spectrum-X system value and differentiated fabric QoS. | HIGH |
| 99% high-priority bandwidth reservation | Weighted round-robin reserves roughly 99% of bandwidth to high-priority model traffic, with residual bandwidth for KV traffic. | KV traffic uses spare capacity instead of competing with critical model execution. | Points to scheduler-aware fabric control rather than raw bandwidth alone. | HIGH |
| RoCE extension path | Paper says principles extend to RoCE using traffic class, DSCP markings, hardware queues, and PFC. | Broadens relevance from InfiniBand-only clusters to Ethernet AI fabrics. | Positive for Broadcom, Arista, Marvell, merchant Ethernet, and UEC-aligned ecosystems. | HIGH |
| CNIC-assisted H2D/D2H | KV is loaded into host DRAM, then moved to GPU through RDMA Write via paired CNIC; generated KV follows a symmetric D2H/storage path. | Makes the NIC the QoS enforcement point for GPU PCIe traffic. | Raises value of SmartNIC/DPU/NIC software and PCIe topology. | HIGH |
| Small-transfer overhead | cudaMemcpyAsync submission overhead is roughly 5-7µs; RDMA Write submission is roughly 1µs and can benefit from doorbell batching. | Fine-grained layer-block transfers need low per-operation overhead. | Positive for RDMA stack quality, NIC firmware, and GPU-adjacent data-movement software. | MED |
The paper’s topology also changes how storage networking should be valued. A conventional storage fabric attaches data to the node that requested it. DualPath instead treats decode nodes’ storage NICs as cluster-level bandwidth resources and then uses the compute fabric to reposition data. This raises the value of NICs, switches, cables, optics, congestion control, telemetry, and scheduler-aware routing. It also creates a counterintuitive near-term implication: optimized software can reduce the need for incremental storage NIC overbuild by monetizing idle decode-side bandwidth. Longer term, however, the same architecture increases the importance of higher-bandwidth compute fabrics because the compute network becomes the redistribution layer for storage-ingested KV.
NVIDIA is directly exposed because the paper’s assumed cluster resembles NVIDIA AI factory architecture: GPU-direct data movement, 400Gbps-class NICs, InfiniBand/RoCE capabilities, and software-defined QoS are central. NVIDIA’s Quantum-X800 InfiniBand platform provides 144 ports of 800Gb/s per switch, SHARP v4, adaptive routing, telemetry-based congestion control, and performance isolation. NVIDIA’s ConnectX-8 SuperNIC offers up to 800Gb/s of total network bandwidth, while ConnectX-9 reaches up to 1.6Tb/s per GPU. NVIDIA Spectrum-X Ethernet emphasizes performance isolation, deterministic performance, advanced RoCE extensions, storage-fabric use cases, and scaling to 128K GPUs in two tiers under its multiplane architecture. These capabilities map closely to DualPath’s requirement for fast, predictable, GPU-adjacent data movement.
Broadcom is exposed through Ethernet switch ASICs, CPO, and open AI networking. Broadcom’s Tomahawk 6 Davisson announcement describes a 102.4Tbps co-packaged-optics Ethernet switch designed for AI networking, 200Gbps per link, claimed 70% optical interconnect power reduction versus traditional pluggable solutions, and support for scale-up cluster size of 512 XPUs and up to 100,000+ XPUs in 2-tier networks. DualPath is not specifically a Broadcom architecture, and the paper uses InfiniBand virtual lanes in its implementation discussion, but the broader migration of storage and memory traffic onto high-performance Ethernet fabrics is favorable for suppliers that can deliver lossless transport, congestion control, low latency, power efficiency, and open interoperability.
Marvell is exposed through optical DSPs, PCIe/CXL switching, custom silicon, and scale-up interconnect. Marvell describes AI data-center connectivity across scale-up, scale-out, and scale-across infrastructure, including DSPs, SerDes, switching, interconnects, drivers, TIAs, telemetry, and 1.6T optical modules. Its Structera S PCIe switch roadmap includes a PCIe 6.0 260-lane design with over 4TB/s aggregate bidirectional bandwidth, targeting low-latency multi-XPU scale-up architectures. Marvell’s announced acquisition of Celestial AI is also strategically relevant because Celestial’s Photonic Fabric is positioned for package-, system-, and rack-level optical connectivity and pooled-memory-style applications over time. DualPath’s dataflow is not necessarily optical scale-up, but the direction is the same: AI infrastructure bottlenecks are moving toward bandwidth, latency, reach, and power efficiency across memory and storage fabrics.
Astera Labs is more indirectly exposed through rack-scale AI connectivity, PCIe retiming, CXL, and interoperability. The paper’s system requires reliable multi-component data movement across GPUs, CPUs, NICs, storage devices, and fabrics. Astera’s positioning that “the rack is now the unit of compute,” plus its emphasis on open-standard silicon hardware/software and cloud-scale interoperability across hosts, endpoints, and memory vendors, is directionally consistent with this shift. The direct revenue capture depends on platform design wins in PCIe, CXL, retimers, smart cable modules, and memory expansion architectures rather than on KV-cache software itself.
The optical interconnect chain should also be considered. DualPath increases east-west data movement and makes storage-to-GPU bandwidth part of the inference-critical path. As clusters scale, copper reach, power, and signal integrity become binding. NVIDIA, Broadcom, and Marvell are all emphasizing silicon photonics or CPO because pluggable optics and copper can become limiting at 800G, 1.6T, and future data rates. This is positive for optical DSPs, lasers, transceivers, CPO packaging, linear pluggable optics, and active electrical cable suppliers, but the timing and margin capture will vary by architecture and hyperscaler purchasing model.
10. Software, Filesystem, and Scheduling Implications
The paper is as much a systems-software paper as a hardware paper. DualPath depends on a traffic manager that understands multiple data movement paths, a scheduler that assigns prefill/decode pairs and read paths, and a storage system capable of high-throughput, fine-grained reads without relying on a large DRAM cache. The paper’s implementation uses 3FS with an io_uring-like interface and explicitly states that the storage layer has no internal DRAM cache yet can still saturate the 400Gbps storage NIC. This matters because hardware alone does not create the result. An expensive SSD tier can still underperform if the filesystem, queueing, block layout, and DMA path create latency amplification.
Block layout is an underappreciated value driver. The appendix describes Full Blocks and Layer Blocks, designed to avoid manual KV-cache memory-layout conversion, and stores KV cache in trie nodes corresponding to Full Blocks. This structure is important for prefix reuse, block-level deduplication, and efficient prefill streaming. For storage vendors and controller vendors, the implication is that KV-cache-aware layouts may become a workload target, similar to how databases shaped enterprise SSD QoS and write-endurance requirements. SSD firmware, filesystem layout, and serving runtime could increasingly be co-designed around token blocks, layer slices, prefix tries, and GPU DMA granularity.
The scheduler’s role also affects hardware demand. DualPath’s scheduler decides whether a request should read through the PE path or DE path based on queue lengths and balances attention execution times using token counts as a proxy. This means realized demand for storage and networking components depends on software maturity. Poor scheduling can make a cluster look storage-starved even when idle bandwidth exists elsewhere. Strong scheduling can improve throughput without additional hardware. For investors, this creates two implications: hardware TAM will be shaped by open-source and hyperscaler inference-runtime sophistication, and vendors that provide vertically integrated hardware plus software stacks may capture a larger share of system value than component vendors with undifferentiated products.
The implementation details reinforce that DualPath is not a commodity SSD story. The scheduler uses token count as a proxy for GPU load, disk-read load, and network load, and separates scheduling into inter-engine assignment and intra-engine batch construction. PE assignment classifies engines by unfinished tokens and disk-read queue length, prioritizing engines on nodes with shorter read queues unless token load is already excessive. DE assignment balances token load across groups, then manages HBM availability within a group. After PE and DE are selected, the system chooses the KV-read side with the shorter read queue. The paper explicitly notes that splitting a single request across both read paths is future work, which would further raise scheduling complexity and potentially improve storage-bandwidth pooling.
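The selection logic described above can be sketched in a few lines. This is a simplified reading of the prose, not the paper's algorithm: the `Engine` fields, the fallback when every engine is overloaded, and the tie-breaking order are all assumptions introduced for illustration. The thresholds `alpha` and `beta` stand in for the short-read-queue and unfinished-token limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Engine:
    name: str
    unfinished_tokens: int   # proxy for GPU load
    read_queue_tokens: int   # proxy for pending disk reads on the engine's node


def pick_prefill_engine(engines: List[Engine], alpha: int, beta: int) -> Engine:
    """Prefer engines with short read queues unless their token load is excessive."""
    not_overloaded = [e for e in engines if e.unfinished_tokens <= beta]
    pool = not_overloaded or engines          # assumed fallback if all are overloaded
    short_queue = [e for e in pool if e.read_queue_tokens <= alpha]
    candidates = short_queue or pool
    return min(candidates, key=lambda e: (e.unfinished_tokens, e.read_queue_tokens))


def pick_decode_engine(engines: List[Engine]) -> Engine:
    """Balance token load; HBM availability checks are omitted in this sketch."""
    return min(engines, key=lambda e: e.unfinished_tokens)


def pick_read_side(pe: Engine, de: Engine) -> str:
    """Read KV through whichever side has the shorter read queue."""
    return "PE" if pe.read_queue_tokens <= de.read_queue_tokens else "DE"


prefill = [Engine("pe-a", 100, 900), Engine("pe-b", 500, 50), Engine("pe-c", 5000, 10)]
decode = [Engine("de-a", 800, 20), Engine("de-b", 1200, 5)]

pe = pick_prefill_engine(prefill, alpha=100, beta=1000)
de = pick_decode_engine(decode)
print(pe.name, de.name, pick_read_side(pe, de))
```

In this example the least-loaded prefill engine is passed over because its node's read queue is long, and the KV read is routed through the decode side because its queue is shorter. That is the storage-bandwidth pooling effect the scheduler is designed to capture.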
The appendix gives three particularly important implementation parameters. DualPath allocates 80GB DRAM per node for DeepSeek models, versus 1.5TB DRAM per node for the SGL(MC) DRAM-heavy baseline; Qwen32 requires 320GB DRAM per node because of larger KV-cache size. The short reading-queue threshold α is set to the number of tokens readable in 33 seconds, while the unfinished-token limit β is set to the number of tokens one GPU can compute in 55 seconds. The compute-quota threshold is 300ms across DualPath and Oracle baselines. These parameters underscore that the system’s advantage depends on profiling, queue control, and runtime-specific tuning, not only on buying faster SSDs.
11. Company Implications: Micron
| Company / Group | Clean Exposure | Key Evidence | Main Risk | Priority |
|---|---|---|---|---|
| Micron | U.S.-listed broad memory stack: HBM4, SOCAMM2, DRAM, NAND, and PCIe Gen6 SSDs. | FQ2 FY26 revenue of $23.86B, 74.9% non-GAAP gross margin, DRAM revenue up 207% YoY, NAND revenue up 169% YoY, and FQ3 revenue guide of $33.5B plus or minus $750M. | Memory cycle and capex execution. | HIGH |
| SK hynix / Solidigm | HBM leadership plus high-capacity server DRAM and Solidigm QLC eSSD exposure. | 1Q26 operating margin of 72%; management highlighted HBM, server DRAM, eSSDs, agentic AI, and real-time inference demand. | Peak-margin normalization. | HIGH |
| Samsung Electronics | Broad vertical integration across HBM, DRAM, NAND, eSSD, base dies, foundry, and packaging. | 1Q26 Device Solutions operating profit of KRW 53.7T; HBM4 mass production and PCIe Gen6 eSSD development for KV-cache demand. | HBM execution and mix dilution. | HIGH |
| SanDisk | Highest-beta NAND/eSSD pure play after Western Digital flash separation. | Q3 FY26 revenue of $5.95B, Datacenter revenue up 645% YoY, gross margin of 78.4%, and Q4 FY26 revenue guide of $7.75B to $8.25B. | NAND cyclicality. | HIGH |
| Western Digital / Seagate | Nearline HDD exposure to cold and bulk AI data growth. | Western Digital and Seagate reported strong FY26 storage results and emphasized persistent AI data creation. | Not a live KV-cache medium. | MED |
| NVIDIA / Broadcom / Marvell / Astera / Arista / Optics | AI fabrics, RDMA, Ethernet/InfiniBand, PCIe/CXL, optical DSPs, CPO, high-speed links, and rack-scale connectivity. | DualPath requires deterministic GPU-adjacent data movement, storage ingress, QoS, and low tail latency. | Architecture-specific share capture. | HIGH |
Micron is one of the cleanest public equities for the DualPath thesis because it has exposure across HBM, server DRAM, SOCAMM, NAND, and data-center SSDs. Its FQ2 FY26 results show the scale of the current memory upcycle: revenue was $23.86B versus $13.64B in the prior quarter and $8.05B in the year-ago period, GAAP gross margin was 74.4%, non-GAAP gross margin was 74.9%, GAAP operating income was $16.135B, and adjusted free cash flow was $6.9B. DRAM revenue was $18.8B, up 207% YoY and 79% of revenue, while NAND revenue was $5.0B, up 169% YoY and 21% of revenue. The company guided FQ3 revenue to $33.5B plus or minus $750M and gross margin to roughly 81%.
Micron’s product disclosures map unusually well to the paper’s memory hierarchy. HBM4 addresses active accelerator memory. SOCAMM2 addresses CPU-adjacent high-capacity, low-power memory. PCIe Gen6 SSDs and high-capacity data-center SSDs address the external storage tier that DualPath uses for persistent KV. The company’s own prepared remarks explicitly cite vector databases and KV-cache offload as NAND demand accelerants, and its GTC 2026 announcement states that the 9650 PCIe Gen6 SSD is optimized for agentic AI workloads on NVIDIA BlueField-4 STX architecture. This makes Micron a broad beneficiary if agentic inference shifts procurement from standalone GPUs to balanced memory-plus-storage systems.
The key risks for Micron are cyclical and execution-related. HBM is capacity- and capital-intensive and requires customer qualification and advanced packaging. Micron stated fiscal 2026 capex would be above $25B and fiscal 2027 capex would step up meaningfully, with construction capex increasing by $10B YoY. High capex can be value-accretive if supply remains tight, but it can also amplify downside if memory supply overshoots. Micron also remains exposed to commodity DRAM and NAND pricing, customer concentration in AI platforms, and geopolitical supply-chain constraints. The paper strengthens the strategic rationale for Micron’s AI memory/storage portfolio, but it does not remove memory-cycle risk.
12. Company Implications: SK hynix and Solidigm
SK hynix is arguably the most strategically advantaged company in the paper’s world because it combines HBM leadership, high-capacity server DRAM, and Solidigm enterprise SSD exposure. In 1Q26, SK hynix reported revenue of KRW 52.5763T, operating profit of KRW 37.6103T, net profit of KRW 40.3459T, and operating margin of 72%. The company attributed sustained performance to high-value-added HBM, high-capacity server DRAM modules, and eSSDs, and it explicitly stated that agentic AI and real-time inference expand memory demand across DRAM and NAND. That language is unusually aligned with DualPath’s workload thesis.
Solidigm is a strategically important part of the SK hynix thesis because it provides high-density QLC eSSD capability. The D5-P5336’s 122.88TB capacity targets read-intensive and data-intensive workloads, and high-density QLC maps well to large read-mostly KV and AI data tiers. SK hynix also highlighted NAND progress with 321-layer QLC client SSDs and high-performance TLC/high-capacity QLC eSSDs, plus synergy with Solidigm. If AI inference storage demand continues shifting toward high-capacity, power-efficient SSD tiers, SK hynix’s combined DRAM/HBM/NAND/eSSD position is structurally strong.
The risks are concentration and competitive normalization. SK hynix’s current profitability reflects extreme tightness in HBM and AI memory. Samsung and Micron are investing aggressively in HBM4, SOCAMM, and AI SSDs. If HBM supply becomes less constrained, if NVIDIA or hyperscalers diversify vendor allocation, or if NAND pricing normalizes from current elevated levels, SK hynix’s margin structure could compress. The company also plans materially increased investment for M15X ramp, the Yongin cluster, and EUV, which is appropriate for demand but increases cycle exposure. DualPath is strategically supportive for SK hynix, but the equity risk is that the market may already capitalize a large share of the near-term AI memory scarcity.
13. Company Implications: Samsung Electronics
Samsung has the broadest vertical exposure in the value chain. It participates in HBM, conventional DRAM, server DRAM, SOCAMM2, NAND, eSSD, logic base die, foundry, and advanced packaging. In 1Q26, Samsung reported consolidated revenue of KRW 133.9T and operating profit of KRW 57.2T; its Device Solutions division delivered revenue of KRW 81.7T and operating profit of KRW 53.7T. Samsung stated that Memory achieved all-time-high quarterly revenue and profit, supported by high-value AI demand and limited supply. It also stated that H2 2026 agentic AI should accelerate demand and that it is developing initial PCIe Gen6 eSSDs focused on KV-cache storage demand. This is a direct validation of the paper’s architecture from the largest memory manufacturer by scale.
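The segment figures above also show how concentrated Samsung's current earnings are in Device Solutions. The sketch below is a rough illustration under a stated assumption: the non-DS remainder is derived by simple subtraction, which ignores intersegment eliminations and is therefore only approximate. All names are ours.

```python
# Rounded 1Q26 figures quoted above, in KRW trillions.
consolidated = {"revenue": 133.9, "operating_profit": 57.2}
device_solutions = {"revenue": 81.7, "operating_profit": 53.7}


def margin_pct(segment: dict) -> float:
    """Operating profit as a percentage of revenue."""
    return segment["operating_profit"] / segment["revenue"] * 100.0


# Assumption: remainder = consolidated minus DS, ignoring intersegment eliminations.
rest_of_company = {
    key: consolidated[key] - device_solutions[key] for key in consolidated
}

ds_profit_share = (
    device_solutions["operating_profit"] / consolidated["operating_profit"] * 100.0
)

print(f"Consolidated operating margin:    {margin_pct(consolidated):.1f}%")
print(f"Device Solutions margin:          {margin_pct(device_solutions):.1f}%")
print(f"Implied rest-of-company margin:   {margin_pct(rest_of_company):.1f}%")
print(f"DS share of consolidated profit:  {ds_profit_share:.1f}%")
```

On these inputs Device Solutions accounts for roughly 94% of consolidated operating profit, so near-term earnings sensitivity is dominated by memory even though the equity is not a pure play.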
Samsung’s HBM4 disclosure also matters. The company announced HBM4 mass production, commercial shipments, 11.7Gbps transfer speed, up to 13Gbps capability, 4nm logic base die, 3.3TB/s maximum single-stack bandwidth, 12-layer 24GB to 36GB products, future 16-layer capacity up to 48GB, 40% power-efficiency improvement, 10% lower thermal resistance, and 30% better heat dissipation. If Samsung executes on HBM4 qualification and ramps AI eSSDs, it can participate in both the HBM and SSD sides of the DualPath thesis. Its ability to supply HBM base dies through its foundry ecosystem and to integrate memory, logic, and NAND gives it architectural optionality.
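The quoted per-stack bandwidth follows from the pin speed once an interface width is assumed. The sketch below assumes the 2048-bit per-stack interface defined for HBM4 by JEDEC, which is not stated in the text above; the function name is ours.

```python
# Assumption (not from the text above): HBM4 uses a 2048-bit per-stack interface.
HBM4_INTERFACE_BITS = 2048


def stack_bandwidth_tb_per_s(pin_rate_gbps: float,
                             interface_bits: int = HBM4_INTERFACE_BITS) -> float:
    """Peak per-stack bandwidth in decimal TB/s from the per-pin rate in Gb/s."""
    gigabits_per_second = pin_rate_gbps * interface_bits
    gigabytes_per_second = gigabits_per_second / 8
    return gigabytes_per_second / 1000


for rate in (11.7, 13.0):
    print(f"{rate} Gbps per pin -> {stack_bandwidth_tb_per_s(rate):.2f} TB/s per stack")
```

Under that assumption, the 3.3TB/s maximum corresponds to the 13Gbps capability, while the 11.7Gbps speed implies roughly 3.0TB/s per stack.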
The key risk is execution and mix. Samsung is not a pure-play memory investment; its valuation and earnings also reflect mobile, consumer electronics, foundry, display exposure, and conglomerate-level complexity. In memory, the company must prove sustained execution in leading-edge HBM and AI SSDs against SK hynix and Micron. The upside case is that Samsung regains share as HBM4 and PCIe Gen6 eSSD demand broadens beyond a single supplier. The downside case is that share gains come with lower pricing or margin if the market shifts from acute shortage to broader multi-vendor supply. DualPath is strategically bullish for Samsung’s product breadth, but stock impact depends on HBM qualification, NAND cost competitiveness, and ability to convert architecture into high-margin design wins.
14. Company Implications: SanDisk
SanDisk is the most levered pure-play expression of the NAND/eSSD component of the thesis after Western Digital completed the flash separation in Feb 2025. The paper’s live KV-cache tier is SSD-centric, and SanDisk’s product exposure is concentrated in flash rather than diluted by HDD or DRAM. Its recent financials show extraordinary sensitivity to NAND pricing and datacenter demand: Q3 FY26 revenue was $5.95B, Datacenter revenue was $1.467B, gross margin was 78.4%, operating income was $4.111B, and Q4 revenue guidance was $7.75B to $8.25B. These numbers indicate that AI-related storage demand and contract structures are already translating into large earnings power.
SanDisk’s strategic positioning depends heavily on the Kioxia manufacturing alliance. TrendForce described the Kioxia/SanDisk alliance as anchored by the Yokkaichi and Kitakami fabs and BiCS8 218-layer technology, with future BiCS10 above 300 layers beginning in 2026. TrendForce also estimated Q3 2025 NAND share at 32.3% for Samsung, 19.3% for SK hynix, 15.3% for Kioxia, and 12.4% for SanDisk. This scale matters because enterprise SSD qualification, high-capacity QLC, and AI storage supply agreements require technology roadmaps, controller ecosystems, customer support, and manufacturing reliability.
The risks are high. SanDisk lacks HBM and DRAM exposure, so it is not a balanced memory hierarchy company. It is a NAND-cycle equity with significant upside when flash pricing and datacenter demand tighten, and significant downside if capacity additions, customer inventory, or pricing normalization reverse current margins. The company’s long-term agreements may reduce revenue volatility, but price floors and ceilings cannot eliminate cyclicality if industry bit supply eventually overshoots. For an investment committee, SanDisk should be viewed as a high-beta, direct NAND/eSSD expression of the agentic inference storage thesis rather than a diversified AI infrastructure compounder.
15. Company Implications: Western Digital
Western Digital is now primarily an HDD-capacity and storage-TCO play after the flash separation. Its exposure to DualPath’s hot KV-cache tier is limited because the paper’s architecture requires SSD-backed storage and high-speed network ingestion. However, Western Digital remains relevant to the broader AI data estate. In Q3 FY26, Western Digital reported revenue of $3.34B, up 45% YoY, GAAP gross margin of 50.2%, non-GAAP gross margin of 50.5%, GAAP EPS of $8.20, and non-GAAP EPS of $2.72, with Q4 FY26 revenue guidance of $3.65B plus or minus $100M and gross margin of 51-52%. Management’s AI storage narrative is focused on persistent data created by training, inference, agentic AI, and physical AI.
The core HDD thesis is not invalidated by DualPath; it is narrowed. AI systems generate enormous persistent datasets, but hot inference state is not the same as cold durable storage. Western Digital should benefit if hyperscalers continue building large object stores, data lakes, and archival repositories around AI workloads. It is less directly exposed to the inference-time KV-cache bottleneck that DualPath solves. The most important variables are nearline HDD capacity demand, areal-density execution, HAMR/energy-assisted roadmap competitiveness, supply discipline, and the degree to which high-density QLC SSDs take share in warm data tiers where HDD historically participated.
16. Company Implications: Seagate
Seagate is similarly an indirect beneficiary. The company benefits from AI-driven data creation and from hyperscaler demand for efficient bulk storage. Its FQ3 2026 results show strong cycle conditions, with revenue of $3.112B versus $2.160B in the prior year and non-GAAP gross margin of 47.0%. Management emphasized that AI amplifies data creation and sustains storage demand, and that its areal-density strategy can deliver higher-capacity, more energy- and capital-efficient storage.
DualPath does not create a live KV-cache role for HDDs, but it strengthens the broader argument that AI workloads will create more persistent data at every stage. The stock implication depends on whether investors are underwriting HDD as a secular AI storage beneficiary or as a cyclical nearline supplier enjoying a supply-constrained upturn. The paper supports the former only for cold and bulk tiers, not for the high-performance inference-memory tier. Seagate remains a disciplined-capacity and areal-density execution story rather than a direct agentic inference bottleneck story.
17. Company Implications: Kioxia, Solidigm, SSD Controllers, and Adjacent Suppliers
Kioxia is strategically relevant because it is explicitly developing SSDs that target the gap between HBM and conventional storage. Its Super High IOPS SSD is designed to let GPUs directly access high-speed flash as an expansion to HBM, with finer-grained 512-byte access and lower power per I/O than conventional TLC SSDs. This is one of the clearest examples of a storage vendor productizing the same architectural pressure that DualPath identifies: HBM capacity is limited, while AI workloads are becoming more data-intensive and require GPU-accessible memory expansion.
SSD controller and firmware suppliers are also important even if less visible in public-market screens. Agentic KV storage requires deterministic latency, high queue-depth performance, read-mostly optimization, small-block access, QoS, namespace management, endurance controls, telemetry, and integration with GPU-direct software paths. Controller companies and IP suppliers that can support PCIe Gen6, NVMe enhancements, QLC management, low-latency paths, and AI-specific firmware may capture a larger share of SSD value than in previous commodity client cycles. Marvell has historical SSD controller exposure plus broader data infrastructure silicon, while other controller suppliers and custom ASIC vendors may benefit depending on hyperscaler qualification. The paper implies that SSDs optimized for AI inference may become differentiated products rather than undifferentiated commodity NAND carriers.
18. Company Implications: Networking and Optical Suppliers
NVIDIA benefits on multiple levels. Its GPU platforms remain the compute anchor, its HBM suppliers must support its platform roadmap, and its networking assets become more important as inference bottlenecks move toward data movement. DualPath uses RDMA and GPU-adjacent networking concepts that fit NVIDIA’s system-level strategy. Spectrum-X’s explicit AI storage messaging is particularly relevant because the paper blurs the boundary between compute fabric and storage fabric. However, DualPath also increases GPU utilization, which could reduce GPU units required for a fixed workload. The net effect is still likely positive if lower cost per agent stimulates much higher agentic inference demand, but the paper reinforces that NVIDIA’s moat is increasingly system-level rather than GPU-only.
Broadcom benefits from open Ethernet AI networking, CPO, custom silicon, and switch ASIC scale. DualPath does not require Broadcom specifically, but it requires the kind of lossless, congestion-managed, high-bandwidth fabric Broadcom is targeting. Tomahawk 6 Davisson’s 102.4Tbps CPO switching capacity, power-efficiency claims, and 100,000+ XPU 2-tier scaling target are relevant as agentic inference increases east-west and storage-to-compute traffic. Broadcom’s risk is architectural: some hyperscalers may use NVIDIA InfiniBand/Spectrum-X or proprietary fabrics for the highest-performance paths, while Broadcom’s upside is strongest where Ethernet standardization and merchant silicon win.
Marvell benefits where AI data movement requires custom connectivity, optical DSPs, PCIe/CXL switching, and scale-up fabrics. DualPath’s direct storage path is not the same as Marvell’s Celestial AI optical scale-up thesis, but both are responses to the same bottleneck: data must move farther, faster, with lower latency and lower power as AI systems outgrow a single server or rack. Marvell’s stated portfolio across DSPs, SerDes, switching, interconnects, drivers, TIAs, and telemetry maps well to AI infrastructure where memory and storage are disaggregated. Execution risk remains significant because optical scale-up and new switching products require hyperscaler adoption and long qualification cycles.
Astera Labs benefits indirectly from the rack becoming the unit of compute. The paper’s architecture requires reliable PCIe, CXL, host-endpoint, memory, and networking interoperability across racks and nodes. Astera’s positioning around rack-scale AI, open standards, and cloud-scale interoperability is therefore directionally positive. Arista benefits if Ethernet AI fabrics continue taking share in scale-out inference and storage networking. Credo, Coherent, Lumentum, and other optical/interconnect suppliers benefit if 800G, 1.6T, CPO, and linear optical/electrical links proliferate to support data-movement-heavy AI clusters. The caveat is that component-level value capture will depend on hyperscaler architecture choices, pricing pressure, and whether optics transition from pluggable modules to co-packaged or proprietary platform-integrated designs.
19. Memory Market Structure and Pricing Context
The current memory cycle is exceptionally tight. TrendForce reported 4Q25 DRAM revenue of $53.58B, up 29.4% QoQ, with conventional DRAM contract prices up 45-50% QoQ and blended conventional-plus-HBM prices up 50-55% QoQ. TrendForce expected 1Q26 conventional DRAM prices to rise 90-95% QoQ and blended prices to rise 80-85% QoQ. In 4Q25, Samsung had 36% DRAM revenue share, SK hynix had 32.1%, and Micron had 22.4%. This backdrop matters because the paper’s architecture emerges at a time when memory and storage suppliers have unusually strong pricing power, making incremental AI-driven demand more earnings-sensitive than it would be in a normal cycle.
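The two quarterly price moves compound, which is what makes the cycle unusual. The sketch below is illustrative arithmetic that assumes the 1Q26 expectation is realized and that both quarterly ranges apply to the same price basis; the function name is ours.

```python
def compound_range(quarterly_ranges_pct):
    """Compound (low, high) QoQ percentage ranges into a cumulative price multiple."""
    low_multiple = high_multiple = 1.0
    for low_pct, high_pct in quarterly_ranges_pct:
        low_multiple *= 1.0 + low_pct / 100.0
        high_multiple *= 1.0 + high_pct / 100.0
    return low_multiple, high_multiple


# (low %, high %) per quarter: 4Q25 reported, then 1Q26 expected, as quoted above.
conventional = [(45, 50), (90, 95)]
blended = [(50, 55), (80, 85)]

for label, ranges in (("Conventional DRAM", conventional), ("Blended incl. HBM", blended)):
    low, high = compound_range(ranges)
    print(f"{label}: {low:.2f}x to {high:.2f}x of the 3Q25 price level")
```

On these assumptions, both conventional and blended DRAM contract prices would exit 1Q26 at roughly 2.7x to 2.9x their 3Q25 level, which frames how much normalization risk sits in current margins.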
The NAND market is similarly tight. TrendForce reported 4Q25 top-5 NAND supplier revenue of $21.17B, up 23.8% QoQ, driven by enterprise SSD demand from AI server deployments, HDD shortages, and longer HDD lead times. It projected 1Q26 overall NAND prices up 85-90% QoQ. These numbers make the DualPath thesis financially significant: if agentic inference turns persistent KV cache into a recurring enterprise SSD workload, it can support not only bit demand but also mix improvement toward higher-value data-center SSDs. The risk is that current price increases include panic procurement, inventory prebuying, and temporary shortages rather than purely sustainable steady-state consumption.
The strategic shift from commodity bits to contracted AI infrastructure bits is observable in company behavior. SanDisk’s NBM agreements, Micron’s HBM4/SOCAMM2/PCIe Gen6 platform announcements, Samsung’s KV-cache SSD language, SK hynix’s agentic AI demand commentary, and Kioxia’s GPU-accessible SSD roadmap all point toward a more specialized memory/storage cycle. This does not eliminate cyclicality, but it can lengthen duration, improve mix, and create qualification-based barriers in parts of the market. The highest-quality revenue is likely to come from products that are co-designed into accelerator platforms, inference runtimes, or hyperscale storage architectures rather than from undifferentiated spot NAND or commodity DRAM.
20. Risks and Disconfirming Evidence
DualPath should not be read as proof that every inference workload becomes SSD-bound. It is a high-conviction signal for a specific workload shape: long-running, multi-turn, high-reuse, short-append, high-concurrency agentic inference. The paper’s sensitivity work shows that as append length rises, compute pressure becomes more important; as generation length rises, KV-loading pressure can ease; and as model architectures reduce KV bytes per token, storage intensity can decline. The investment conclusion should therefore be framed as a broadening of AI infrastructure bottlenecks, not as a one-variable NAND supercycle thesis.
Vendor and project benchmarks require a bias haircut. Dell, WEKA, NVIDIA, SGLang, and LMCache all provide high-signal numerical evidence, but most of those figures are vendor or project benchmarks rather than neutral third-party studies. The report should use them as ecosystem validation and architecture direction, not as precise market sizing. The thesis would weaken if these gains fail to reproduce under mixed live traffic, strict P99 latency SLOs, lower prefix reuse, or hyperscaler-specific infrastructure constraints.
The most important disconfirming evidence would not be a single faster GPU or a single larger HBM stack. The real bear case would be a production architecture in which agentic KV working sets remain small enough for HBM/DRAM, or in which model architectures reduce KV bytes per token faster than context length, agent count, and concurrency rise. A second bear case would be inference runtimes that achieve similar throughput through DRAM cache pools, compression, prefix deduplication, speculative prefetch, or remote memory without requiring a large SSD-backed KV tier. A third bear case would be operational: if fabric QoS, storage tail latency, and scheduler tuning are too hard outside top hyperscalers, the opportunity shifts from broad component suppliers toward vertically integrated cloud platforms.
The paper is not a universal benchmark for all inference workloads. It is most applicable to multi-turn agentic inference with high KV-cache reuse, short appends, long accumulated contexts, and significant prefill/decode disaggregation. Workloads with low cache hit rates, short contexts, long generation lengths, low concurrency, or architectures with substantially smaller KV footprints will see less benefit. The paper’s own cache-compute table shows that model architecture strongly influences KV pressure. Therefore, the investment implication should be framed around the adoption curve of long-context agentic workloads, not around all LLM inference.
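The dependence on workload shape can be made concrete with a simplified reuse model. The sketch below is our own approximation, not the paper's methodology: it assumes every previously seen token's KV is cached and reused, that the context figure includes the newly appended tokens, and that trace averages can stand in for a typical turn. The function names are ours.

```python
def implied_hit_rate(context_tokens: float, appended_tokens: float) -> float:
    """Fraction of a turn's context whose KV can be reused rather than recomputed.

    Assumes context_tokens already includes the appended tokens and that all
    earlier tokens are served from cache.
    """
    return (context_tokens - appended_tokens) / context_tokens


def reused_per_new_token(context_tokens: float, appended_tokens: float) -> float:
    """Cached tokens that must be loaded for each newly appended token."""
    return (context_tokens - appended_tokens) / appended_tokens


# Trace averages quoted in the Executive Overview: 32.7K context, 429 appended tokens.
print(f"Agentic trace:  {implied_hit_rate(32_700, 429):.1%} hit rate, "
      f"{reused_per_new_token(32_700, 429):.0f} reused tokens per new token")

# Hypothetical counter-example: same context, much longer append.
print(f"Long append:    {implied_hit_rate(32_700, 4_000):.1%} hit rate, "
      f"{reused_per_new_token(32_700, 4_000):.0f} reused tokens per new token")
```

With the trace averages, this simple ratio lands at about 98.7%, in line with the reported hit rate, while a hypothetical 4,000-token append on the same context drops it to about 88%. That is the sense in which short appends on long contexts, rather than agentic labeling as such, drive the storage-bound result.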
The implementation also assumes sophisticated infrastructure. DualPath requires RDMA, storage and compute path orchestration, QoS isolation, a high-performance distributed storage layer, careful PCIe topology, and scheduler integration with the inference runtime. The paper’s bottleneck-free analysis assumes well-configured PCIe topology, load-balanced scheduling, no compute-network congestion, and fully utilized storage reads. Those assumptions are plausible for top hyperscalers and leading AI labs but less plausible for fragmented enterprise deployments. The result is a likely adoption gradient: hyperscalers and frontier model labs can implement these ideas first, while smaller clouds and enterprises may consume them through integrated vendor platforms or managed inference services.
The performance results are also measured against specific baselines. The paper compares Basic, DualPath, Oracle-style upper bounds, and SGL(MC)-style approaches in its testbed. Other production systems may already use variants of remote KV loading, storage NIC aggregation, prefix-aware scheduling, speculative prefetch, or DRAM cache layers. Therefore, the reported 1.87x and 1.96x gains should not be mechanically applied to all clusters. The stronger conclusion is directional: in high-hit-rate agentic inference, storage bandwidth attached only to prefill nodes becomes a bottleneck, and using decode-side storage NICs plus compute-network RDMA can materially improve utilization.
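The directional point can be illustrated with a toy storage-ingress model. This is a deliberately simplified sketch, not the paper's analysis: it assumes every node has an identical storage NIC, that storage reads are the only bottleneck, and that forwarding KV from decode nodes over the compute network is free. The node counts and NIC speed are hypothetical placeholders, and the function names are ours.

```python
def storage_read_bandwidth(prefill_nodes: int, decode_nodes: int,
                           nic_gbps_per_node: float, use_decode_nics: bool) -> float:
    """Aggregate storage-read bandwidth (Gb/s) available for loading KV cache."""
    reading_nodes = prefill_nodes + (decode_nodes if use_decode_nics else 0)
    return reading_nodes * nic_gbps_per_node


def ingress_gain_upper_bound(prefill_nodes: int, decode_nodes: int) -> float:
    """Upper bound on storage-ingress gain from also reading through decode-side NICs."""
    return (prefill_nodes + decode_nodes) / prefill_nodes


# Hypothetical cluster shapes (prefill nodes, decode nodes); 400 Gb/s NICs are a placeholder.
for prefill, decode in ((1, 1), (2, 1), (1, 2)):
    baseline = storage_read_bandwidth(prefill, decode, 400.0, use_decode_nics=False)
    dual = storage_read_bandwidth(prefill, decode, 400.0, use_decode_nics=True)
    print(f"P:D = {prefill}:{decode}  prefill-only {baseline:.0f} Gb/s, "
          f"both paths {dual:.0f} Gb/s, bound {ingress_gain_upper_bound(prefill, decode):.2f}x")
```

The bound depends entirely on the prefill-to-decode ratio and on the idealized assumptions above, which is another reason the paper's measured gains should be treated as configuration-specific rather than portable constants.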
The paper also does not settle the optimal storage tier. It demonstrates that SSD-backed storage can be made useful for KV-cache serving under the described architecture, but it does not prove that conventional NAND SSDs are always optimal. Some workloads may prefer DRAM cache pools, CXL memory, SCM-like flash, Kioxia-style high-IOPS SSDs, HBF-style future products, or GPU-local capacity increases. The optimal tier will be determined by $/GB, GB/s, IOPS, tail latency, endurance, power, rack density, and software complexity. This is why the value chain implications extend beyond NAND suppliers to controller vendors, CXL/PCIe fabric suppliers, optical interconnect companies, and vertically integrated AI platform vendors.
21. Investment Conclusion, Catalysts, and Watchlist
The paper is structurally bullish for memory, storage, and networking, but the beneficiaries are not uniform. The cleanest incremental thesis is enterprise SSD and NAND, because DualPath explicitly makes SSD-backed KV storage part of the inference-critical path. The next most direct thesis is AI networking and interconnect, because RDMA, QoS, congestion control, and compute-fabric bandwidth determine whether external KV cache can feed GPUs without harming tail latency. HBM remains a powerful secular thesis, but DualPath is more complementary than incremental: it makes HBM utilization higher while confirming that persistent KV working sets must spill beyond HBM. DRAM remains necessary but may be used more efficiently; DRAM-only KV-cache pools face substitution risk from SSD-tiered architectures. HDD remains relevant for cold and bulk AI data, but not for live KV-cache serving.
Micron offers the broadest U.S.-listed exposure to the full hierarchy: HBM4, SOCAMM2, DRAM, NAND, and PCIe Gen6 AI SSDs. SK hynix offers the strongest HBM and eSSD combination through Solidigm, with current profitability already reflecting severe AI memory tightness. Samsung offers unmatched breadth and vertical integration, with upside tied to HBM4 and AI eSSD execution. SanDisk offers the highest-beta direct NAND/eSSD expression and has current numbers that already show dramatic AI storage operating leverage. Western Digital and Seagate are better framed as AI data-growth and nearline HDD beneficiaries, not as direct beneficiaries of the hot KV-cache architecture. NVIDIA, Broadcom, Marvell, Astera, Arista, Credo, Coherent, Lumentum, and related interconnect suppliers benefit as inference architectures become increasingly fabric-bound.
The decisive diligence questions are whether agentic workloads with 95%+ KV-cache reuse become mainstream, whether long-context agent deployments scale into persistent high-concurrency production, whether SSDs become a standard KV-cache tier across major inference stacks, whether QLC endurance and tail latency are sufficient for large read-mostly KV workloads, whether GPU-accessible SSD designs gain platform-level adoption, whether Ethernet can match InfiniBand-like predictability for this traffic, and whether memory suppliers maintain supply discipline after current extraordinary price increases. If the answer to most of these questions is positive, DualPath is an early technical signal that AI inference infrastructure spending will broaden materially from accelerators into HBM, server DRAM, enterprise SSDs, NAND, storage software, RDMA fabrics, high-radix switches, CXL/PCIe fabrics, optics, and HDD cold storage.
| Diligence Question | Why It Matters | Priority |
|---|---|---|
| Do 95%+ KV-cache hit rates become common in production agents? | The storage thesis depends on repeated reuse of long accumulated context with small appended deltas. | HIGH |
| Do major inference stacks standardize SSD-backed KV-cache tiers? | Broad adoption would move eSSD from support storage into the inference execution path. | HIGH |
| Can QLC SSDs meet endurance and tail-latency needs? | QLC density is economically attractive, but churn and partial writes can pressure endurance and QoS. | HIGH |
| Can Ethernet match InfiniBand-like predictability? | RDMA, congestion control, and QoS will determine merchant Ethernet share in data-movement-heavy inference. | MED |
| Do memory suppliers maintain supply discipline? | The current pricing environment is extraordinary; oversupply remains the classic risk to DRAM and NAND equities. | HIGH |
Data sources: Bloomberg, FactSet, S&P Capital IQ, company filings, earnings call transcripts, expert network interviews, SEC EDGAR.
Sources cited: DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference arXiv paper; Western Digital company press releases; Micron Technology investor releases; Marvell Technology blog and company releases; TrendForce DRAM and NAND market reports; Solidigm D5-P5336 product materials; SanDisk fiscal third quarter 2026 financial results; Seagate fiscal third quarter 2026 financial results; NVIDIA Quantum-X800 and Spectrum-X product materials; Broadcom Tomahawk 6 Davisson release; Astera Labs company materials; SK hynix 1Q26 financial results release; Samsung Electronics first quarter 2026 results and HBM4 announcement; Kioxia Super High IOPS SSD announcement; NVIDIA Dynamo and NIXL Technical Blog materials; NVIDIA KV Cache bottlenecks with Dynamo Technical Blog; LMCache local storage documentation; LMSYS SGLang HiCache blog; Mooncake x SGLang HiCache system design documentation; DeepSeek 3FS GitHub README; Dell Storage Engines KV Cache offloading blog; WEKA NVIDIA Dynamo/NIXL blog