Week 6, February 8, 2026

Moxel: Why VRAM Pooling Across Consumer GPUs Is Hard and How We're Approaching It

First real week on Moxel. PCIe bandwidth is the constraint everyone hits. Notes on the architecture approach and where the overhead actually comes from.

Tags: GPU, Moxel, ML Infra, VRAM, NVIDIA

The problem

vm1 has four RTX 3090s, 24GB VRAM each. For the models we want to run (Mixtral 8x7B, some of the 70B class models), a single card isn’t enough. The standard approach is to use device_map="auto" in HuggingFace, which spreads a model across multiple GPUs automatically.

The issue: standard multi-GPU inference without NVLink is slow. The RTX 3090 does support NVLink, but only as a 2-way bridge between a pair of cards (later consumer cards dropped it entirely); a four-GPU pool can't be fully connected that way, so in this setup all inter-GPU communication goes over PCIe. For autoregressive generation, where every token requires a full forward pass and intermediate activations move between cards, PCIe quickly becomes the bottleneck.

The other issue: existing tools like accelerate and tensor_parallel assume you own the GPU topology and can plan the model partition ahead of time. We want something more flexible: pool the VRAM and let the runtime decide where things live, without modifying the model or the inference code.

That’s what Moxel is.

PCIe bandwidth: the actual numbers

PCIe Gen 4 x16 (what the 3090s support): ~31.5 GB/s theoretical, roughly 28 GB/s real-world.

For comparison, NVLink on A100s: 600 GB/s bidirectional, i.e. 300 GB/s each way. Against PCIe's ~31.5 GB/s per direction, that's roughly a 10x gap.

For batch inference with large batch sizes, where you process many requests together and amortise the memory transfer cost, the PCIe constraint is acceptable. You’re spending most of your time on compute, not data movement.

For autoregressive decoding at small batch sizes (generating tokens one at a time, the common single-user inference pattern), you're doing repeated small transfers for every token. There, bandwidth and per-transfer latency become the bottleneck rather than GPU compute. This is where the overhead shows up.

The theoretical pooling scenario: 4x 3090s = 96GB of pooled VRAM. A 70B model is ~140GB of weights at fp16, so it only fits quantized; the question is whether the per-token latency is then acceptable. Based on bandwidth calculations alone, probably 2-3x slower than a native NVLink setup for single-stream generation. For batch processing it's much better.
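A quick back-of-envelope helper makes the small-transfer-vs-bulk-transfer split concrete. The bandwidth, latency, and size figures below are assumptions for illustration, not measurements:

```python
# Toy PCIe transfer cost model (all figures are assumed, not measured):
# time = fixed per-transfer latency + bytes / sustained bandwidth.

PCIE_GEN4_X16_BPS = 28e9    # ~28 GB/s sustained, per direction (assumed)
TRANSFER_LATENCY_S = 10e-6  # ~10 us fixed cost per copy (assumed)

def transfer_time_s(num_bytes: float) -> float:
    """Estimated wall time to move num_bytes over PCIe Gen 4 x16."""
    return TRANSFER_LATENCY_S + num_bytes / PCIE_GEN4_X16_BPS

# One token's activations at a layer boundary for a 70B-class model:
# hidden size ~8192 at fp16 -> ~16 KB. Latency-dominated.
act = transfer_time_s(8192 * 2)

# A whole transformer layer's weights (order 1 GB at fp16 for a
# 70B-class model, assumed): bandwidth-dominated, tens of ms.
layer = transfer_time_s(1e9)

print(f"activation hop: {act * 1e6:.1f} us")
print(f"layer weights:  {layer * 1e3:.1f} ms")
```

The point of the model: tiny per-token activation hops cost essentially the fixed latency, while bulk weight movement costs real milliseconds per layer, and that second category is what prefetching has to hide.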

Architecture approach

The core insight: instead of modifying how the model runs, intercept at the memory allocation layer. When a CUDA allocation request comes in that exceeds what’s available on the local GPU, transparently allocate it on a peer GPU and handle the data movement behind the scenes.

This is conceptually similar to unified memory (cudaMallocManaged) but with more control over placement policy and prefetching. CUDA’s UVM handles this but makes conservative decisions that don’t suit inference workloads well.
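The bookkeeping side of that intercept layer can be sketched in plain Python. Device names, capacities, and the spill policy here are illustrative; the real layer would sit behind the CUDA allocator rather than in Python:

```python
class PooledAllocator:
    """Toy VRAM pool: place each allocation on the preferred GPU,
    spill to the peer with the most free bytes when it doesn't fit.
    Illustrative sketch only -- no real CUDA calls."""

    def __init__(self, capacities: dict[str, int]):
        self.free = dict(capacities)  # device -> free bytes
        self.placement = {}           # allocation name -> device

    def alloc(self, name: str, size: int, preferred: str) -> str:
        # Local-first: honour the preferred device if it has room.
        if self.free[preferred] >= size:
            device = preferred
        else:
            # Spill: pick the peer with the most free VRAM.
            device = max(self.free, key=self.free.get)
            if self.free[device] < size:
                raise MemoryError(f"no device can hold {size} bytes")
        self.free[device] -= size
        self.placement[name] = device
        return device

    def free_alloc(self, name: str, size: int) -> None:
        self.free[self.placement.pop(name)] += size

pool = PooledAllocator({f"cuda:{i}": 24 * 2**30 for i in range(4)})
print(pool.alloc("layer0.weight", 20 * 2**30, preferred="cuda:0"))  # cuda:0
print(pool.alloc("layer1.weight", 20 * 2**30, preferred="cuda:0"))  # cuda:1 (spilled)
```

The interesting design decision is entirely in the `else` branch: "most free bytes" is the simplest possible spill policy, and the placement-policy and prefetching work below exists precisely because that simple policy ignores access frequency.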

What we’re building:

  1. A memory management layer that tracks allocations across all four GPUs
  2. A placement policy that keeps frequently-accessed tensors local and less-frequent ones remote
  3. Prefetching based on the model's execution graph: if we know layer N+1 will need certain weights, start fetching them during layer N's compute
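Point 2 can start as something very simple: an access counter per tensor, with the hottest tensors claiming local VRAM. A sketch, with the caveat that a real policy would also have to weigh tensor size and migration cost:

```python
from collections import Counter

class HotnessTracker:
    """Count accesses per tensor; the hottest tensors get local VRAM.
    Illustrative sketch -- a real policy must also weigh tensor size
    and the cost of migrating a tensor between GPUs."""

    def __init__(self):
        self.counts = Counter()

    def touch(self, tensor_id: str) -> None:
        """Record one access (called from the interception layer)."""
        self.counts[tensor_id] += 1

    def plan(self, local_slots: int) -> set[str]:
        """Return the tensor ids that should live on the local GPU."""
        return {t for t, _ in self.counts.most_common(local_slots)}

tracker = HotnessTracker()
for _ in range(100):
    tracker.touch("embed.weight")        # touched every token
for _ in range(3):
    tracker.touch("layer42.mlp.weight")  # rarely on the hot path
tracker.touch("lm_head.weight")

print(tracker.plan(local_slots=2))
```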

The execution graph part is the hard bit. For transformer-based models the layer sequence is deterministic and predictable, which helps. For more dynamic architectures it’s more complex.
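For the deterministic case, the prefetch schedule falls out directly: issue the fetch for layer N+1, then run layer N's compute, so the two overlap. A sketch that just builds the ordered timeline (a real runtime would issue these on separate CUDA streams rather than as a Python list):

```python
def prefetch_schedule(num_layers: int) -> list[tuple[str, int]]:
    """Build an overlapped (event, layer) timeline for a deterministic
    layer sequence: the fetch of layer i+1 is issued before the compute
    of layer i, so the transfer can run during the compute."""
    timeline = [("fetch", 0)]  # first layer has nothing to hide behind
    for i in range(num_layers):
        if i + 1 < num_layers:
            timeline.append(("fetch", i + 1))  # issue next fetch first...
        timeline.append(("compute", i))        # ...then compute layer i
    return timeline

for event in prefetch_schedule(3):
    print(event)
```

Only the first layer's fetch is unavoidably exposed; every later fetch is issued before the preceding compute and can hide behind it, up to the point where a fetch takes longer than a layer's compute.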

Where I am this week

Got the basic multi-GPU memory pool working: allocations spread across devices, data movement on access. No prefetching yet. Performance without prefetching is poor (as expected); it's mostly waiting on PCIe transfers.

Initial benchmark running Mixtral 8x7B inference across 2x 3090s (the model doesn't fit on one card, so there's no single-GPU baseline to compare against):

  • Throughput: about 11 tokens/second at batch size 1
  • Single A100 80GB would do ~45 tokens/second for reference
  • The gap is almost entirely bandwidth

Not surprising, but useful as a baseline. Prefetching should recover significant ground; the question is how much of the transfer latency we can hide behind compute.
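That question reduces to a simple overlap model: without prefetching, a token costs compute plus transfer time; with perfect prefetching it costs whichever is larger. The per-token figures below are placeholders chosen to illustrate the shape, not measurements:

```python
def tokens_per_s(compute_s: float, transfer_s: float, overlap: bool) -> float:
    """Per-token throughput under a toy serial-vs-overlapped cost model.
    overlap=False: transfer blocks compute (no prefetch).
    overlap=True:  perfect prefetch hides the smaller cost entirely."""
    per_token = max(compute_s, transfer_s) if overlap else compute_s + transfer_s
    return 1.0 / per_token

# Placeholder per-token costs (assumed, not measured):
compute, transfer = 0.030, 0.060  # 30 ms compute, 60 ms PCIe per token
print(f"no prefetch:      {tokens_per_s(compute, transfer, overlap=False):.1f} tok/s")
print(f"perfect prefetch: {tokens_per_s(compute, transfer, overlap=True):.1f} tok/s")
```

The model also shows the ceiling: once transfer time exceeds compute time per token, perfect overlap still leaves you transfer-bound, so the only further wins come from moving fewer bytes (better placement) rather than hiding them better.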

One thing that’s clearer now

The use case for Moxel isn't to compete with NVLink setups. It's to make consumer hardware usable for large-model inference at all. A 70B model that doesn't fit on any single consumer GPU becomes runnable: slowly, but runnable. For research and development where cost matters more than latency, that's worth having.


Next week: back to LLM security. I want to dig properly into indirect prompt injection and RAG attack surfaces; it's been on my reading list since the Greshake paper.