A field-report read on 14x RTX 3090 agent serving, EXL3, FP8 KV cache, Aphrodite, and why concurrency is becoming the local hardware metric.

Parallel Agent Serving Is A Hardware Shape Now

A public field report from Ahmad Osman showed a 14x RTX 3090 rig serving Qwen 3.6 27B to 42 parallel agents, each reportedly at the full 256K context. The stack was EXL3 weights at 6 bpw, an FP8 KV cache, and the Aphrodite Inference Engine combining tensor parallelism with pipeline parallelism.

That does not map directly onto a three-GPU node. It does change what is worth measuring.

The important signal is not just "bigger model." It is high-concurrency local agent serving: many long-context sessions resident at once, with quantized weights, compressed KV cache, batching, and hybrid TP/PP layouts doing the real work.

For smaller owned rigs, the follow-up questions are practical:

How many useful agent contexts can stay resident at 32K, 64K, and 128K?
Does compressed KV cache preserve answer quality under concurrency?
When does pipeline parallelism beat model replication?
How much throughput survives at lower power caps?
Which runtime is the most stable under sustained load, before asking which one posts the best benchmark number?
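
The first question on that list can be roughed out before any benchmark runs. What follows is a minimal back-of-envelope sketch, not a measurement: it assumes every layer uses standard grouped-query attention, and it ignores FP8 scale metadata, paging fragmentation, and activation memory. The names `ModelShape` and `resident_contexts`, the layer and head counts, and the 24 GiB KV budget are all hypothetical placeholders, not the published configuration of Qwen 3.6 27B or of any rig in the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention geometry needed for KV sizing. Values used below are placeholders."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, bytes_per_element: float) -> float:
    """Bytes of KV cache one token occupies across all layers."""
    # The factor of 2 covers the key tensor and the value tensor in each layer.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_element


def resident_contexts(
    shape: ModelShape,
    kv_budget_gib: float,
    context_tokens: int,
    bytes_per_element: float,
) -> int:
    """Whole contexts of the given length that fit in the KV budget."""
    budget_bytes = kv_budget_gib * 1024**3
    per_context = kv_bytes_per_token(shape, bytes_per_element) * context_tokens
    return int(budget_bytes // per_context)


if __name__ == "__main__":
    # Hypothetical geometry and budget; replace with the real model config
    # and the memory actually left over after weights are loaded.
    shape = ModelShape(num_layers=64, num_kv_heads=8, head_dim=128)
    kv_budget_gib = 24.0

    for ctx in (32_768, 65_536, 131_072):
        fp16 = resident_contexts(shape, kv_budget_gib, ctx, bytes_per_element=2.0)
        fp8 = resident_contexts(shape, kv_budget_gib, ctx, bytes_per_element=1.0)
        print(f"{ctx:>7} tokens: FP16 KV fits {fp16}, FP8 KV fits {fp8}")
```

With these placeholder numbers the 128K row goes from zero resident contexts at FP16 to one at FP8, which is the kind of threshold a small rig crosses or misses. The estimate is an upper bound on residency only; it says nothing about whether answer quality or throughput survives, which is what the remaining questions are for.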

The local-agent future is not only one huge chat session. It is a rack full of useful contexts, kept close to the operator, with enough scheduling discipline to make them feel boring.