Three RTX 3090s, One 32B Model: A Pipeline-Parallel Canary
Nodehome's current three-GPU serving experiment is less about chasing a single headline benchmark and more about finding the shapes that actually work on owned Ampere hardware.
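The useful lesson from the latest canary: tensor parallelism is not always the natural fit. For the tested Qwen2.5 32B AWQ checkpoint, the attention layout makes a 3-way tensor split invalid, because tensor parallelism needs the attention heads to divide evenly across the GPUs and this model's head count does not divide by three. Pipeline parallelism, which gives each card a contiguous block of layers instead of a slice of every layer, is the viable way to put all three RTX 3090s to work on that model.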
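To make the divisibility argument concrete, here is a minimal sketch. It assumes the published Qwen2.5 32B configuration (40 query heads, 8 KV heads, 64 layers) and a simplified rule that both head counts must divide evenly by the tensor-parallel degree. Real runtimes may apply additional or slightly different checks, and the function names are illustrative, not taken from any serving library.

```python
# Assumed model shape (from the public Qwen2.5 32B config; verify against
# the config.json of the checkpoint you actually serve).
NUM_HEADS = 40      # query attention heads
NUM_KV_HEADS = 8    # key/value heads (grouped-query attention)
NUM_LAYERS = 64     # transformer layers


def tensor_split_ok(num_heads: int, num_kv_heads: int, tp: int) -> bool:
    """Simplified check: every tensor-parallel rank needs a whole number of heads."""
    return num_heads % tp == 0 and num_kv_heads % tp == 0


def pipeline_layer_split(num_layers: int, stages: int) -> list[int]:
    """Spread layers over pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, stages)
    # The first `extra` stages take one additional layer each.
    return [base + (1 if i < extra else 0) for i in range(stages)]


if __name__ == "__main__":
    for tp in (1, 2, 3, 4):
        verdict = "valid" if tensor_split_ok(NUM_HEADS, NUM_KV_HEADS, tp) else "invalid"
        print(f"tensor parallel = {tp}: {verdict}")
    print("3-stage pipeline, layers per stage:", pipeline_layer_split(NUM_LAYERS, 3))
```

Under these assumptions a 3-way tensor split fails because 40 is not a multiple of 3, while a 3-stage pipeline only needs the 64 layers divided into three blocks of roughly equal size.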
The canary completed a repeated-request soak with successful HTTP responses, stable worker behavior, and bursty inference loads at the 300 W cap. Temperatures peaked in the low 80s °C on the hotter cards and dropped quickly once the request burst ended.
That makes it a real serving-shape signal, but it does not settle every question:
- It is a 32B AWQ pipeline-parallel proof, not a 70B proof.
- It is a bursty inference canary, not a sustained training or stress-test pass.
- It shows that three consumer GPUs can be useful even when tensor parallelism is the wrong fit.
- It keeps the next question focused on workload shape: interactive agents, long-context concurrency, and sustained thermal policy are different tests.
The practical takeaway for local builders is simple: count GPUs, but also count model heads, KV layout, runtime constraints, and thermal behavior. "Three GPUs" is not one architecture. It is a menu of possible serving shapes.
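The "menu" idea can be sketched as a small enumeration: for each GPU count up to the number owned, list every tensor-parallel by pipeline-parallel factorization that passes a head-divisibility check. This is an illustrative sketch only. The model numbers are the assumed Qwen2.5 32B values (40 query heads, 8 KV heads, 64 layers), the `Shape` class and `serving_shapes` function are hypothetical names, and the check ignores memory fit, runtime-specific constraints, and thermals, all of which still have to be tested on real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    gpus: int  # GPUs used by this shape
    tp: int    # tensor-parallel degree
    pp: int    # pipeline-parallel degree (gpus == tp * pp)


def serving_shapes(max_gpus: int, num_heads: int,
                   num_kv_heads: int, num_layers: int) -> list[Shape]:
    """List (tp, pp) layouts that pass a simplified divisibility check."""
    shapes = []
    for gpus in range(1, max_gpus + 1):
        for tp in range(1, gpus + 1):
            if gpus % tp != 0:
                continue  # tp must divide the GPU count evenly
            pp = gpus // tp
            if num_heads % tp != 0 or num_kv_heads % tp != 0:
                continue  # heads cannot be split evenly across tp ranks
            if pp > num_layers:
                continue  # each pipeline stage needs at least one layer
            shapes.append(Shape(gpus=gpus, tp=tp, pp=pp))
    return shapes


if __name__ == "__main__":
    # Assumed Qwen2.5 32B values: 40 query heads, 8 KV heads, 64 layers.
    for shape in serving_shapes(3, num_heads=40, num_kv_heads=8, num_layers=64):
        print(shape)
```

With these inputs the only layout that uses all three cards is `tp=1, pp=3`, which matches the pipeline-parallel shape the canary exercised. The two-GPU layouts remain on the menu as alternatives to compare per workload.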