RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8
iMil
222 points
76 comments
June 13, 2026
Related Discussions
- We got 207 tok/s with Qwen3.5-27B on an RTX 3090 GreenGames · 162 pts · April 20, 2026 · 74% similar
- Qwen-3.6-Plus is the first model to break 1T tokens processed in a day Alifatisk · 49 pts · April 05, 2026 · 62% similar
- Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution FranckDernoncou · 34 pts · May 15, 2026 · 59% similar
- 768GB Intel Optane DIMMs to run 1T-parameter LLM with single GPU at 4tps walterbell · 26 pts · May 30, 2026 · 54% similar
- 10k-watt GPU meet 40-watt lump of meat speckx · 11 pts · April 21, 2026 · 50% similar
Discussion Highlights (18 comments)
ComputerGuru
I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.
deng
I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~$3 per 1M tokens for that model on OpenRouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
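A rough break-even sketch for the comparison above, using only figures quoted in this thread: ~$3 per 1M tokens, 80 tok/s from the title, and "well over 2k" treated as roughly $2,000 (the currency and the exact amount are assumptions). Electricity is left out and the function names are illustrative, so this is a back-of-the-envelope estimate, not a real cost model.

```python
def break_even_tokens(hardware_usd: float, api_usd_per_million: float) -> float:
    """Tokens to generate locally before the hardware costs less than the API."""
    return hardware_usd / api_usd_per_million * 1_000_000


def days_at_rate(tokens: float, tokens_per_second: float) -> float:
    """Days of continuous generation needed to produce `tokens`."""
    return tokens / tokens_per_second / 86_400


# Assumed inputs: $2,000 hardware, $3 per 1M API tokens, 80 tok/s local decode.
tokens = break_even_tokens(hardware_usd=2000, api_usd_per_million=3.0)
days = days_at_rate(tokens, tokens_per_second=80)
print(f"break-even after {tokens / 1e6:.0f}M tokens, "
      f"about {days:.0f} days of nonstop generation")
```

Under these assumptions the hardware pays for itself only after several hundred million generated tokens, which is why usage volume matters more than the sticker price in this argument.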
avyeed_desa
I just bought a $25 Chinese 2x Oculink card and two Minisforum DEG1, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but I don't know if that will work, too.
atlgator
Which "good quality PCIe 4 riser" did you buy?
sieste
That's almost exactly my setup and I'm very happy with its performance. I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code. Both fail at different tasks, and Qwen more so than Claude. But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance. In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, just making more and more mess that takes forever to clean up. I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?
ydj
80 tok/s with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tok/s utilizing all three for Qwen 3.6 27B Q8. Guess I've got more optimization to do. Would like to see the perf of their setup with and without MTP and ngram speculative decoding though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots). Being in California, electricity alone makes this non-competitive with just paying for a cloud service, though.
varispeed
Could 2x RTX5080 work just as well?
triwats
Potential specs: NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb
stared
I really like Qwen 3.6 27B Q8. On Apple Silicon, with MLX-LM, I am getting 20 tok/s on an M5 Max MacBook. Not sure how it compares to llama.cpp performance. In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on a laptop is wild. Though it heats my laptop rapidly.
well_ackshually
It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that much power flowing through, you'd better have a good PSU and check your power plugs themselves; these are going to get HOT when your entire setup is basically drawing 1kW.
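To put the 700W figure in money terms, here is a small sketch of the electricity cost per million generated tokens. The 700W draw and the 80 tok/s rate come from this thread; the $0.30/kWh tariff is purely an assumed placeholder, and the calculation assumes the rig sits at full load for the whole run.

```python
def electricity_usd_per_million_tokens(
    watts: float, tokens_per_second: float, usd_per_kwh: float
) -> float:
    """Electricity cost of generating 1M tokens at a constant power draw."""
    seconds = 1_000_000 / tokens_per_second   # time to produce 1M tokens
    kwh = watts / 1000 * seconds / 3600       # energy used over that time
    return kwh * usd_per_kwh


# 700 W and 80 tok/s are from the thread; the tariff is an assumption.
cost = electricity_usd_per_million_tokens(
    watts=700, tokens_per_second=80, usd_per_kwh=0.30
)
print(f"~${cost:.2f} of electricity per 1M tokens")
```

Swap in your own tariff and measured wall draw; idle power between requests is not counted here and can dominate for light usage.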
cybertim
I bought two 3080 20GB cards and one of those MACHINIST X99 mainboards (one with two full x16 PCIe slots). Those boards come with a Xeon CPU included (for the PCIe lane support). It set me back 800 euros total (I had a spare PSU, SSD and memory in a drawer), and now I'm also happily running Qwen 3.6 Q8 (MTP) at 80 tok/s.
tonyrice
If I had an eGPU right now, I'd 100% be using Qwen
skhameneh
Would you mind giving these a try and letting me know how they work for you? I’d imagine you would get better results, and the latter will fit on a single GPU. https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX... https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi... Do be sure to use dflash and/or mtp for the draft: https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3
DiabloD3
The recommended values for Qwen 3.6 in thinking mode are `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00`, and `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00` for coding/tool calling tasks, and for non-thinking, `--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00`. The options listed are none of these. Also, the recommended Qwen MTP settings are `--spec-type draft-mtp --spec-draft-n-max 2`. 3 is not good on Nvidia hardware under different workloads. You can also add `ngram-mod`, but after `draft-mtp`; however, default `ngram-mod` settings aren't well tuned, and you want `--spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 16 --spec-ngram-mod-n-match 6` (defaults are 48, 64, 24; the ratio is good, the magnitude is suboptimal). Of abliterated Qwen 3.6 27B models, huihui's ends up being the worst. Try heretic instead. https://huggingface.co/mradermacher/Qwen3.6-27B-uncensored-h...
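One way to keep these presets straight is a small launcher helper. The sketch below copies the sampling and MTP flags from the comment above (not independently verified here), with every option written in double-dash form. The `llama-server` binary name, the model filename, and the helper itself are assumptions for illustration only.

```python
import shlex

# Sampling presets as quoted in the comment above (unverified).
SAMPLING_PRESETS = {
    "thinking": "--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00",
    "coding": "--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00",
    "non-thinking": "--temp 0.7 --top-p 0.8 --top-k 20 "
                    "--presence-penalty 1.5 --min-p 0.00",
}

# MTP speculative decoding flags, also taken from the comment.
MTP_FLAGS = "--spec-type draft-mtp --spec-draft-n-max 2"


def build_command(mode: str, model_path: str, binary: str = "llama-server") -> list[str]:
    """Assemble an argument list for the given mode (hypothetical helper)."""
    if mode not in SAMPLING_PRESETS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SAMPLING_PRESETS)}")
    return [
        binary, "-m", model_path,
        *shlex.split(SAMPLING_PRESETS[mode]),
        *shlex.split(MTP_FLAGS),
    ]


# Hypothetical model filename; substitute your own GGUF path.
cmd = build_command("coding", "qwen3.6-27b-q8_0.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd)`; check the flags against your build's `--help` output first, since option names change between releases.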
WeylandDarkStar
Sits in silence, watching China innovate a new type of ultra-thin GPU board and call it 5090 "Turbos." Still waiting for Shenzhen listings to post an officially verified 5090 with a VBIOS crack...
neals
I tried implementing Qwen through OpenRouter and DeepInfra. Even without thinking, I had to wait 60s+ for the full result, where Haiku or Flash would be done in 5 or 6 seconds.
irishcoffee
It is absolutely mind-blowing to see some of the responses here. The open source, run-your-own, pay-for-nothing, we’re-all-nerds-that-buy-the-hardware-anyways ethos seems basically dead. I guess I’m getting old. I own two 16GB cards and I use them for models, for GPU passthrough for gaming, 3D model rendering, etc. 14-year-old me is mortified at this community.
mirekrusin
90 t/s on 2x 4090 256k context