RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

iMil 222 points 76 comments June 13, 2026
imil.net · View on Hacker News

Discussion Highlights (18 comments)

ComputerGuru

I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.

deng

I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~$3 per 1M tokens for that model on OpenRouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
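Editor's sketch: the comparison above can be turned into a rough break-even figure. This uses only numbers quoted in the thread (about $2,000 of hardware, about $3 per 1M tokens, and the 80 tok/s from the title). It deliberately ignores electricity, and it assumes every generated token would otherwise have been billed at the quoted rate, so treat the result as an order-of-magnitude estimate rather than a verdict.

```python
# Break-even sketch using only figures quoted in this thread.
HARDWARE_COST_USD = 2000.0      # "well over 2k" for a refurbished 3090 plus a 5080 (lower bound)
API_USD_PER_MILLION = 3.0       # "~$3 per 1M tokens" as quoted for OpenRouter
LOCAL_TOKENS_PER_SECOND = 80.0  # headline throughput from the article title


def break_even_tokens(hardware_usd: float, api_usd_per_million: float) -> float:
    """Tokens to generate before cumulative API spend equals the hardware cost."""
    return hardware_usd / api_usd_per_million * 1_000_000


def days_of_generation(tokens: float, tokens_per_second: float) -> float:
    """Days of non-stop generation needed to produce `tokens`."""
    return tokens / tokens_per_second / 86_400


tokens = break_even_tokens(HARDWARE_COST_USD, API_USD_PER_MILLION)
days = days_of_generation(tokens, LOCAL_TOKENS_PER_SECOND)
print(f"break-even: {tokens / 1e6:.0f}M tokens, about {days:.0f} days of continuous generation")
```

Under these assumptions the hardware pays for itself only after several hundred million tokens, which is why the answer depends so heavily on how much you actually generate.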

avyeed_desa

I just bought a $25 Chinese 2x Oculink card and two Minisforum DEG1 docks, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but I don't know if that will work, too.

atlgator

Which "good quality PCIe 4 riser" did you buy?

sieste

That's almost exactly my setup and I'm very happy with its performance. I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code. Both fail at different tasks, and Qwen more so than Claude. But the way Qwen fails is much more straightforward. In writing tasks Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance. In coding tasks that Qwen can't solve it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, just making more and more mess that takes forever to clean up. I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had similar experiences?

ydj

80 tok/s with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tok/s utilizing all three for Qwen3.6 27B Q8. Guess I've got more optimization to do. I'd like to see the perf of their setup with and without MTP and ngram speculative decoding though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots). Being in California, though, electricity alone makes this uncompetitive with just paying a cloud provider.

varispeed

Could 2x RTX5080 work just as well?
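Editor's sketch: one way to sanity-check this question is a weight-size estimate. The code below assumes the "Q8" in the title means GGUF Q8_0 (8.5 bits per weight: 32 int8 values plus one fp16 scale per block), assumes exactly 27e9 parameters, and treats the advertised 16 GB and 24 GB as GiB. It only covers weights; KV cache, compute buffers and any draft model come on top, so it says nothing about achievable context length or speed.

```python
# Rough VRAM fit check. All constants below are assumptions, not measurements.
Q8_0_BITS_PER_WEIGHT = 8.5   # GGUF Q8_0: 32 int8 weights + one fp16 scale per block
ASSUMED_PARAMS = 27e9        # taken from the "27B" in the model name
GIB = 2**30


def weights_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params * bits_per_weight / 8 / GIB


def headroom_gib(vram_per_card_gib: list[float], params: float, bits_per_weight: float) -> float:
    """VRAM left across all cards after the weights, before KV cache and buffers."""
    return sum(vram_per_card_gib) - weights_gib(params, bits_per_weight)


for label, cards in [("2x RTX 5080", [16, 16]), ("RTX 5080 + RTX 3090", [16, 24])]:
    left = headroom_gib(cards, ASSUMED_PARAMS, Q8_0_BITS_PER_WEIGHT)
    print(f"{label}: about {left:.1f} GiB left after weights")
```

On these assumptions the weights alone would fit on two 16 GB cards, but with far less room for context than the 5080 + 3090 pairing.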

triwats

Potential specs:
NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb
NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb

stared

I really like Qwen 3.6 27B Q8. On Apple Silicon, with MLX-LM, I am getting 20 tok/s on a MacBook with an M5 Max. Not sure how it compares to llama.cpp performance. In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on a laptop is wild. Though, it heats my laptop rapidly.

well_ackshually

It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that amount of power flowing through, you'd better have a good PSU and check your power plugs themselves: these are going to get HOT when your entire setup is basically drawing 1kW.
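Editor's sketch: the power figure above can be converted into energy per million tokens. This assumes the 700W full-load draw is sustained for the whole generation and that throughput holds at the 80 tok/s from the title; real decode workloads may draw less. The electricity price is a hypothetical placeholder, not a quoted tariff.

```python
# Energy-per-token sketch. 700 W and 80 tok/s come from the thread;
# the electricity price is a hypothetical value chosen for illustration.
ASSUMED_USD_PER_KWH = 0.30


def kwh_per_million_tokens(watts: float, tokens_per_second: float) -> float:
    """Energy in kWh to generate one million tokens at a constant power draw."""
    seconds = 1_000_000 / tokens_per_second
    return watts / 1000 * seconds / 3600


def electricity_usd_per_million(watts: float, tokens_per_second: float, usd_per_kwh: float) -> float:
    """Electricity cost in USD per million generated tokens."""
    return kwh_per_million_tokens(watts, tokens_per_second) * usd_per_kwh


kwh = kwh_per_million_tokens(700, 80)
usd = electricity_usd_per_million(700, 80, ASSUMED_USD_PER_KWH)
print(f"{kwh:.2f} kWh and about ${usd:.2f} of electricity per 1M tokens")
```

Swap in your own tariff and measured wall draw to get a figure that applies to your setup.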

cybertim

I bought two 3080 20GB cards and one of those MACHINIST X99 mainboards as well (one with two full x16 PCIe slots). Those boards come with a Xeon CPU included (for the PCIe lane support). It set me back 800 euros total (I had a spare PSU, SSD and memory in a drawer), and now I'm also happily running Qwen 3.6 Q8 (MTP) at 80 tok/s.

tonyrice

If I had an eGPU right now, I'd 100% be using Qwen

skhameneh

Would you mind giving these a try and letting me know how they work for you? I’d imagine you would get better results, and the latter will fit on a single GPU.
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...
Do be sure to use dflash and/or mtp for the draft:
https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3
https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3

DiabloD3

The recommended values for Qwen 3.6 in thinking mode are `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00`, and `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00` for coding/tool calling tasks, and for non-thinking, `--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00`. The options listed are none of these. Also, the recommended Qwen MTP settings are `--spec-type draft-mtp --spec-draft-n-max 2`. 3 is not good on Nvidia hardware under different workloads. You can also add `ngram-mod`, but after `draft-mtp`; however, default `ngram-mod` settings aren't well tuned, and you want `--spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 16 --spec-ngram-mod-n-match 6` (defaults are 48, 64, 24; the ratio is good, the magnitude is suboptimal). Of abliterated Qwen 3.6 27B models, huihui's ends up being the worst. Try heretic instead. https://huggingface.co/mradermacher/Qwen3.6-27B-uncensored-h...
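Editor's sketch: the three sampling presets quoted above are easy to mix up, so here is one way to keep them in a lookup table and expand the right one into command-line arguments. The values are copied from the comment; the preset names and the `sampling_args` helper are hypothetical, and the speculative-decoding flags are left out because they depend on the llama.cpp build in use. The printed string is only the sampling portion of a command; a model path and other options would still be needed.

```python
import shlex

# Sampling presets as quoted in the comment above; the preset names are illustrative.
SAMPLING_PRESETS = {
    "thinking": {"--temp": 1.0, "--top-p": 0.95, "--top-k": 20, "--min-p": 0.0},
    "coding": {"--temp": 0.6, "--top-p": 0.95, "--top-k": 20, "--min-p": 0.0},
    "non-thinking": {
        "--temp": 0.7,
        "--top-p": 0.8,
        "--top-k": 20,
        "--presence-penalty": 1.5,
        "--min-p": 0.0,
    },
}


def sampling_args(mode: str) -> list[str]:
    """Expand a named preset into a flat list of CLI arguments (hypothetical helper)."""
    args: list[str] = []
    for flag, value in SAMPLING_PRESETS[mode].items():
        args += [flag, str(value)]
    return args


# Print the sampling portion of a llama-server invocation for the coding preset.
print(shlex.join(["llama-server", *sampling_args("coding")]))
```

Keeping the presets in one table makes it harder to launch a coding session with the thinking-mode temperature by accident.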

WeylandDarkStar

Sits in silence, watching China innovate a new type of ultra-thin GPU board and call it the 5090 "Turbo." Still waiting for Shenzhen listings to post an officially verified 5090 with a VBIOS crack...

neals

I tried implementing Qwen through OpenRouter and DeepInfra. Even without thinking, I had to wait 60s+ for the full result, where Haiku or Flash would be done in 5 or 6 seconds.

irishcoffee

It is absolutely mind-blowing to see some of the responses here. The open source, run-your-own, pay-for-nothing, we’re-all-nerds-that-buy-the-hardware-anyways ethos seems basically dead. I guess I’m getting old. I own two 16GB cards and I use them for models, for GPU passthrough for gaming, 3D model rendering, etc. 14-year-old me is mortified at this community.

mirekrusin

90 t/s on 2x 4090 256k context
