SubQ 1.1 Small
EDM115
118 points
49 comments
June 16, 2026
Related Discussions
- SubQ: a sub-quadratic LLM with 12M-token context · mitchwainer · 46 pts · May 05, 2026 · 61% similar
- SubQ – a major breakthrough in LLM intelligence · vanni · 20 pts · May 05, 2026 · 60% similar
- SubQ: Sub-quadratic LLM built for 12M-token context · gagan2020 · 17 pts · May 05, 2026 · 56% similar
- The Qwen 3.5 Small Model Series · armcat · 11 pts · March 02, 2026 · 47% similar
- TurboQuant: Building a Sub-Byte KV Cache Quantizer from Paper to Production · wizzense · 13 pts · March 27, 2026 · 46% similar
Discussion Highlights (17 comments)
EDM115
https://subq.ai/docs/subq-1-1-small-model-card.pdf
giancarlostoro
This one's interesting. I think the next frontier for LLMs should really just be: how can we get something like Opus 4.6 to cost drastically less for the same output? I say 4.6 because from 4.6 onwards it's been pretty darn good, at least for me. It always feels like someone hates every model upgrade; heck, even 4.5 was fine.
aesthesia
Disappointing they don't actually say how their sparse attention mechanism works.
cmogni1
I don’t understand why this lab is allergic to providing details on what they actually made, especially when Chinese labs are more than willing to share architectural specs/code/kernels (e.g. NSA/FSA, RAMBa, HISA, DSA LightningIndexer, etc). I don’t doubt that they’ve done something here, but the lack of details makes me default to not trusting this, particularly when this is the second time that they’ve released a “technical report” that just waxes poetic about the concept.
embedding-shape
> SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens yet the retrieval held near perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.
If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true. With back-of-the-napkin math done in my head, that'd be like 0.5 to 1 million LOC, depending on language/code density. You could just fold the entire codebase into one prompt if it's a small one, that'd be neat :)
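The lines-of-code estimate in this comment can be reproduced with a quick sketch. The tokens-per-line densities below (12 and 24) are illustrative assumptions chosen to bracket the commenter's range, not measured figures, and `estimate_loc` is a hypothetical helper name.

```python
def estimate_loc(context_tokens: int, tokens_per_line: float) -> int:
    """Rough count of source lines that fit in a context window."""
    if tokens_per_line <= 0:
        raise ValueError("tokens_per_line must be positive")
    return int(context_tokens // tokens_per_line)


CONTEXT_TOKENS = 12_000_000  # context length quoted in the thread

# Assumed densities: terse code (12 tokens/line) vs. verbose code (24 tokens/line).
for tokens_per_line in (12, 24):
    loc = estimate_loc(CONTEXT_TOKENS, tokens_per_line)
    print(f"{tokens_per_line} tokens/line -> about {loc:,} lines")
```

Real token density depends heavily on the tokenizer and the language, so this only shows that the 0.5 to 1 million LOC range is plausible under those assumed densities.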
chrsw
There was, let's say, significant skepticism the last time they announced something. What's changed?
wxw
> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.
> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
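To see what the quoted scaling claim would imply if taken at face value, here is a minimal sketch of an idealized cost model. It assumes dense attention costs exactly n² units and the sparse mechanism costs exactly k·n units, with k calibrated so the ratio at 1M tokens matches the quoted 64.5x. The function names and the cost model are assumptions for illustration; the vendor has not published how SSA works, and real costs include constants, non-attention layers, and memory traffic that this ignores.

```python
def dense_cost(n: int) -> float:
    """Idealized dense attention cost: every token attends to every token."""
    return float(n) ** 2


def linear_cost(n: int, k: float) -> float:
    """Idealized linear-attention cost: k units of work per token."""
    return k * n


def compute_ratio(n: int, k: float) -> float:
    """How many times more work the dense model does at length n."""
    return dense_cost(n) / linear_cost(n, k)


# Calibrate k so the ratio at 1M tokens equals the quoted 64.5x (assumption).
CALIBRATION_TOKENS = 1_000_000
K = CALIBRATION_TOKENS / 64.5

for n in (1_000_000, 2_000_000, 6_000_000, 12_000_000):
    print(f"{n:>10,} tokens: dense/linear ratio ~ {compute_ratio(n, K):.1f}x")
```

Under this model the ratio grows in proportion to context length, so the gap at 12M tokens would be twelve times the gap at 1M. That is a property of the assumed cost model only, not a measured or published result.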
satyarohith
It's been all talk and no action ever since their first announcement.
maz1b
They've done multiple "evaluations" by third parties, but still, it seems that they aren't being fully transparent. I think the approach is quite interesting and novel, but this feels like deja vu. I get why they aren't disclosing all the details, but it seems more hype-train-esque to me for this moment. I don't disagree that this could be big.
Depurator
What kind of hardware would be needed to serve an instance with the full 12M context? And what kind of speeds can one expect at those extremes, at 10M+?
samber
According to Subquadratic, Needle in a Haystack is strong up to 12M tokens, but RULER has not been tested above 128k tokens?
samber
Comparing compute cost against FlashAttention-2 doesn't seem very honest to me. FlashAttention-2 hasn't been used for at least two years. This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.
ballon_monkey
It's funny that some people on HN think this whole thing is legit. The company was started by a bunch of nobodies with 0 experience in AI in general, let alone ML/Data. Edit: Typical HN "I can downvote but I cannot dispute facts"
kristjansson
It's easy[1] to promise, it's hard to deliver. I hope the best for them. [1]: https://magic.dev/blog/ltm-1 (note the date)
bthornbury
we need some better standard long-context benchmarks. needle in a haystack is not good for this, yes it proves the model can attend to its context, but in its usual form, somewhat trivializes the query-key relationship. something like long-form Q&A would be more ideal. Like reading a book and answering questions that require synthesizing information derived from either the whole thing or disparate portions of it. Like describing an entire character arc in a 1000 page novel with examples and evidential moments.
mark_l_watson
Interesting idea, but until I get my grubby little fingers on it to try it, it's difficult to have an opinion. I am hopefully expectant that we will see all sorts of optimizations in the next few years that will enable even more local model use and slash commercial API costs. I get excited by the results when I enjoy one or two short coding sessions a week with Claude Opus, but it is even more exciting to get a major task done and see that I only used $0.05 for DeepSeek v4 Flash or perhaps $0.15 for DeepSeek v4 Pro. It was exciting in yet another way when I two-shotted a complete TypeScript/Tauri app using gemma-12b-qat with little-coder on a cheap laptop a few days ago.
dundunUp
What is this?