Nvidia Cosmos 3

tosh 145 points 27 comments June 01, 2026
developer.nvidia.com

Discussion Highlights (8 comments)

aabdi

SOTA open-source model for image and video generation. It beats all others but, at 64B params, is too big to run on most people’s computers. Still impressive given that its training sets are artificially generated. It beats Nano Banana 1 but is not yet competitive with Nano Banana 2, Seedance 2, Grok Imagine, etc.
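For a rough sense of why 64B parameters is out of reach for most machines, here is a back-of-the-envelope sketch. It assumes weight memory is simply parameter count times bytes per parameter; the function name and the precision choices (16-bit, 8-bit, 4-bit) are illustrative assumptions, not published figures for this model, and activations and other runtime overhead are ignored.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed for model weights alone, in decimal GB.

    Assumes every parameter is stored at the same precision
    (2 bytes per parameter = fp16/bf16 by default). Activations,
    caches, and framework overhead are deliberately ignored.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical storage precisions, chosen only for illustration.
    for label, nbytes in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        print(f"64B params at {label}: ~{weight_memory_gb(64, nbytes):.0f} GB")
```

Under these assumptions the weights alone need about 128 GB at 16-bit precision, which is well beyond the memory of typical consumer GPUs.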

causal

I'm struggling to understand what this does.
> Generates future observations and action sequences.
Is that just a complicated way of saying video gen?

darth_avocado

> Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.
Looking forward to trying this out on my $10,000+ workstation-grade GPU, which needs an equally expensive setup to run.

BugsJustFindMe

The warehouse safety video example is really funny, because the people don't react at all.

sosodev

Most of the examples they've chosen seem... not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.

mangoman

> This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers.
> Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
> Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.
This sort of approach (and others I've seen like it) always appeals to my inner engineer: trying to optimize and balance tradeoffs between model architectures, and combining two things to yield the best of both worlds.
But based on my understanding of the Bitter Lesson ( http://www.incompleteideas.net/IncIdeas/BitterLesson.html ), this is precisely the wrong approach in the long term. I'm linking the actual text of the Bitter Lesson because I think it's misunderstood (or I just don't agree with how I've seen it used in discourse). Specifically:
> The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
This architecture feels specifically like "trying to build knowledge into the agent that will help in the short term" but will plateau in the long term. That's not to say that there won't be some interesting learnings or things built on top of it, but I doubt that there's a lot of juice to squeeze with this kind of approach IMO.

cesarvarela

It is funny that after all their tech advancements, the site is struggling under heavy load.

ramaseshanms

The two-tower Mixture-of-Transformers design (autoregressive reasoner feeding a diffusion generator) is an interesting architectural bet.
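To make the data flow of a "reasoner feeding a generator" concrete, here is a deliberately tiny sketch. It is not NVIDIA's implementation and not a real diffusion model: `reasoner` and `generator` are hypothetical stand-ins, the "conditioning" is just a mean and a range, and the "denoising" is a plain iterative nudge from random noise toward a conditioned target. It only illustrates the ordering: understanding is computed first, and generation is conditioned on it.

```python
import random
from typing import List


def reasoner(observation: List[float]) -> List[float]:
    """Hypothetical stand-in for the reasoner tower.

    Summarizes the input into a small conditioning vector
    (here: the mean and the range of the observation).
    """
    mean = sum(observation) / len(observation)
    return [mean, max(observation) - min(observation)]


def generator(condition: List[float], steps: int = 50,
              length: int = 4, seed: int = 0) -> List[float]:
    """Hypothetical stand-in for the generator tower.

    Starts from Gaussian noise and repeatedly moves each value a
    fraction of the way toward a target taken from the condition.
    This is a toy refinement loop, not an actual diffusion sampler.
    """
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(length)]
    target = condition[0]  # the toy "denoiser" just predicts the conditioned mean
    for _ in range(steps):
        sample = [x + 0.2 * (target - x) for x in sample]
    return sample


if __name__ == "__main__":
    cond = reasoner([1.0, 2.0, 3.0])   # step 1: reason about the input
    print(generator(cond))             # step 2: generate, conditioned on step 1
```

The design point the sketch captures is the one-way dependency: the generator never sees the raw observation, only what the reasoner passes along.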
