Columnar Storage Is Normalization

ibobev 108 points 37 comments April 22, 2026
buttondown.com

Discussion Highlights (10 comments)

immanuwell

The normalization analogy is genuinely clever as a teaching tool, but it quietly papers over the fact that normalization is a logical design concept while columnar storage is a physical one. Treating them as the same thing can mislead more than it clarifies, I think.

orangepanda

Is this meant to be a poor explanation of sixth normal form?

Lucasoato

This is an interesting thought, even if it doesn’t come with practical consequences. One could argue that if you encode your table in a columnar format, you very likely won’t use an index for every “value” but will rely on the order within that specific block instead. But that would mean that if you’re using the data order meaningfully, you’re probably going against the principles of table normalization. Then again, this too can be considered the result of overthinking rather than something practical that can be used.

parpfish

I always thought that the biggest benefit of normalization was deduplicating mutable values, so you only need to update a value in one place and everything stays nicely in sync. The classic example is a “users” table that tracks account id, display name (mutable), and profile picture (mutable), and a “posts” table that has post id, account id, and message text. This lets you change the display name or picture in one place, and the change shows up across all posts.
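A minimal sketch of that users/posts example, using plain Python dicts as stand-in tables. The table contents, field names, and the `render` helper are all made up for illustration:

```python
# Hypothetical in-memory "tables"; names and values are illustrative only.
users = {
    1: {"display_name": "alice", "picture": "a.png"},
}
posts = [
    {"post_id": 10, "account_id": 1, "text": "first"},
    {"post_id": 11, "account_id": 1, "text": "second"},
]


def render(post):
    """Join a post to its author at read time via account_id."""
    user = users[post["account_id"]]
    return f'{user["display_name"]}: {post["text"]}'


# One update in the users table...
users[1]["display_name"] = "alice_v2"

# ...is visible on every post, because posts store only the account id.
rendered = [render(p) for p in posts]
```

Had the display name been copied into each post row instead, the rename would have needed one write per post.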

pwndByDeath

None-or-many?

juancn

It is possible to treat it as purely relational, but data access can be suboptimal if you follow through with it. The main cost is the join when you need to access several columns: it's flexible but expensive. To take full advantage of columnar storage, you usually have to make that join implicit through data alignment, so that no actual join is needed. For example, segment the tables into chunks of up to N records, and keep all the columns of a chunk contiguous so they can be accessed independently: r0, r1 ... rm; f0, f0 ... f0; f1, f1 ... f1; fn, fn ... fn. That balances pointer chasing and joining: you can avoid the IO by loading only the needed columns from the segment, and skip the join because the data is trivially aligned.
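A rough sketch of the chunked layout described above, assuming rows arrive as Python dicts. `to_segments` and `scan` are hypothetical helper names, not any real engine's API, and an in-memory dict lookup stands in for the IO that a real system would skip:

```python
def to_segments(rows, fields, chunk_size):
    """Split rows into chunks of up to chunk_size records and store each
    chunk column by column, so values at index i belong to the same record."""
    segments = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        segments.append({f: [r[f] for r in chunk] for f in fields})
    return segments


def scan(segments, needed):
    """Read only the needed columns of each segment. Rows are rebuilt by
    position (zip), not by a key-based join, because columns are aligned."""
    for seg in segments:
        cols = [seg[f] for f in needed]
        yield from zip(*cols)


rows = [{"id": i, "a": i * 2, "b": str(i)} for i in range(5)]
segments = to_segments(rows, ["id", "a", "b"], chunk_size=2)
pairs = list(scan(segments, ["id", "a"]))
```

The column `b` is never touched by the scan, which is the point: the alignment inside a segment replaces the join.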

remywang

This is exactly domain key normal form! https://en.wikipedia.org/wiki/Domain-key_normal_form

data-ottawa

The Apache Arrow array format docs are a great read if you're interested in this blog post.

notepad0x90

I don't fully agree with this for large nested datasets and arrays. Especially with arrays: what could be one line of JSON becomes, in a CSV, either a non-normalized array stored as a string in a single cell, or an expanded array with a single value per cell, creating $array_size rows. You can normalize data in just about any structured format, but columns aren't the be-all and end-all normalization format. I think pandas uses "frames".
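The two CSV options mentioned above can be sketched like this, using only the standard library. The record and its field names are invented for illustration:

```python
import json

# Hypothetical record with a nested array, as it might appear in one JSON line.
record = {"id": 7, "tags": ["a", "b", "c"]}

# Option 1: keep one row and stuff the array into a single cell as a string.
as_string_cell = [record["id"], json.dumps(record["tags"])]

# Option 2: expand the array, one value per cell, producing len(tags) rows.
exploded = [(record["id"], tag) for tag in record["tags"]]
```

Option 1 keeps the row count but hides structure inside a string; option 2 is the normalized shape, at the cost of repeating the id on every row.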

wpollock

My mental model of columnar storage is as the old notion of parallel arrays, which I used in the 1970s with FORTRAN. Whatever you learned first sticks with you and you end up translating everything to that, or at least I do. I believe this is known as the baby duck syndrome.
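For anyone who has not met parallel arrays, a tiny sketch in Python rather than FORTRAN; the data is made up. The shared index plays the role of the row key:

```python
# Three parallel arrays: position i in each list describes the same record.
ids = [1, 2, 3]
names = ["ada", "bob", "cy"]
ages = [36, 41, 29]


def row(i):
    """Rebuild record i by reading the same index from every array."""
    return (ids[i], names[i], ages[i])


# A single-column aggregate touches only one array.
total_age = sum(ages)
second = row(1)
```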
