Unlimited OCR: One-shot long-horizon parsing

ingve 460 points 104 comments June 23, 2026

Discussion Highlights (18 comments)

Oras

OCR was solved a long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel? I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?

robotswantdata

Very interesting. The way I understand it, the researchers found a clever architectural hack to stop the AI from hoarding memory when reading long documents. Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or hits a length cap). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together. Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths. Global Reference: the AI keeps full, uncompromised sight of the original document image so it never loses context. Local Generation: the AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest. This will be very interesting for local AI, and I can’t wait to see what the community builds and extends with it!
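
A minimal sketch of the attention pattern as described in the comment above, not the paper's implementation. It assumes the document image is encoded as a fixed block of reference tokens and that each generated token may look at all of them plus only the last `window` generated tokens. The names `rswa_mask`, `cached_entries`, and `window` are hypothetical and chosen for illustration only.

```python
def rswa_mask(num_ref: int, num_gen: int, window: int) -> list[list[bool]]:
    """Build a boolean attention mask for the generated tokens.

    Row i describes generated token i. The first num_ref columns are the
    reference (image) tokens, the remaining num_gen columns are generated
    tokens. True means "may attend to".
    """
    mask = []
    for i in range(num_gen):
        # Global reference path: every reference token stays visible.
        ref_part = [True] * num_ref
        # Local generation path: causal, limited to the last `window` tokens
        # (including the current one).
        gen_part = [(i - window < j <= i) for j in range(num_gen)]
        mask.append(ref_part + gen_part)
    return mask


def cached_entries(num_ref: int, step: int, window: int, sliding: bool) -> int:
    """Number of KV entries held after `step` generated tokens.

    With full attention the cache grows with every generated token; with the
    sliding window (an assumption of this sketch) it is bounded by
    num_ref + window.
    """
    kept = min(step, window) if sliding else step
    return num_ref + kept


if __name__ == "__main__":
    for row in rswa_mask(num_ref=3, num_gen=5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(cached_entries(1000, 50_000, 128, sliding=False))
    print(cached_entries(1000, 50_000, 128, sliding=True))
```

Under these assumptions the cache size stops depending on how much text has been written so far, which is the property the comment attributes to R-SWA. The token counts in the usage lines are made-up illustration values.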

KitN

"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas." Class Act.

manipalite

Whatever happened to Reducto? It was very promising 12-15 months ago.

ramon156

I love that the entire goal is to push Deepseek OCR further. The west can learn greatly from these companies

pmarreck

My attempts at using AI to do OCR have always resulted in invented artifacts, which is not feasible for production. Does this suffer from that as well? A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect.

overflowy

What are the requirements for running this locally?

alansaber

We've invented chunking? We are so back.

peatmoss

I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. The phone camera scans I made are okay, but they are fixed in size and have a lot of artifacts. It would be great to transpose on the fly for e.g. Bb or Eb instruments, but with a scan this is obviously not possible. I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (when actually looking at music, that is; LLMs do okay at text descriptions of theory concepts, where you can imagine some online texts making it into the training data). I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. MIDI doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest to a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. The LilyPond format is less verbose than MusicXML and contains a bit more information that is useful to score creators, but most people don't create sheet music in LilyPond. (As an aside, LilyPond bums me out with the state of jazz fonts. I hate looking at "legit" scores in a jazz context.) I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.

shevy-java

Is this an academic paper that is published in year xyz, but in +5 years nobody will remember it anymore?

janpeuker

Paper under https://arxiv.org/abs/2606.23050 (As a side note, I do OCR locally as a small RAG for citations I read in books and also chunk the input, but merely to save RAM. Interesting that this natural approach also works in a streaming model.)

piterrro

Can someone explain how this is different from feeding the VLM one page at a time?

novoreorx

FYI, "Unlimited OCR Works" is a Fate/stay night reference. The original "Unlimited Blade Works" is a magic whose entire premise is copying weapons other people forged

gettingoverit

How does it compare against FineReader? Comparisons against transformer-based OCRs don't really tell me anything. The last time I checked, none of them were of "OCR this legal document" quality.

aliljet

How does this compare with Infinity Parser 2, which seemed to be running the table on every other OCR tool (https://huggingface.co/datasets/allenai/olmOCR-bench)? To be fair, there's no single winning OCR benchmark and this isn't showing up anywhere yet.

lacoolj

This looks more promising than what Mistral just launched (coincidence?????? i think not.) This approach feels like it could be used for image gen as well (in some combination). Read/view image, start drawing image using illustrator/inkscape/etc (or just SVG), then fill in with what was missed after

arboles

I'm going to sound like I live under a rock, but what is the true reason companies open-source genuinely good software? Shouldn't Baidu (or Google) hoard it for themselves to extract the value in a way the competition isn't able to imitate?

jbarrow

I'm always glad to see more multi-page work in VLM-based OCR. Especially single-pass. One of the few other multi-page papers from recently, MinerU-Popo, treats fixing up multi-page outputs as a post-processing correction step ( https://arxiv.org/abs/2605.24973 ). Interesting to see the drop-off in quality as you up page count, though. I also think the attention approach (always attend to the image/prefix, with a sliding window for local context) is neat! I do wish they updated their comparison table to include more recent work (that scores marginally better on OmniDocBench), like dots.mocr.
