The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or the output gets capped). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
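A toy sketch of how I picture those two paths, not the paper's actual code: an attention mask where every generated token can see all of the reference (image) tokens but only the last few generated tokens. The function names, the token counts, and the window of 128 are my own assumptions for illustration.

```python
def rswa_mask(num_ref: int, num_gen: int, window: int) -> list[list[bool]]:
    """Build a toy R-SWA-style mask (hypothetical helper, not from the paper).

    Row i is the i-th generated token. Columns are laid out as
    [reference tokens | generated tokens]. True means "may attend to".
    """
    total = num_ref + num_gen
    mask = []
    for i in range(num_gen):
        row = [False] * total
        # Global reference: every image token stays visible.
        for j in range(num_ref):
            row[j] = True
        # Local generation: only the last `window` generated tokens
        # (including the current one) stay visible.
        first_visible = max(0, i - window + 1)
        for g in range(first_visible, i + 1):
            row[num_ref + g] = True
        mask.append(row)
    return mask


def kv_cache_entries(num_ref: int, step: int, window=None) -> int:
    """Cached positions after `step` generated tokens.

    window=None models ordinary full attention, where the cache grows with
    every token; an integer window models the capped, sliding-window case.
    """
    kept_generated = step if window is None else min(step, window)
    return num_ref + kept_generated


if __name__ == "__main__":
    # Assumed numbers: 1000 image tokens, 50000 generated tokens.
    print(kv_cache_entries(1000, 50_000))              # full attention
    print(kv_cache_entries(1000, 50_000, window=128))  # sliding window
```

The point of the second function is that the cache stops depending on how much text has been produced: with a window it is bounded by the reference size plus the window size, however long the document runs.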
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
robotswantdata
I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. And the phone camera scans I made are okay, but fixed in size and have a lot of artifacts. And it would be great to transpose on the fly for e.g. Bb or Eb instruments, but being a scan this is obviously not possible.
I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music, that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in).
I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. MIDI doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest to a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. The LilyPond format is less verbose than MusicXML and contains a bit more information that is useful to score creators, but most people don't create sheet music in LilyPond. (As an aside, LilyPond bums me out with the state of jazz fonts. I hate looking at "legit" scores in a jazz context.)
I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.
peatmoss
"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas."
Class Act.
KitN
FYI, "Unlimited OCR Works" is a Fate/stay night reference. The original "Unlimited Blade Works" is a magic whose entire premise is copying weapons other people forged.
novoreorx
This looks more promising than what Mistral just launched (coincidence? I think not).
This approach feels like it could be used for image gen as well (in some combination): read/view the image, start drawing it using Illustrator/Inkscape/etc. (or just SVG), then fill in what was missed afterwards.
lacoolj
(As a side note, I do OCR locally as a small RAG for citations I read in books and also chunk the input, but merely to save RAM. Interesting that this natural approach also works in a streaming model.)
janpeuker
How does this compare with Infinity Parser 2, which seemed to be running the table on every other OCR tool (https://huggingface.co/datasets/allenai/olmOCR-bench)? To be fair, there's no single winning OCR benchmark, and this isn't showing up anywhere yet.
aliljet
I'm going to sound like I live under a rock, but what is the true reason companies open-source genuinely good software?
Shouldn't Baidu (or Google) hoard it for themselves to extract the value in a way the competition isn't able to imitate?
arboles
My attempts at using AI to do OCR have always resulted in invented artifacts, which is not feasible for production. Does this suffer from that as well?
A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect.
pmarreck
Whatever happened to Reducto? It was very promising 12-15 months ago.
manipalite