Take a look at the Python layout-parser package. It works very well for me.

<https://layout-parser.github.io/>
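
For reference, the masking workaround Ger describes in the quoted message below can be sketched roughly like this. It is only a minimal illustration under my own assumptions, not anything tesseract or layout-parser provides: the page is a plain nested list of grayscale pixels, the column boxes are already known (for example from a layout detection step), and the names mask_all_but, reading_order and ocr_page_by_columns are hypothetical helpers I made up for the example. The OCR step is passed in as a function so the sketch does not depend on any particular binding.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed conventions for this sketch (not taken from tesseract itself):
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]          # grayscale rows, 0 = black, 255 = white


def mask_all_but(image: Image, keep: Box, background: int = 255) -> Image:
    """Return a copy of the page with everything outside `keep` whited out.

    The page size is unchanged, so coordinates reported by the OCR engine
    still refer to the full page.
    """
    left, top, right, bottom = keep
    return [
        [
            px if (top <= y < bottom and left <= x < right) else background
            for x, px in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]


def reading_order(boxes: Sequence[Box]) -> List[Box]:
    """Order column boxes left to right, then top to bottom.

    This is a simplifying assumption; real magazine layouts may need
    something smarter.
    """
    return sorted(boxes, key=lambda b: (b[0], b[1]))


def ocr_page_by_columns(
    image: Image, boxes: Sequence[Box], ocr: Callable[[Image], str]
) -> str:
    """Run `ocr` once per column on a masked copy and stitch the results."""
    parts = []
    for box in reading_order(boxes):
        text = ocr(mask_all_but(image, box)).strip()
        if text:
            parts.append(text)
    # Each column's text follows the previous one instead of being interleaved.
    return "\n\n".join(parts)
```

In practice the ocr argument could wrap something like pytesseract.image_to_string after converting the masked pixels to a PIL image, and the boxes would come from whatever layout analysis you use. Keeping the masked image at full page size is what makes the per-column coordinates line up with the original page.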

On August 19, 2024 8:03:53 AM PDT, tesseract-ocr@googlegroups.com wrote:
>=============================================================================
>Today's topic summary
>=============================================================================
>
>Group: tesseract-ocr@googlegroups.com
>Url: 
>https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!forum/tesseract-ocr/topics
>
>  - Text-wrap recognition [2 Updates]
>    http://groups.google.com/group/tesseract-ocr/t/cc356d9a368e220c
>
>
>=============================================================================
>Topic: Text-wrap recognition
>Url: http://groups.google.com/group/tesseract-ocr/t/cc356d9a368e220c
>=============================================================================
>
>---------- 1 of 2 ----------
>From: Ajg <ajg749...@gmail.com>
>Date: Aug 18 09:48AM -0700
>Url: http://groups.google.com/group/tesseract-ocr/msg/13d38085da0f4
>
>I have data from various old (1920) magazines with multiple blocks of text 
>on a single page. Right now, OCR reads the text lines straight across the 
>page, so the output is interspersed rather than wrapped to the next column. 
>Is there any way to get the OCR text concatenated, with one block following 
>the next? Note: these are not all fixed-size columns. I tried all the 
>pagesegmodes, but the best I get is interspersed text.
>
>
>---------- 2 of 2 ----------
>From: Ger Hobbelt <ger.hobb...@gmail.com>
>Date: Aug 19 11:25AM +0200
>Url: http://groups.google.com/group/tesseract-ocr/msg/173a48a645004
>
>Regrettably, the only way I know with current tesseract is to work around
>the issue: create a column mask and apply it in a preprocessing step, feeding
>tesseract several images for a single page, one per column, with the other
>columns Tipp-Exed (whited out, i.e. replaced by background-color rectangles)
>so that tesseract's hOCR and TSV outputs produce coordinates matching the
>entire page. Then collect the tesseract results for each image and stitch
>them together to reflow the text in a postprocessing step.
>
>Tesseract doesn't have a sophisticated page layout analysis module on board
>so one is forced to use external means for that.
>
>HTH,
>
>Ger
>
>--
>You received this digest because you're subscribed to updates for this group. 
>You can change your settings on the group membership page: 
>https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!forum/tesseract-ocr/join.
>To unsubscribe from this group and stop receiving emails from it send an email 
>to tesseract-ocr+unsubscr...@googlegroups.com.
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/35066E65-2B68-4577-A432-AE21EECEFD92%40gmail.com.
