Take a look at the Python layout-parser package. It works very well for me.
<https://layout-parser.github.io/>

On August 19, 2024 8:03:53 AM PDT, tesseract-ocr@googlegroups.com wrote:
>=============================================================================
>Topic: Text-wrap recognition
>Url: http://groups.google.com/group/tesseract-ocr/t/cc356d9a368e220c
>=============================================================================
>
>---------- 1 of 2 ----------
>From: Ajg <ajg749...@gmail.com>
>Date: Aug 18 09:48AM -0700
>Url: http://groups.google.com/group/tesseract-ocr/msg/13d38085da0f4
>
>I have data that comes in from various old (1920) magazines that has
>multiple blocks of text on a single page. Right now, OCR recognition
>interprets the text lines across the page so the output is interspersed
>rather than word-wrapped to the next column. Is there any way to get the
>OCR scanned text concatenated with one block following the next block?
>Note: these are not all fixed size columns. I tried all the pagesegmodes
>but the best I get is interspersed text.
>
>
>---------- 2 of 2 ----------
>From: Ger Hobbelt <ger.hobb...@gmail.com>
>Date: Aug 19 11:25AM +0200
>Url: http://groups.google.com/group/tesseract-ocr/msg/173a48a645004
>
>Regrettably the only way I know with current tesseract is to work around
>the issue, i.e. create a column mask and apply that in a preprocess, hence
>feeding tesseract several images for a single page, one for each column
>where the other columns are tipexed (white-out, replaced by background
>color rectangles) so tesseract hOCR and tsv outputs will produce
>coordinates matching the entire page. Then collect the tesseract results
>for each image and stitch them together to reflow the text in a
>postprocess.
>
>Tesseract doesn't have a sophisticated page layout analysis module on board
>so one is forced to use external means for that.
>
>HTH,
>
>Ger
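
A minimal sketch of the mask-and-stitch workaround Ger describes in the quoted message, in plain Python with no OCR calls. The names `mask_outside` and `reflow`, the `(left, top, right, bottom)` box format, and the `line_tol` parameter are illustrative assumptions, not part of Tesseract or layout-parser. The word dictionaries only borrow the `left`, `top`, `width`, `height`, and `text` field names from Tesseract's TSV output. It also assumes the column boxes are already known (for example from a layout detector) and that reading order is simply columns left to right, each read top to bottom.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def mask_outside(page: List[List[int]], box: Box, background: int = 255) -> List[List[int]]:
    """Return a copy of a grayscale page (rows of pixel values) in which
    everything outside `box` is painted with the background value.
    The page size is unchanged, so OCR coordinates still refer to the full page."""
    left, top, right, bottom = box
    return [
        [px if (top <= y < bottom and left <= x < right) else background
         for x, px in enumerate(row)]
        for y, row in enumerate(page)
    ]


def reflow(words: List[Dict], columns: List[Box], line_tol: int = 10) -> str:
    """Stitch word boxes into reading order: columns left to right,
    and within each column lines top to bottom.

    A word belongs to a column if its center point falls inside the column box.
    Words whose `top` is within `line_tol` pixels of the first word of the
    current line are treated as being on that line (an assumed heuristic)."""
    out: List[str] = []
    for left, top, right, bottom in sorted(columns):
        in_col = [
            w for w in words
            if left <= w["left"] + w["width"] / 2 < right
            and top <= w["top"] + w["height"] / 2 < bottom
        ]
        in_col.sort(key=lambda w: (w["top"], w["left"]))

        lines: List[List[Dict]] = []
        for w in in_col:
            if lines and abs(w["top"] - lines[-1][0]["top"]) <= line_tol:
                lines[-1].append(w)
            else:
                lines.append([w])

        for line in lines:
            ordered = sorted(line, key=lambda w: w["left"])
            out.append(" ".join(w["text"] for w in ordered))
    return "\n".join(out)
```

In use, each masked copy from `mask_outside` would be fed to Tesseract separately, and the word boxes collected from all runs passed to `reflow` together with the same column boxes. Pages where blocks are stacked vertically across columns would need a smarter ordering than the plain left-to-right sort assumed here.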