First of all: this input is a regular PDF (i.e. it contains text rather than just an image), so IMO it should be easier to extract accurate text directly from the file than to OCR it.
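A minimal sketch of how the text layer could be checked before resorting to OCR, assuming poppler's `pdftotext` command-line tool is installed. The helper names `pdftotext_cmd` and `extract_page_text` are made up for this illustration. If the output turns out to be garbage (as the original poster reports for this file), OCR remains the fallback.

```python
import subprocess
from typing import List


def pdftotext_cmd(pdf_path: str, page: int) -> List[str]:
    """Build a pdftotext call that dumps one page to stdout.

    -f / -l select the first and last page, -layout tries to keep the
    physical layout, and "-" as the output file means stdout.
    """
    return ["pdftotext", "-f", str(page), "-l", str(page),
            "-layout", pdf_path, "-"]


def extract_page_text(pdf_path: str, page: int) -> str:
    """Run pdftotext (assumed to be on PATH) and return the page text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # The file name is taken from the chart URL in the thread; the page
    # number matches the page discussed here.
    print(extract_page_text("AD-3.HANO.pdf", 2))
```

Eyeballing the printed text against the PDF is usually enough to tell whether the text layer is usable.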
Next: tesseract can handle simple layout analysis (e.g. book pages), but for a complex layout like the one in that PDF you need custom page layout analysis/segmentation (e.g. to split the input image into homogeneous text blocks, paragraphs, or lines). For example, when I OCR just the description on page 2 (where you mentioned the errors), I get this output:

    > tesseract page2_description.png - --psm 11
    1- Exploitant / Operator :
    6 - Hangars disponibles / Hangars available : NIL
    Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00
    7 - Réparations / Repairs facility : NIL
    2 - CAA : DSAC Centre-Est (voir/see GEN)
    8 -Type de surface / Surface : béton /concrete
    3-AVT:NIL
    9 - Force portante / Strength: 4 1.
    4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg.
    5 - Police - Douanes / Police - Customs : NIL

[image: page2_description.png]

Zdenko

On Tue, 25 Apr 2023 at 8:22, Scaly Green Orc <npc1...@gmail.com> wrote:

> Hi there hello,
>
> I'm trying to OCR VA charts such as this one:
> <https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_20_APR_2023/Atlas-VAC/PDF_AIPparSSection/VACH/AD/AD-3.HANO.pdf>
> (the text layer is FUBAR so I'm resorting to OCR).
>
> I'm running in sparse text mode (PSM=11). There's a lot of text but I care
> only about a small subset. I'm running the recognition on grayscale images
> taken from the PDF. I reckon I shouldn't have a problem with image quality,
> although I do notice different results depending on how much DPI I allow.
> It works mostly fine.
>
> But I'm having issues with bits being chopped off / not recognised when (I
> think) there's too much space or too little text. In the chart linked
> above, for instance, in the text at the bottom of the second page (numbered
> list), the numbers of the first column do not get recognised. So, for
> instance, I get "Exploitant /Operator" instead of "1 - Exploitant
> /Operator". Then it will work if it's, say "10 - Exploitant /Operator" (two
> digits).
> Which leads me to believe that my problem is with small blocks
> and/or lots of space.
>
> I've tried using parameters `preserve_interword_spaces` and
> `textord_space_size_is_variable`, seemingly to no avail.
>
> *Could someone please tell me which parameters I could play with to
> improve the detection of sparse chunks or increase the engine's tolerance
> for whitespace?*
>
> If you have any other suggestion as to how to improve the OCR, I'll gladly
> take it as well.
>
> Kind regards,
> Orc.
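
As a rough illustration of the custom segmentation suggested in the reply, here is a sketch of one simple heuristic: group text boxes into columns wherever a wide horizontal gap separates them, then OCR each column crop on its own. Everything here is an assumption made for the example: the box coordinates are invented, the `min_gap` threshold would need tuning per chart, and the helpers `split_into_columns`, `bounding_box` and `tesseract_cmd` are hypothetical names, not part of tesseract. In practice the boxes could come from a first sparse-text pass or from any connected-component analysis.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def split_into_columns(boxes: List[Box], min_gap: int = 40) -> List[List[Box]]:
    """Group boxes into columns separated by horizontal gaps >= min_gap."""
    columns: List[List[Box]] = []
    right_edge = None  # right-most pixel covered by the current column
    for box in sorted(boxes, key=lambda b: b[0]):
        if right_edge is None or box[0] - right_edge >= min_gap:
            columns.append([])          # gap is wide enough: new column
            right_edge = box[2]
        else:
            right_edge = max(right_edge, box[2])
        columns[-1].append(box)
    return columns


def bounding_box(boxes: List[Box]) -> Box:
    """Smallest rectangle that contains every box in the group."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def tesseract_cmd(image_path: str, psm: int = 6) -> List[str]:
    """Command line for one cropped block; PSM 6 assumes a uniform block."""
    return ["tesseract", image_path, "-", "--psm", str(psm)]


if __name__ == "__main__":
    # Invented coordinates: two text lines in each of two columns.
    boxes = [(10, 10, 300, 30), (400, 10, 700, 30),
             (10, 40, 280, 60), (400, 40, 650, 60)]
    for i, column in enumerate(split_into_columns(boxes)):
        print(i, bounding_box(column), tesseract_cmd(f"column_{i}.png"))
```

Each bounding box would then be cropped out of the page image (for instance with Pillow's `Image.crop`) and saved before running the printed command on it. Feeding tesseract one homogeneous block at a time is what the segmentation advice amounts to.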