First of all: this input is a regular PDF (i.e. it contains real text, not
just an image), so IMO it should be easier to extract accurate text directly
from the file than to OCR it...
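
A minimal sketch of that check, assuming poppler's `pdftotext` is installed and on the PATH (it is not part of tesseract). The function names and the `looks_usable` heuristic with its 0.9 threshold are hypothetical, only meant to flag a text layer that is mostly garbage:

```python
import subprocess


def pdftotext_cmd(pdf_path, page):
    """Build a pdftotext command that writes one page's text layer to stdout."""
    return ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"]


def extract_page_text(pdf_path, page):
    """Run pdftotext (assumed to be installed) and return the page's text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page), capture_output=True, text=True, check=True
    )
    return result.stdout


def looks_usable(text, min_ratio=0.9):
    """Crude, arbitrary heuristic: share of 'ordinary' characters in the text.

    A broken text layer often yields replacement or private-use characters,
    which are neither alphanumeric nor common punctuation.
    """
    if not text.strip():
        return False
    ordinary = sum(
        1 for ch in text if ch.isalnum() or ch.isspace() or ch in ".,:;/-()'\""
    )
    return ordinary / len(text) >= min_ratio
```

If `looks_usable(extract_page_text("chart.pdf", 2))` comes back false (`chart.pdf` being a placeholder name), falling back to OCR is reasonable.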

Next: tesseract can handle simple layout analysis (e.g. book pages), but
for complex layouts like this PDF you need custom page layout
analysis/segmentation (e.g. to split the input image into homogeneous text
blocks/paragraphs/lines). For example, when I OCR just the description on
page 2 (where you mentioned errors), I get this output:

> tesseract page2_description.png - --psm 11
1- Exploitant / Operator :

6 - Hangars disponibles / Hangars available : NIL

Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00

7 - Réparations / Repairs facility : NIL

2 - CAA : DSAC Centre-Est (voir/see GEN)

8 -Type de surface / Surface : béton /concrete

3-AVT:NIL

9 - Force portante / Strength: 4 1.

4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg.

5 - Police - Douanes / Police - Customs : NIL

[image: page2_description.png]
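
Running tesseract on one block at a time can be scripted roughly like this. It is only a sketch: it assumes poppler's `pdftoppm` is available for rendering a cropped region, that you already know the block's bounding box (given here in PDF points, measured from the top-left corner of the page), and all function names are made up for illustration:

```python
import subprocess


def points_to_pixels(box_pt, dpi):
    """Convert an (x, y, width, height) box from PDF points (1/72 inch) to pixels."""
    return tuple(round(v * dpi / 72) for v in box_pt)


def render_region_cmd(pdf_path, page, box_px, dpi, out_prefix):
    """Build a pdftoppm command that renders only the given pixel box of one page."""
    x, y, w, h = box_px
    return [
        "pdftoppm", "-r", str(dpi), "-f", str(page), "-l", str(page),
        "-x", str(x), "-y", str(y), "-W", str(w), "-H", str(h),
        "-png", "-singlefile", pdf_path, out_prefix,
    ]


def tesseract_cmd(image_path, psm=11):
    """Build the same kind of tesseract call as in this mail (text to stdout)."""
    return ["tesseract", image_path, "-", "--psm", str(psm)]


def ocr_region(pdf_path, page, box_pt, dpi=300, out_prefix="region", psm=11):
    """Render one region to <out_prefix>.png and OCR it; needs both tools installed."""
    box_px = points_to_pixels(box_pt, dpi)
    subprocess.run(render_region_cmd(pdf_path, page, box_px, dpi, out_prefix), check=True)
    result = subprocess.run(
        tesseract_cmd(out_prefix + ".png", psm), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For a crop that contains a single homogeneous block, `--psm 6` (single uniform block of text) may also be worth trying instead of 11.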

Zdenko


On Tue, 25 Apr 2023 at 8:22, Scaly Green Orc <npc1...@gmail.com> wrote:

> Hi there hello,
>
> I'm trying to OCR VA charts such as this one:
> <https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_20_APR_2023/Atlas-VAC/PDF_AIPparSSection/VACH/AD/AD-3.HANO.pdf>
> (the text layer is FUBAR so I'm resorting to OCR).
>
> I'm running in sparse text mode (PSM=11). There's a lot of text but I care
> only about a small subset. I'm running the recognition on grayscale images
> taken from the PDF. I reckon I shouldn't have a problem with image quality,
> although I do notice different results depending on how much DPI I allow.
> It works mostly fine.
>
> But I'm having issues with bits being chopped off / not recognised when (I
> think) there's too much space or too little text. In the chart linked
> above, for instance, in the text at the bottom of the second page (numbered
> list), the numbers of the first column do not get recognised. So, for
> instance, I get "Exploitant /Operator" instead of "1 - Exploitant
> /Operator". Then it will work if it's, say "10 - Exploitant /Operator" (two
> digits). Which leads me to believe that my problem is with small blocks
> and/or lots of space.
>
> I've tried using parameters `preserve_interword_spaces` and
> `textord_space_size_is_variable`, seemingly to no avail.
>
> *Could someone please tell me which parameters I could play with to
> improve the detection of sparse chunks or increase the engine's tolerance
> for whitespace?*
>
> If you have any other suggestion as to how to improve the OCR, I'll gladly
> take it as well.
>
> Kind regards,
>  Orc.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4b8913d4-8c37-42e3-8f8d-04be241cd45fn%40googlegroups.com
> .
>

