[tesseract-ocr] Parameters to improve detection of sparse text

Scaly Green Orc Mon, 24 Apr 2023 23:22:32 -0700

Hi there,

I'm trying to OCR VA charts such as this one: 
<https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_20_APR_2023/Atlas-VAC/PDF_AIPparSSection/VACH/AD/AD-3.HANO.pdf>
(the text layer is FUBAR so I'm resorting to OCR).

I'm running in sparse text mode (PSM=11). The charts contain a lot of text,
but I only care about a small subset of it. I'm running the recognition on
grayscale images rendered from the PDF, so I don't expect image quality to
be a problem, although the results do vary with the DPI I render at.
It mostly works fine.
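
For reference, a minimal sketch of the invocation described above, assuming the `tesseract` command-line tool is used. The helper name `sparse_text_cmd`, the file name `page2.png` and the DPI values are hypothetical; only the `--psm` and `--dpi` options come from tesseract itself.

```python
import shlex


def sparse_text_cmd(image_path, dpi=300):
    """Build a tesseract command line for sparse-text OCR (PSM 11).

    Hypothetical helper: it only assembles arguments, it does not run OCR.
    """
    return [
        "tesseract",
        image_path,         # grayscale page image rendered from the PDF
        "stdout",           # write the recognised text to standard output
        "--psm", "11",      # sparse text mode
        "--dpi", str(dpi),  # resolution the image was rendered at
    ]


if __name__ == "__main__":
    # Print the command for two render resolutions so results can be compared.
    for dpi in (300, 600):
        print(shlex.join(sparse_text_cmd("page2.png", dpi)))
```

The list can be passed to `subprocess.run(..., capture_output=True, text=True)` to collect the output for each DPI.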

But some bits get chopped off or go unrecognised when (I think) there's too
much whitespace or too little text. In the chart linked above, for
instance, in the numbered list at the bottom of the second page, the
numbers in the first column are not recognised. So I get "Exploitant
/Operator" instead of "1 - Exploitant /Operator". It does work when the
number has two digits, e.g. "10 - Exploitant /Operator". This leads me to
believe that my problem is with small blocks of text and/or large gaps.

I've tried using parameters `preserve_interword_spaces` and
`textord_space_size_is_variable`, seemingly to no avail.
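
A sketch of how those two parameters can be passed on the command line with tesseract's `-c VAR=VALUE` option. The value `1` for each parameter, the helper name `config_args` and the file name `page2.png` are assumptions for illustration, not necessarily the exact settings tried.

```python
def config_args(params):
    """Turn {"name": value} into repeated `-c name=value` arguments."""
    args = []
    for name, value in params.items():
        args += ["-c", f"{name}={value}"]
    return args


# Assumed values: 1 switches each parameter on.
tried = {
    "preserve_interword_spaces": 1,
    "textord_space_size_is_variable": 1,
}

# Sparse text mode plus the two parameters, printed as a shell-style command.
cmd = ["tesseract", "page2.png", "stdout", "--psm", "11"] + config_args(tried)
print(" ".join(cmd))
```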

*Could someone please tell me which parameters I could play with to improve
the detection of sparse chunks or increase the engine's tolerance for
whitespace?*

If you have any other suggestions on how to improve the OCR, I'll gladly
take them as well.

Kind regards,
Orc.
