Hi there,

I'm trying to OCR VA charts such as this one: 
<https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_20_APR_2023/Atlas-VAC/PDF_AIPparSSection/VACH/AD/AD-3.HANO.pdf>
(the text layer is FUBAR so I'm resorting to OCR).

I'm running in sparse text mode (`--psm 11`). There's a lot of text, but I
only care about a small subset. I run the recognition on grayscale images
rendered from the PDF. I reckon image quality shouldn't be a problem,
although I do notice different results depending on the DPI I render at.
It works mostly fine.
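
To make the rendering step concrete, here is a minimal sketch of how the page images could be produced. It assumes Poppler's `pdftoppm` as the renderer, which is only an illustration and not necessarily the tool in use here; the helper name `render_command`, the file names, and the DPI values are all hypothetical. It only builds and prints the commands, so the DPI can be varied and compared.

```python
import shlex


def render_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders every page as a grayscale PNG.

    Assumption: Poppler's pdftoppm is the renderer; -gray, -r and -png are
    its grayscale, resolution and PNG-output options.
    """
    return ["pdftoppm", "-gray", "-r", str(dpi), "-png", pdf_path, out_prefix]


# Print one command per candidate resolution (values are illustrative only).
for dpi in (150, 300, 600):
    print(shlex.join(render_command("AD-3.HANO.pdf", f"page-{dpi}", dpi)))
```

Each printed line can be run as-is in a shell, or passed to `subprocess.run` as a list.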

But bits get chopped off or are not recognised when (I think) there's too
much whitespace or too little text. In the chart linked above, for instance,
in the numbered list at the bottom of the second page, the numbers in the
first column are not recognised. So I get "Exploitant /Operator" instead of
"1 - Exploitant /Operator". It does work when the number has two digits,
say "10 - Exploitant /Operator", which leads me to believe that my problem
is with small blocks and/or lots of whitespace.

I've tried using parameters `preserve_interword_spaces` and 
`textord_space_size_is_variable`, seemingly to no avail. 
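
For reference, the invocation described above looks roughly like the sketch below. It assumes the `tesseract` command-line binary rather than a particular wrapper, and the helper name `tesseract_command`, the image name, and the DPI value are placeholders. The function only assembles the argument list, so nothing is executed here.

```python
import shlex


def tesseract_command(image_path, psm=11, dpi=300, config=None):
    """Build a tesseract CLI call that writes the recognised text to stdout.

    --psm selects the page segmentation mode (11 = sparse text), --dpi sets
    the assumed input resolution, and each -c NAME=VALUE sets one parameter.
    """
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]
    for name, value in (config or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# The two parameters tried so far; the image name is a placeholder.
params = {
    "preserve_interword_spaces": 1,
    "textord_space_size_is_variable": 1,
}
print(shlex.join(tesseract_command("page-2.png", config=params)))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the output for comparison between settings.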

*Could someone please tell me which parameters I could play with to improve 
the detection of sparse chunks or increase the engine's tolerance for 
whitespace?*

If you have any other suggestion as to how to improve the OCR, I'll gladly 
take it as well.

Kind regards,
 Orc.
