Hi there, I'm trying to OCR VA charts such as this one: <https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_20_APR_2023/Atlas-VAC/PDF_AIPparSSection/VACH/AD/AD-3.HANO.pdf> (the text layer is FUBAR, so I'm resorting to OCR).
I'm running in sparse text mode (PSM=11). There's a lot of text, but I only care about a small subset. I'm running the recognition on grayscale images rendered from the PDF. I don't think image quality is a problem, although I do notice different results depending on the DPI I render at.

It mostly works fine, but I'm having issues with bits being chopped off or not recognised when (I think) there's too much whitespace or too little text. In the chart linked above, for instance, in the numbered list at the bottom of the second page, the numbers in the first column do not get recognised. So I get "Exploitant /Operator" instead of "1 - Exploitant /Operator". It does work when the number has two digits, say "10 - Exploitant /Operator". This leads me to believe that my problem is with small blocks and/or lots of whitespace.

I've tried the parameters `preserve_interword_spaces` and `textord_space_size_is_variable`, seemingly to no avail.

*Could someone please tell me which parameters I could play with to improve the detection of sparse chunks or increase the engine's tolerance for whitespace?*

If you have any other suggestions on how to improve the OCR, I'll gladly take them as well.

Kind regards,
Orc.
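
For reference, the setup described above corresponds roughly to the Tesseract command line that this sketch builds. The helper `build_tesseract_cmd`, the file name `page2.png`, the 300 DPI value, and the parameter values of `1` are illustrative assumptions, not the exact script or settings in use.

```python
import shlex


def build_tesseract_cmd(image_path, psm=11, dpi=300, variables=None):
    """Build an argument list for the tesseract CLI (hypothetical helper).

    'stdout' as the output base makes tesseract print the recognised text.
    """
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]
    # Each engine parameter is passed as a separate "-c NAME=VALUE" pair.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Assumed values: sparse text mode, 300 DPI, both parameters switched on.
cmd = build_tesseract_cmd(
    "page2.png",
    variables={
        "preserve_interword_spaces": 1,
        "textord_space_size_is_variable": 1,
    },
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` once a rendered page image exists, which makes it easy to vary the DPI or page segmentation mode and compare the output.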