On Tuesday, 25 April 2023 at 09:06:20 UTC+2 zdenop wrote:

First of all: this input is a regular PDF (i.e. it contains text rather than an image), so IMO it should be easier to extract accurate text from the file directly instead of OCRing it...
Next: tesseract can handle simple layout analysis (e.g. book pages), but for complex layouts like that PDF you need custom page layout analysis/segmentation (e.g. to split the input image into homogeneous text blocks/paragraphs/lines). For example, when I OCR just the description on page 2 (where you mentioned errors), I get this output:

> tesseract page2_description.png - --psm 11
1- Exploitant / Operator : 6 - Hangars disponibles / Hangars available : NIL Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00 7 - Réparations / Repairs facility : NIL 2 - CAA : DSAC Centre-Est (voir/see GEN) 8 -Type de surface / Surface : béton /concrete 3-AVT:NIL 9 - Force portante / Strength: 4 1. 4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg. 5 - Police - Douanes / Police - Customs : NIL

[image: page2_description.png]

Zdenko

Zdenko,

Thank you for your reply.

Yes, it's a regular PDF, but most (not always all) of the text is borked. Try to copy/paste text from it and you'll see. I looked around for ways to salvage it, and OCR seemed to be the most consistently recommended approach in such cases. I need to process a stack of these programmatically.

I hear you on segmentation as a solution, i.e. extracting the relevant blocks and OCRing those. I was hoping I could avoid that additional effort.

What I find vexing is that it /almost/ works. I was hoping there might be things I could tweak in tesseract's layout analysis. For instance, isn't there a threshold setting somewhere that makes it ignore the "1 - " in [image: Screenshot 2023-04-25 142634.png] when it has to consider it as part of the whole page? As in, how much whitespace is acceptable?

I've gone through the whole list of tesseract parameters (tesseract --print-parameters) and tried tweaking those that seemed promising, but hardly any made a difference. It's not readily clear which parameters are relevant for which usage.

Orc.
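
For anyone wanting to try the segmentation route on a stack of pages, here is a rough sketch in Python of the idea discussed above: split a page region wherever the whitespace gap is wider than a threshold, then run tesseract on each block with the same command line as in the quoted example. All function and parameter names (split_on_gaps, min_gap, build_tesseract_cmd, run_tesseract) are made up for illustration, and the gap threshold is an assumption you would have to tune per document. Computing the per-column ink counts and cropping the image are left out, since they depend on the imaging library you use.

```python
import subprocess


def split_on_gaps(ink_per_column, min_gap):
    """Return (start, end) column ranges separated by wide whitespace gaps.

    ink_per_column[i] is the number of dark pixels in pixel column i.
    A new block starts when at least min_gap consecutive empty columns
    separate two inked columns. The end index is exclusive.
    """
    blocks = []
    start = None     # first column of the block being built
    last_ink = None  # most recent column that contained ink
    for x, ink in enumerate(ink_per_column):
        if ink == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run before this column is wide enough: close the block.
            blocks.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


def build_tesseract_cmd(image_path, psm=11, lang=None):
    """Build the command from the thread: tesseract <image> - --psm <psm>."""
    cmd = ["tesseract", str(image_path), "-", "--psm", str(psm)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_tesseract(image_path, psm=11, lang=None):
    """Run tesseract on one cropped block and return the text from stdout.

    Requires the tesseract binary on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Toy profile: two inked runs close together, then a wide gap, then another run.
profile = [0, 3, 4, 0, 2, 0, 0, 0, 0, 5, 5, 0]
print(split_on_gaps(profile, min_gap=3))  # [(1, 5), (9, 11)]
```

With min_gap=3 the one-column gap inside the first run is tolerated and only the four-column gap splits the region, which is the "how much whitespace is acceptable" knob, just applied before tesseract sees the image. The same function can be applied to a per-row ink profile to split a block into lines or paragraphs.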