On Tuesday, 25 April 2023 at 09:06:20 UTC+2 zdenop wrote:

First of all - this input is a regular PDF (i.e. it contains text rather 
than an image) - IMO it should be easier to extract accurate text from the 
file directly instead of OCRing it...

Next: tesseract can handle simple layout analysis (e.g. book pages), but 
for complex layouts like that PDF, you need to use custom page layout 
analysis/segmentation (i.e. split the input image into homogeneous text 
blocks/paragraphs/lines). For example, when I OCR just the description on 
page 2 (where you mentioned errors), I get this output:

> tesseract page2_description.png - --psm 11
1- Exploitant / Operator :

6 - Hangars disponibles / Hangars available : NIL

Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00

7 - Réparations / Repairs facility : NIL

2 - CAA : DSAC Centre-Est (voir/see GEN)

8 -Type de surface / Surface : béton /concrete

3-AVT:NIL

9 - Force portante / Strength: 4 1.

4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg.

5 - Police - Douanes / Police - Customs : NIL

[image: page2_description.png]
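
The segmentation idea above could be sketched roughly as follows. This is only an illustration, not a tested recipe for this PDF: it assumes the block has already been rendered to an image, that it consists of equal-width columns (a hypothetical simplification; real pages usually need measured boxes), and that Pillow, pytesseract and the fra/eng language data are installed. The names column_boxes and ocr_columns are made up for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def column_boxes(width: int, height: int, n_cols: int = 2) -> List[Box]:
    """Split a page of the given size into n_cols equal-width vertical strips.

    Equal-width columns are an assumption made for illustration only.
    """
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    # Integer arithmetic keeps the strip edges on exact pixel boundaries.
    edges = [i * width // n_cols for i in range(n_cols + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_cols)]


def ocr_columns(image_path: str, n_cols: int = 2, lang: str = "fra+eng") -> str:
    """OCR each column separately and join the results left to right."""
    # Imported lazily so the geometry helper works without OCR installed.
    from PIL import Image
    import pytesseract

    page = Image.open(image_path)
    texts = []
    for box in column_boxes(page.width, page.height, n_cols):
        # --psm 6 treats each crop as a single uniform block of text.
        crop = page.crop(box)
        texts.append(pytesseract.image_to_string(crop, lang=lang, config="--psm 6"))
    return "\n".join(texts)
```

Something like ocr_columns("page2_description.png") would then return the left column's text followed by the right column's, instead of the interleaved lines shown above.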

Zdenko


Zdenko,

Thank you for your reply.

Yes, it's a regular PDF. But most (not always all) of the text is borked. 
Try to copy/paste text from it and you'll see. I looked around for 
solutions to salvage it, and OCR seemed to be the approach most 
consistently recommended in such cases. I need to process a stack of these 
programmatically.

I hear you on the segmentation as a solution, i.e. extracting relevant 
blocks and OCRing those. I was hoping I could avoid that additional 
effort. What I find vexing is that it /almost/ works. I was hoping there 
might be things I could tweak about tesseract's analysis. For instance, 
isn't there a threshold setting somewhere that makes it ignore the "1 - " 
in [image: Screenshot 2023-04-25 142634.png] when it has to consider it as 
part of the whole page? As in, how much whitespace is acceptable? I've gone 
through the whole list of tesseract parameters (tesseract --print-parameters) 
and tried to tweak those that seemed promising... but hardly any seemed to 
make any difference. It's not readily clear which parameters are relevant 
for what usage.
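
For reference, this is roughly how such a parameter sweep can be scripted. It is a minimal sketch that assumes the tesseract binary is on PATH and only uses the CLI's --psm and -c name=value options; build_command and run_variants are hypothetical helper names, and preserve_interword_spaces in the usage note is just an example parameter, not one known to fix this layout.

```python
import subprocess
from typing import Dict, List, Optional, Sequence


def build_command(image: str, psm: int,
                  params: Optional[Dict[str, str]] = None) -> List[str]:
    """Build a tesseract command line that writes recognized text to stdout."""
    # "-" as the output base tells tesseract to print to stdout.
    cmd = ["tesseract", image, "-", "--psm", str(psm)]
    for name, value in (params or {}).items():
        # Each config variable is passed as its own -c name=value pair.
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_variants(image: str, psm_modes: Sequence[int] = (3, 4, 6, 11),
                 params: Optional[Dict[str, str]] = None) -> Dict[int, str]:
    """Run tesseract once per page segmentation mode and collect the text."""
    results = {}
    for psm in psm_modes:
        proc = subprocess.run(build_command(image, psm, params),
                              capture_output=True, text=True, check=True)
        results[psm] = proc.stdout
    return results
```

Calling run_variants("page2.png", params={"preserve_interword_spaces": "1"}) and comparing the outputs side by side at least shows quickly whether a given parameter changes anything.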

Orc.
  
