Re: [tesseract-ocr] Parameters to improve detection of sparse text

Scaly Green Orc Tue, 25 Apr 2023 05:30:58 -0700

On Tuesday, 25 April 2023 at 09:06:20 UTC+2 zdenop wrote:

First of all, this input is a regular PDF (i.e. it contains text rather
than an image). IMO it should be easier to extract accurate text from the
file than to OCR it...

Next: tesseract can handle simple layout analysis (e.g. book pages), but
for complex layouts like that PDF, you need to use custom page layout
analysis/segmentation (e.g. to split the input image into homogeneous text
blocks/paragraphs/lines). For example, when I OCR just the description on
page 2 (where you mentioned errors), I get this output:

> tesseract page2_description.png - --psm 11
1- Exploitant / Operator :

6 - Hangars disponibles / Hangars available : NIL

Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00

7 - Réparations / Repairs facility : NIL

2 - CAA : DSAC Centre-Est (voir/see GEN)

8 -Type de surface / Surface : béton /concrete

3-AVT:NIL

9 - Force portante / Strength: 4 1.

4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg.

5 - Police - Douanes / Police - Customs : NIL

[image: page2_description.png]

Zdenko
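
One way to approximate the segmentation suggested above without cropping
images by hand is to ask tesseract for TSV output (for example
`tesseract page2_description.png - --psm 11 tsv`), which carries a bounding
box for each word, and to regroup the words afterwards. The sketch below is
only that, a sketch: it assumes the description block has two columns that
can be separated at a known x coordinate (`split_x`, a hypothetical value
that would have to be measured per document), and the helper names and the
line tolerance are illustrative, not part of tesseract.

```python
import csv
import io


def parse_tsv(tsv_text):
    """Return the word boxes found in tesseract TSV output as dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; lower levels describe pages, blocks, lines.
        if row.get("level") != "5" or not text:
            continue
        words.append({
            "left": int(row["left"]),
            "top": int(row["top"]),
            "height": int(row["height"]),
            "text": text,
        })
    return words


def group_lines(words, tolerance=0.6):
    """Group words into text lines by vertical position.

    A word joins the current line when its top edge is within
    `tolerance` * height of the first word on that line (assumed heuristic).
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        first = lines[-1][0] if lines else None
        if first and abs(word["top"] - first["top"]) <= tolerance * first["height"]:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["left"]))
        for line in lines
    ]


def read_columns(words, split_x):
    """Read the left column top to bottom, then the right column."""
    left = [w for w in words if w["left"] < split_x]
    right = [w for w in words if w["left"] >= split_x]
    return group_lines(left) + group_lines(right)
```

Feeding the TSV text through `read_columns(parse_tsv(tsv_text), split_x)`
returns the lines of the left column followed by those of the right one,
instead of the interleaved order seen in the output above.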

Zdenko,

Thank you for your reply.

Yes, it's a regular PDF. But most (though not always all) of the text is
borked. Try to copy/paste text from it and you'll see. I looked around for
solutions to salvage it, and OCR seemed to be what was most consistently
recommended in such cases. I need to process a stack of these
programmatically.

I hear you on the segmentation as a solution, i.e. extracting relevant
blocks and OCRing those. I was hoping I could avoid that additional
effort. What I find vexing is that it /almost/ works. I was hoping there
might be things I could tweak about tesseract's analysis. For instance,
isn't there a threshold setting somewhere that makes it ignore the "1 - "
in [image: Screenshot 2023-04-25 142634.png] when it has to consider it as
part of the whole page? As in, how much whitespace is acceptable? I've gone
through the whole list of tesseract parameters (tesseract --print-parameters)
and tried to tweak those that seemed promising... but hardly any seemed to
make any difference. It's not readily clear which parameters are relevant
for what usage.
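
Since it is hard to tell from the names which parameters matter, one option
is to measure it: run the same image with each candidate setting and compare
the output against a baseline run. The sketch below assumes the `tesseract`
binary is on the PATH. The helper names (`build_command`, `run_ocr`,
`changed_ratio`, `sweep`) are hypothetical, and the candidate list is up to
the caller; nothing here claims that any particular parameter fixes the
"1 - " case.

```python
import difflib
import subprocess


def build_command(image, psm=11, params=None):
    """Build a tesseract command line that writes plain text to stdout."""
    cmd = ["tesseract", image, "-", "--psm", str(psm)]
    for name, value in (params or {}).items():
        # Each parameter override is passed as: -c NAME=VALUE
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_ocr(image, psm=11, params=None):
    """Run tesseract and return the recognized text."""
    result = subprocess.run(
        build_command(image, psm, params),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def changed_ratio(baseline, candidate):
    """Return 0.0 for identical output, up to 1.0 for nothing in common."""
    matcher = difflib.SequenceMatcher(None, baseline, candidate)
    return 1.0 - matcher.ratio()


def sweep(image, candidates, psm=11):
    """Report how much each (name, value) setting changes the output."""
    baseline = run_ocr(image, psm)
    return {
        f"{name}={value}": changed_ratio(
            baseline, run_ocr(image, psm, {name: value})
        )
        for name, value in candidates
    }
```

For example, `sweep("page2.png", [("preserve_interword_spaces", 1)])` (the
file name is a placeholder) returns a ratio per setting; a value of 0.0
means that setting made no difference on that page.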

Orc.
