I'm looking for help understanding why the OCR engine reports the error 
"Couldn't find a matching blob" when I train it using a box/tiff file 
pair. I understand it's important to have a high-resolution, binarized 
scan as the source for the languages/fonts you want to train Tesseract to 
recognize.

I'm just starting out and am creating my first training data file to read 
receipts. I began with a small trial before building a complete training 
file. That first trial produced one failure message, as you can see in the 
output below.

```
tesseract xxx.supermarche-pa.exp01.tif xxx.supermarche-pa.exp01 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 55/N ((98,803),(113,836)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:     351
   Boxes failed resegmentation:       1
   Found 350 good blobs.
   Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = supermarche-pa
Generated training data for 89 words
```
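To tie the failure back to the box file, here is a rough Python sketch that pulls the line number, character and coordinates out of that message. The pattern is based only on the single failure line shown above, and I'm assuming the four numbers are left, bottom, right and top in that order; `parse_failures` is just a name I made up for this.

```python
import re

# Pattern modelled on the one failure line in my output; other Tesseract
# versions may word or lay out this message differently.
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(\S+)\s+\(\((\d+),(\d+)\),\((\d+),(\d+)\)\)"
)

def parse_failures(log_text):
    """Return one dict per failed box reported in the training output."""
    failures = []
    for m in FAILURE_RE.finditer(log_text):
        failures.append({
            "line": int(m.group(1)),    # line number within the .box file
            "char": m.group(2),         # character the box is labelled with
            # Assumed order: (left, bottom), (right, top)
            "left": int(m.group(3)),
            "bottom": int(m.group(4)),
            "right": int(m.group(5)),
            "top": int(m.group(6)),
        })
    return failures

log = "APPLY_BOXES: boxfile line 55/N ((98,803),(113,836)): FAILURE! Couldn't find a matching blob"
for failure in parse_failures(log):
    print(failure)
```

For my output this points at line 55 of the box file, the box labelled "N".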

When I first generated the box file, I had to correct both the dimensions 
of this box and the character Tesseract detected. Looking at the character 
itself, it appears to be a poor sample: a rather broken, disjointed glyph. 
So I have a cursory understanding of why it couldn't find the blob, but I 
can't explain the reason for the failure in an intelligent manner.

If I have defined the box, why couldn't Tesseract find the blob? Is the 
number of pixels within the boxed area too low? Is there some sort of 
threshold?
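
To put a number on "pixels within the boxed area", this is a minimal Python sketch of how I would count them. I don't know whether Tesseract applies any such threshold; this only measures what is inside the box. It assumes the image is already binarized and loaded as a list of rows (top row first, 0 meaning black), that box coordinates are left, bottom, right, top with the origin at the bottom-left of the image, and that the right and top edges are exclusive. `count_dark_pixels` is a hypothetical helper, not part of Tesseract.

```python
def count_dark_pixels(image_rows, left, bottom, right, top, dark=0):
    """Count dark pixels inside a box given in bottom-left-origin coordinates.

    image_rows: list of rows, top row first, each a list of pixel values.
    The box is treated as half-open: [left, right) by [bottom, top).
    """
    height = len(image_rows)
    # Box y values are measured from the bottom of the image, while the
    # rows are stored from the top, so flip them. Clamp to the image.
    first_row = max(0, height - top)
    last_row = max(0, height - bottom)
    count = 0
    for row in image_rows[first_row:last_row]:
        count += sum(1 for px in row[left:right] if px == dark)
    return count

# Tiny 4x4 example image: 0 is black, 1 is white.
image = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
print(count_dark_pixels(image, left=1, bottom=1, right=3, top=3))
```

Running the same count on the failing box and on a few neighbouring boxes that succeeded would at least show whether the broken glyph has noticeably fewer dark pixels than the others.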

For completeness, I have included my files and highlighted the offending 
box in the JPG version. I'm not asking anyone to run Tesseract against 
these files.

https://www.dropbox.com/s/g10c4twgjkf1kvu/xxx.supermarche-pa.exp01.box
https://www.dropbox.com/s/gqrtqtaiz8wtcy8/xxx.supermarche-pa.exp01.jpg
https://www.dropbox.com/s/ezoumgvdtembnqo/xxx.supermarche-pa.exp01.tif

Thank you in advance for your help,
Nicholas

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en