I'm looking for help understanding why the OCR engine reports the error "Couldn't find a matching blob" when I train it with a box/tiff file pair. I understand it's important to have a high-resolution, binarized scan as the source for the languages/fonts you wish to train Tesseract to recognize.
I'm creating my first training data file, which is meant for reading receipts. I've started with a small trial before building a complete training file. The first trial produced one failure message, as you can see in the output below.

```
tesseract xxx.supermarche-pa.exp01.tif xxx.supermarche-pa.exp01 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 55/N ((98,803),(113,836)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile: 351
   Boxes failed resegmentation: 1
   Found 350 good blobs.
   Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = supermarche-pa
Generated training data for 89 words
```

When I first generated the box file, I had to correct both the dimensions of this box and the character it detected. Looking at the character itself, it appears to be a poor sample: a rather broken, disjointed glyph. So I have a cursory understanding of why the blob couldn't be found, but I couldn't explain the reason for the failure in an intelligent manner.

If I have defined the box, why couldn't Tesseract find a blob inside it? Is it that the number of pixels within the boxed area is too low? Is there some sort of threshold?

For completeness I have included my files and highlighted the offending box in the JPG version. I'm not asking anyone to run Tesseract against these files.

https://www.dropbox.com/s/g10c4twgjkf1kvu/xxx.supermarche-pa.exp01.box
https://www.dropbox.com/s/gqrtqtaiz8wtcy8/xxx.supermarche-pa.exp01.jpg
https://www.dropbox.com/s/ezoumgvdtembnqo/xxx.supermarche-pa.exp01.tif

Thank you in advance for your help,
Nicholas
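For anyone who wants to inspect a failing box by hand, here is a minimal Python sketch, not Tesseract's own matching logic. It relies on the Tesseract 3.x box-file layout (`<char> <left> <bottom> <right> <top> <page>`, with y measured upward from the bottom edge of the image) and assumes the coordinates in the error message are printed as ((left,bottom),(right,top)). The helper names, the character `e`, and the image height of 1000 are hypothetical placeholders; the real values would come from line 55 of the box file and from the TIFF itself.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.strip().rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError(f"not a valid box line: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def to_image_rect(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert bottom-left-origin box coordinates to a top-left-origin
    (x0, y0, x1, y1) rectangle, as used by most image libraries."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


def count_ink(bitmap: List[List[int]], rect: Tuple[int, int, int, int]) -> int:
    """Count foreground pixels (value 1) inside rect in a row-major bitmap."""
    x0, y0, x1, y1 = rect
    return sum(sum(row[x0:x1]) for row in bitmap[y0:y1])


if __name__ == "__main__":
    # Coordinates from the error message; the character 'e' and the
    # image height of 1000 are hypothetical placeholders.
    box = parse_box_line("e 98 803 113 836 0")
    rect = to_image_rect(box, image_height=1000)
    print("crop rectangle (x0, y0, x1, y1):", rect)
    print("box size:", box.right - box.left, "x", box.top - box.bottom)
```

Cropping the binarized TIFF to that rectangle and counting foreground pixels with something like `count_ink` shows how much ink actually falls inside the box. It does not reveal whatever criterion Tesseract applies internally.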

