Allow me to use this thread to elaborate on general image processing questions. To justify the email, I also include one example solution at the end.
Personally, I do not think these questions belong in this forum, because tesseract already does a great job at segmentation when you have no additional information about the input document set. Can it be improved? Definitely, but the price/performance ratio is too high, and I would rather see the authors/committers focus on other things than the handling of very specific documents.

That being said, if you really want high(er) precision, you simply have to do image processing. I have seen quite a lot of references to opencv, but no matter how great that library is, for document image processing my suggestion is to use Leptonica (https://github.com/DanBloomberg/leptonica/) - yes, the one tesseract uses internally. That library is very powerful and super fast even without cpu/gpu magic. I have to admit I do not understand why it is not much more popular and more widely used by anyone who is (or has to be) at least a bit serious about document image processing.

The basic keywords you should understand before even trying any processing are: connected components, basic morphological operations (dilate, erode, open, close), structuring elements, and seed fills. With their rather simple usage, many questions in this forum could be answered (at least in a hardcoded way). The reason there are only a few helpful answers might be that it takes a considerable amount of time; I believe some people have internal frameworks where it can be done super easily, but they cannot share them.

Furthermore, the current (lstm based) traineddata are very good, but you will find (even simple) examples where they do not perform well, and then you have to either do image processing or retrain (or use an older version that relies on different properties). Have a look at these two simple images:

1. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
2. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
They differ only in the value of one pixel (the red dot in https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png). Download the best Latin traineddata and do OCR on both images, e.g. `tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout`, and you should get `MMEA` vs `MEA`. This might not be the best example, but I hope it illustrates the point.

Answer to the original question

To keep this message "short", I will stop here and point you to https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc and https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh. The code uses leptonica. It prepares the image by scaling, deskewing and binarizing it; then it (very) roughly tries to find possible letter descenders of latin text on a line (here you could traverse the lines column by column and look for black pixels above/below), finds the lines, and computes the result. It is far from perfect, but the result is usable.

Kind Regards,
Jozef

On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com wrote:
>
> Hello--I am attempting to pull full text from a few hundred JPGs that
> contain information on death row executions hosted by the Texas Department
> of Criminal Justice (TDCJ).
>
> Here's one example:
> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another:
> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>
> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a
> fair amount of whitespace.
>
> Tesseract has been able to capture the field names quite well, but has had
> trouble with the values/sequences corresponding to each field/key. For
> example, on the jpg above, I get:
>
> *Co-Defendants'*
> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
> *0 Inn .
> I II I*
>
> What I have tried thus far:
> - Increasing image size & dpi significantly.
> - Pixel thresholding (from opencv
>   <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
> - Median blurring (from opencv
>   <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>   - both through the Python interface
> - Went through the Improve Quality
>   <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page,
>   but it is clear I am flailing around helplessly.
>
> Appreciate any suggestions for next steps; based on the characteristics of
> the jpgs, what transformations would be most or least useful?
>
> Thank you.