Re: [tesseract-ocr] Re: Criminal record JPGs: Improving image quality

ShreeDevi Kumar Fri, 26 Jan 2018 05:23:58 -0800

Jozef,

Thank you for your detailed answer and sample.


Do you have a sample which can handle an image with tables using leptonica
and tesseract?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jan 25, 2018 at 3:24 PM, <j...@mazoea.com> wrote:

> Allow me to use this thread to comment on the general image processing
> questions that come up in this forum. To justify this email, I also include
> one example solution at the end.
>
> Personally, I do not think that these questions belong in this forum,
> because tesseract already does a great job at segmentation when you have
> no additional information about the input document set. Can it be
> improved? Definitely, but the cost-to-benefit ratio is too high and I would
> rather see the authors/committers focusing on other things than the
> handling of very specific documents.
>
> That being said, if you really want high(er) precision, you simply have
> to do image processing.
> I have seen references to opencv quite a lot, but no matter how great that
> library is, for document image processing my suggestion is to use Leptonica
> (https://github.com/DanBloomberg/leptonica/). Yes, the one tesseract is
> using internally. That library is very powerful and super fast even without
> cpu/gpu magic. I have to admit that I do not understand why it is not much
> more popular and more widely used by anyone who is, or has to be, at least
> a bit serious about document image processing.
>
> The basic keywords you should understand before even trying any processing
> are: connected components, basic morphological operations (dilate, erode,
> open, close), structuring elements and seed fills. With their rather simple
> usage, many questions in this forum could be answered (at least in a
> hardcoded way). The reason for only a few helpful answers might be that it
> takes a considerable amount of time and I believe some people have their
> internal frameworks where it can be done super easily but cannot share it.
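
To make the keywords above concrete, here is a minimal pure-Python sketch of dilation, erosion, opening, closing and connected components on a tiny binary image. It does not use Leptonica; the function names, the `IMAGE` grid, the square structuring element and the choice to treat pixels outside the image as background are all assumptions made for illustration.

```python
from collections import deque

# Tiny binary image as a list of rows; 1 = ink (foreground), 0 = background.
IMAGE = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]

def _morph(img, combine, size=3):
    """Apply a size x size square structuring element (size must be odd).
    Pixels outside the image count as background (an assumption)."""
    rows, cols = len(img), len(img[0])
    r = size // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = (
                img[y + dy][x + dx]
                if 0 <= y + dy < rows and 0 <= x + dx < cols else 0
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            )
            out[y][x] = 1 if combine(window) else 0
    return out

def dilate(img, size=3):
    """A pixel is set if ANY pixel under the structuring element is set."""
    return _morph(img, any, size)

def erode(img, size=3):
    """A pixel is set only if ALL pixels under the structuring element are set."""
    return _morph(img, all, size)

def opening(img, size=3):
    """Erode, then dilate: removes blobs smaller than the structuring element."""
    return dilate(erode(img, size), size)

def closing(img, size=3):
    """Dilate, then erode: fills gaps smaller than the structuring element."""
    return erode(dilate(img, size), size)

def connected_components(img):
    """Return bounding boxes (x, y, w, h) of 8-connected ink regions.
    Each region is collected with a breadth-first seed fill."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for y in range(rows):
        for x in range(cols):
            if not img[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return boxes
```

On this grid, `connected_components(IMAGE)` reports three separate blobs, while `connected_components(dilate(IMAGE))` reports a single one, which is the usual trick for merging letters into words or lines. `opening(IMAGE)` removes everything, because no blob is as large as the 3x3 structuring element.
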
>
> Furthermore, the current (lstm based) traineddata are very good but you
> will find (even simple) examples where they do not perform well and you
> have to either do image processing or retrain (or use an older version that
> relies on different properties). Have a look at these simple images:
> 1. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
> 2. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
> 3. They differ slightly in the value of one pixel (red dot in
> https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png).
> 4. Download the Latin "best" traineddata and run OCR on both images, e.g.,
> tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
> and you should get `MMEA` vs `MEA`.
> Well, this might not be the best example but I hope it illustrates the
> point.
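
If you want to confirm for yourself that two images differ in only one pixel, a comparison along these lines would do. This is a sketch on synthetic grayscale grids: `T1` and `T2` here are made-up stand-ins, not the actual t1.png and t2.png, and loading real PNG files would need an image library that is not shown.

```python
def differing_pixels(a, b):
    """Return (x, y, value_in_a, value_in_b) for every pixel that differs."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    return [
        (x, y, va, vb)
        for y, (ra, rb) in enumerate(zip(a, b))
        for x, (va, vb) in enumerate(zip(ra, rb))
        if va != vb
    ]

# Synthetic 3x3 grayscale images (0 = black, 255 = white), for illustration only.
T1 = [
    [255, 255, 255],
    [255,   0, 255],
    [255, 255, 255],
]
T2 = [row[:] for row in T1]
T2[2][1] = 250  # change one pixel slightly

CHANGES = differing_pixels(T1, T2)
print(CHANGES)
```
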
>
>
> Answer to original question
>
> In order to keep this message "short", I will stop here and point you to
> https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc
> and
> https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh
>
> The code uses Leptonica: it prepares the image by scaling, deskewing and
> binarizing it, then (very) roughly tries to find possible letter descenders
> of Latin text on a line (here you could traverse the lines by columns and
> look for black pixels above/below), finds lines and computes the result.
> The result looks far from perfect but it is usable.
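
As a rough illustration of two of the steps described above, the sketch below binarizes a small grayscale grid, finds rows that contain a long horizontal run of ink, and then walks the columns looking for ink both above and below such a row. It is not the code from main.cc and does not call Leptonica. It assumes that "lines" means ruled horizontal lines on the form, which is only one reading of the description, and it skips scaling and deskewing. All names, the fixed threshold and the `GRAY` sample are hypothetical.

```python
W, K = 255, 0  # white and black grayscale values

# Synthetic grayscale image: two vertical strokes, a ruled line at row 3,
# and one stroke that continues below the line (a descender candidate).
GRAY = [
    [W, W, W, W, W, W, W, W],
    [W, K, W, W, K, W, W, W],
    [W, K, W, W, K, W, W, W],
    [K, K, K, K, K, K, K, K],
    [W, W, W, W, K, W, W, W],
]

def binarize(gray, threshold=128):
    """Global threshold: pixels darker than `threshold` become ink (1)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

def longest_run(row):
    """Length of the longest horizontal run of ink in one row."""
    best = cur = 0
    for v in row:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best

def find_ruled_lines(binary, min_len):
    """Indices of rows holding a horizontal ink run of at least `min_len`."""
    return [y for y, row in enumerate(binary) if longest_run(row) >= min_len]

def columns_crossing(binary, line_y):
    """Columns with ink directly above AND directly below row `line_y`."""
    if line_y <= 0 or line_y >= len(binary) - 1:
        return []
    above, below = binary[line_y - 1], binary[line_y + 1]
    return [x for x in range(len(above)) if above[x] and below[x]]

BINARY = binarize(GRAY)
LINES = find_ruled_lines(BINARY, min_len=6)
CROSSINGS = {y: columns_crossing(BINARY, y) for y in LINES}
print(LINES, CROSSINGS)
```

Knowing which columns cross a ruled line matters if you later erase the line: at those columns you would want to keep the ink so that descenders are not cut off.
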
>
>
> Kind Regards,
> Jozef
>
>
>
>
>
> On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com
> wrote:
>>
>> Hello--I am attempting to pull full text from a few hundred JPGs that
>> contain information on death row executions hosted by the Texas Department
>> of Criminal Justice (TDCJ).
>>
>> Here's one example:
>> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another:
>> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>>
>> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a
>> fair amount of whitespace.
>>
>> Tesseract has been able to capture the field names quite well, but has
>> had trouble with the values/sequences corresponding to each field/key.  For
>> example, on the jpg above, I get:
>>
>> *Co-Defendants'*
>> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
>> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
>> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
>> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
>> *0 Inn . I II I*
>>
>> What I have tried thus far:
>> - Increasing image size & dpi significantly.
>> - Pixel thresholding (from opencv
>> <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
>> - Median blurring (from opencv
>> <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>> - both through Python interface
>> - Went through the Improve Quality
>> <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page,
>> but it is clear I am flailing around helplessly.
>>
>> Appreciate any suggestions for next steps; based on the characteristics
>> of the jpgs, what transformations would be most or least useful?
>>
>> Thank you.
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/44423aa4-ed2e-46a8-a31c-a90489bf9f6a%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/44423aa4-ed2e-46a8-a31c-a90489bf9f6a%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

