Re: [tesseract-ocr] Re: Criminal record JPGs: Improving image quality

ShreeDevi Kumar Tue, 30 Jan 2018 10:03:49 -0800

Thanks for your response and the link to leptonica's table detection
routines.


Yes, my query was generic in nature, because I have seen many posts related
to OCR of tables, but hadn't come across any method addressing it.

You have correctly pointed out the reasons why it is so.


On 28-Jan-2018 8:20 PM, <j...@mazoea.com> wrote:



On Friday, January 26, 2018 at 2:23:39 PM UTC+1, shree wrote:
>
> Jozef,
>
> Thank you for your detailed answer and sample.
>
> Do you have a sample which can handle an image with tables using leptonica
> and tesseract?
>


Dear Shree,

Your request is simply too generic. First of all, once you have identified a
table, what next? Imagine invoice tables with multi-line rows and completely
different columns, etc. Removing horizontal/vertical lines does not help
much (it can even make the quality worse). You would need to find the contents
of each cell, which is again non-trivial with touching or even overlapping
letters (as in a form). Furthermore, not all tables have
horizontal/vertical lines, and there are many other specifics.
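
To illustrate why plain line removal can hurt, here is a minimal pure-Python sketch on a toy binary grid (1 = black). It is not Leptonica code; the helper name `open_horizontal` and the sample `page` are made up for illustration. It isolates long horizontal rules with a morphological opening (erosion followed by dilation with a 1 x n structuring element) and then subtracts them, which also cuts the vertical stroke that crosses the rule.

```python
def open_horizontal(img, n):
    """Morphological opening with a 1 x n horizontal structuring element.

    A pixel survives only if it belongs to a horizontal run of at least
    n foreground pixels, so long rules stay and letter strokes vanish.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w - n + 1):
            # Erosion: does the element fit entirely in the foreground here?
            if all(img[y][x + k] for k in range(n)):
                # Dilation: paint the whole element back.
                for k in range(n):
                    out[y][x + k] = 1
    return out


# Toy page: a rule on row 2 and a vertical stroke in column 3 crossing it.
page = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

rules = open_horizontal(page, 6)

# Naive removal: clear every pixel that belongs to a detected rule.
without_rules = [
    [p & (1 - r) for p, r in zip(prow, rrow)]
    for prow, rrow in zip(page, rules)
]

for row in without_rules:
    print("".join("#" if v else "." for v in row))
```

The stroke in column 3 ends up with a gap on row 2, which is the kind of damage that makes recognition worse when letters touch the table lines.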

However, to be at least somewhat helpful, I suggest starting by looking at
(and modifying)
https://github.com/DanBloomberg/leptonica/blob/45f5dbb78e5ac742312b85b21a79dedc726bb23b/src/pageseg.c#L1585

Best,
Jozef







>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Jan 25, 2018 at 3:24 PM, <j...@mazoea.com> wrote:
>
>> Allow me to use this thread to elaborate on the general image processing
>> questions that come up in this forum. To justify this email, I also include
>> one example solution at the end.
>>
>> Personally, I do not think that these questions really belong in this
>> forum, because tesseract is already doing a great job in
>> segmentation when you do not have additional information about the input
>> document set. Can it be improved? Definitely, but the cost is too high
>> for the benefit, and I would rather see the authors/committers focusing on
>> other things than the handling of very specific documents.
>>
>> That being said, if you really want to have high(er) precision you
>> simply have to do image processing.
>> I have seen references to opencv quite a lot, but no matter how great that
>> library is, for document image processing my suggestion is to use Leptonica
>> (https://github.com/DanBloomberg/leptonica/). Yes, the one tesseract is
>> using internally. That library is very powerful and super fast even without
>> cpu/gpu magic. I have to admit that I do not understand why it is not much
>> more popular and more widely used among people who are (or have to be) at
>> least a bit serious about document image processing.
>>
>> The basic keywords you should understand before even trying any
>> processing are: connected components, basic morphological operations
>> (dilate, erode, open, close), structuring elements and seed fills. With
>> their rather simple usage, many questions in this forum could be answered
>> (at least in a hardcoded way). The reason there are only a few helpful
>> answers might be that writing them takes a considerable amount of time, and
>> I believe some people have internal frameworks where it can be done super
>> easily but cannot share them.
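
As a rough illustration of two of these keywords, the following pure-Python sketch labels the connected components of a toy binary grid using a seed fill (breadth-first flood fill). It is only a teaching aid, not how Leptonica implements it; the function name `connected_components` and the sample `image` are assumptions made for this example.

```python
from collections import deque


def connected_components(img):
    """Label 4-connected foreground regions (value 1) of a binary grid.

    Each unlabeled foreground pixel acts as a seed; a breadth-first
    seed fill then spreads its label to every pixel reachable from it.
    Returns the label grid and one (x0, y0, x1, y1) box per component.
    """
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] != 1 or labels[sy][sx]:
                continue
            label = len(boxes) + 1
            labels[sy][sx] = label
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < w and 0 <= ny < h
                    if inside and img[ny][nx] == 1 and not labels[ny][nx]:
                        labels[ny][nx] = label
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return labels, boxes


image = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
labels, boxes = connected_components(image)
print(boxes)
```

Bounding boxes like these are a common starting point for hardcoded rules, for example discarding components that are far too wide or too small to be letters.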
>>
>> Furthermore, the current (LSTM-based) traineddata are very good, but you
>> will find (even simple) examples where they do not perform well and you
>> have to either do image processing or retrain (or use an older version that
>> relies on different properties). Have a look at these simple images:
>> 1. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
>> 2. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
>> 3. They differ only slightly, in the value of one pixel (the red dot in
>> https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png).
>> 4. Download the Latin "best" traineddata and run OCR on both images, e.g.,
>> tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
>> and you should get `MMEA` vs `MEA`.
>> Well, this might not be the best example, but I hope it illustrates the
>> point.
>>
>>
>> Answer to original question
>>
>> In order to keep this message "short", I will stop here and point you to
>> https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc
>> and
>> https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh
>>
>> The code uses Leptonica. It prepares the image by scaling, deskewing and
>> binarizing it, and then it (very) roughly tries to find
>> possible letter descenders of Latin text on a line (here you could traverse
>> the lines by columns and look for black pixels above/below), finds the lines
>> and computes the result. It looks far from perfect, but the result is usable.
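
The column traversal mentioned above could look roughly like the sketch below. It is not taken from main.cc; it is a pure-Python toy under stated assumptions: the rule is exactly one pixel thick, its row index is already known, and a rule pixel is kept only when there is black both directly above and directly below it (treated here as a stroke or descender crossing the rule). The name `remove_rule` and the sample `page` are hypothetical.

```python
def remove_rule(img, row):
    """Erase a one-pixel-thick horizontal rule on `row` of a binary grid.

    Rule pixels are kept only in columns where black pixels exist both
    directly above and directly below, so crossing strokes stay intact.
    """
    h = len(img)
    out = [r[:] for r in img]  # work on a copy, leave the input untouched
    for x in range(len(img[row])):
        above = row > 0 and img[row - 1][x] == 1
        below = row + 1 < h and img[row + 1][x] == 1
        if not (above and below):
            out[row][x] = 0
    return out


# Toy page: a rule on row 2, crossed by a vertical stroke in column 3.
page = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

cleaned = remove_rule(page, 2)
for r in cleaned:
    print("".join("#" if v else "." for v in r))
```

On real scans the rule is several pixels thick and slightly skewed, so the above/below test would have to look a few rows past the rule instead of at the immediate neighbours.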
>>
>>
>> Kind Regards,
>> Jozef
>>
>>
>>
>>
>>
>> On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com
>> wrote:
>>>
>>> Hello--I am attempting to pull full text from a few hundred JPGs that
>>> contain information on death row executions hosted by the Texas Department
>>> of Criminal Justice (TDCJ).
>>>
>>> Here's one example:
>>> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another:
>>> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>>>
>>> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a
>>> fair amount of whitespace.
>>>
>>> Tesseract has been able to capture the field names quite well, but has
>>> had trouble with the values/sequences corresponding to each field/key.  For
>>> example, on the jpg above, I get:
>>>
>>> *Co-Defendants'*
>>> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
>>> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
>>> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
>>> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
>>> *0 Inn . I II I*
>>>
>>> What I have tried thus far:
>>> - Increasing image size & dpi significantly.
>>> - Pixel thresholding (from opencv
>>> <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
>>> - Median blurring (from opencv
>>> <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>>> - both through Python interface
>>> - Went through the Improve Quality
>>> <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page,
>>> but it is clear I am flailing around helplessly.
>>>
>>> Appreciate any suggestions for next steps; based on the characteristics
>>> of the jpgs, what transformations would be most or least useful?
>>>
>>> Thank you.
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>>
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit https://groups.google.com/d/ms
>> gid/tesseract-ocr/44423aa4-ed2e-46a8-a31c-a90489bf9f6a%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/44423aa4-ed2e-46a8-a31c-a90489bf9f6a%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
