On Friday, January 26, 2018 at 2:23:39 PM UTC+1, shree wrote:
>
> Jozef,
>
> Thank you for your detailed answer and sample.
>
> Do you have a sample which can handle an image with tables using leptonica 
> and tesseract? 
>


Dear Shree, 

Your request is simply too generic. First of all, if you identify a table, 
what next? Imagine invoice tables with multi-line rows and completely 
different column layouts, etc.; removing horizontal/vertical lines does not 
help much (it can even make quality worse). You would then need to find the 
contents of each cell, which is again non-trivial with touching or even 
overlapping letters (as in a form). Furthermore, not all tables have 
horizontal/vertical lines, and there are many other specifics. 

However, to be at least somewhat helpful, I suggest starting by looking at 
(and modifying) 
https://github.com/DanBloomberg/leptonica/blob/45f5dbb78e5ac742312b85b21a79dedc726bb23b/src/pageseg.c#L1585
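To make the line-detection idea concrete, here is a toy sketch in pure Python of the underlying technique: a morphological opening with a wide 1 x K structuring element keeps long horizontal rules and erases letter-sized strokes. Leptonica does the equivalent on real images with its morphology routines; the function below is my own illustration, not code from pageseg.c.

```python
def open_horizontal(img, k):
    """Morphological opening of a binary image (list of 0/1 rows) with a
    1 x k horizontal structuring element: erosion keeps a pixel only if
    all k horizontal neighbours are set, then dilation grows it back."""
    h, w = len(img), len(img[0])
    half = k // 2

    def erode(src):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                if all(0 <= x + d < w and src[y][x + d]
                       for d in range(-half, half + 1)):
                    out[y][x] = 1
        return out

    def dilate(src):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                if any(0 <= x + d < w and src[y][x + d]
                       for d in range(-half, half + 1)):
                    out[y][x] = 1
        return out

    return dilate(erode(img))

# A long table rule survives the opening; short letter-sized strokes vanish.
img = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # table rule
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],   # letter-sized strokes
]
rules = open_horizontal(img, 7)
print(rules[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(rules[2])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The same opening with a tall K x 1 element finds vertical rules; as noted above, though, finding the rules is only the first (and easiest) step.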

Best,
Jozef

>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Jan 25, 2018 at 3:24 PM, <j...@mazoea.com> wrote:
>
>> Allow me to elaborate in this thread on general image processing 
>> questions. To justify this email, I also include one example solution at 
>> the end.
>>
>> Personally, I do not think such questions belong in this forum, because 
>> tesseract already does a great job at segmentation when you have no 
>> additional information about the input document set. Can it be improved? 
>> Definitely, but the cost/benefit ratio is poor, and I would rather see the 
>> authors/committers focus on other things than the handling of very 
>> specific documents. 
>>
>> That being said, if you really want high(er) precision, you simply have 
>> to do image processing. 
>> I have seen quite a few references to OpenCV, but no matter how great 
>> that library is, my suggestion for document image processing is to use 
>> Leptonica (https://github.com/DanBloomberg/leptonica/). Yes, the one 
>> tesseract uses internally. The library is very powerful and super fast, 
>> even without CPU/GPU magic. I have to admit I do not understand why it is 
>> not far more popular and more widely used among those who are (or have to 
>> be) at least a bit serious about document image processing. 
>>
>> The basic keywords you should understand before attempting any 
>> processing are: connected components, basic morphological operations 
>> (dilate, erode, open, close), structuring elements, and seed fills. With 
>> their rather simple usage, many questions in this forum could be answered 
>> (at least in a hardcoded way). The reason there are only a few helpful 
>> answers might be that it takes a considerable amount of time; I believe 
>> some people have internal frameworks where this can be done super easily 
>> but cannot share them. 
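Two of the keywords above, connected components and seed fills, can be illustrated with a toy pure-Python sketch: a BFS flood fill seeded at each unvisited foreground pixel yields the components and their bounding boxes. Leptonica's pixConnComp does this efficiently on real images; the code below is illustrative only.

```python
from collections import deque

def connected_components(img):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected foreground
    components in a binary image given as a list of 0/1 rows."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                # seed fill (BFS) starting from this pixel
                q = deque([(x, y)])
                seen[y][x] = True
                x0 = x1 = x
                y0 = y1 = y
                while q:
                    cx, cy = q.popleft()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    y0, y1 = min(y0, cy), max(y1, cy)
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                                   (cx, cy + 1), (cx, cy - 1)):
                        if (0 <= nx < w and 0 <= ny < h
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((nx, ny))
                boxes.append((x0, y0, x1, y1))
    return boxes

# Two separate blobs -> two components with their bounding boxes.
img = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(connected_components(img))  # [(0, 0, 1, 1), (4, 0, 4, 1)]
```

On a binarized page, boxes of letter-like size can then be grouped into words, lines, or table cells.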
>>
>> Furthermore, the current (LSTM-based) traineddata are very good, but you 
>> will find (even simple) examples where they do not perform well and you 
>> have to either do image processing or retrain (or use an older version 
>> that relies on different properties). Have a look at these simple images:
>> 1. 
>> https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
>> 2. 
>> https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
>> 3. they differ only in the value of one pixel (the red dot in 
>> https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png
>> )
>> 4. download the Latin "best" traineddata and run OCR on both images, e.g., 
>> tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
>> and you should get `MMEA` vs `MEA`. 
>> This might not be the best example, but I hope it illustrates the 
>> point.
>>
>>
>> Answer to original question
>>
>> In order to keep this message "short", I will stop here and point you to 
>> https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc
>> and 
>> https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh
>>
>> The code uses Leptonica: it prepares the image by scaling, deskewing and 
>> binarizing it, then (very) roughly tries to find possible letter 
>> descenders of Latin text on a line (here you could traverse the lines by 
>> columns and look for black pixels above/below), finds the lines and 
>> computes the result. It looks far from perfect, but the result is usable.
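The "finds lines" step can be sketched, very roughly, with a horizontal projection profile: count foreground pixels per row and treat maximal runs of non-empty rows as text lines. This pure-Python toy only illustrates the idea; the linked main.cc does considerably more (scaling, deskew, descender handling).

```python
def find_text_lines(img, min_pixels=1):
    """Return (top, bottom) row ranges of text lines in a binary image
    (list of 0/1 rows), based on the per-row foreground pixel count."""
    profile = [sum(row) for row in img]
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_pixels and start is None:
            start = y                      # a text line begins here
        elif count < min_pixels and start is not None:
            lines.append((start, y - 1))   # the line ended on the row above
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(img) - 1))
    return lines

# Two "lines" of text separated by an empty row.
img = [
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
print(find_text_lines(img))  # [(0, 1), (3, 3)]
```

On real scans you would raise min_pixels above noise level and smooth the profile first, otherwise speckles split the lines.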
>>
>>
>> Kind Regards,
>> Jozef
>>
>>
>>  
>>
>>
>> On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com 
>> wrote:
>>>
>>> Hello--I am attempting to pull full text from a few hundred JPGs that 
>>> contain information on death row executions hosted by the Texas Department 
>>> of Criminal Justice (TDCJ).
>>>
>>> Here's one example: 
>>> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another: 
>>> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>>>
>>> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a 
>>> fair amount of whitespace.  
>>>
>>> Tesseract has been able to capture the field names quite well, but has 
>>> had trouble with the values/sequences corresponding to each field/key.  For 
>>> example, on the jpg above, I get:
>>>
>>> *Co-Defendants'*
>>> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
>>> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
>>> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
>>> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
>>> *0 Inn . I II I*
>>>
>>> What I have tried thus far:
>>> - Increasing the image size & DPI significantly.
>>> - Pixel thresholding (from opencv 
>>> <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
>>> - Median blurring (from opencv 
>>> <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>>>  
>>> - both through the Python interface
>>> - Going through the Improve Quality 
>>> <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page, 
>>> but it is clear I am flailing around helplessly.
>>>
>>> I appreciate any suggestions for next steps; based on the characteristics 
>>> of the JPGs, which transformations would be most or least useful?
>>>
>>> Thank you.
>>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/16b82120-1d88-4df8-ba8e-1e4f38dd7221%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
