Jozef, Thank you for your detailed answer and sample.
Do you have a sample that can handle an image with tables using Leptonica and Tesseract?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jan 25, 2018 at 3:24 PM, <j...@mazoea.com> wrote:

> I allow myself to elaborate in this thread on general image processing
> questions in this forum. On the other hand, I also include one example
> solution at the end to justify this email.
>
> Personally, I do not think that these questions should be posted in this
> forum, because tesseract is already doing a great job in segmentation when
> you do not have additional information about the input document set. Can it
> be improved? Definitely, but the price/performance ratio is too high and I
> would rather see the authors/committers focusing on other things than
> handling of very specific documents.
>
> That being said, if you really want to have high(er) precision you simply
> have to do image processing.
> I have seen references to opencv quite a lot but no matter how great that
> library is, for document image processing my suggestion is to use Leptonica
> (https://github.com/DanBloomberg/leptonica/). Yes, the one tesseract is
> using internally. That library is very powerful and super fast even without
> cpu/gpu magic. I have to admit that I do not understand why it is not much
> more popular and more widely used if you are/have to be at least a bit
> serious with document image processing.
>
> The basic keywords you should understand before even trying any processing
> are: connected components, basic morphological operations (dilate, erode,
> open, close), structuring elements and seed fills. With their rather simple
> usage, many questions in this forum could be answered (at least in a
> hardcoded way).
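
For readers unfamiliar with the keywords in the quoted paragraph, here is a minimal pure-Python sketch of binary dilation, erosion, closing, opening, and a seed-fill based connected-component search on a tiny bitmap. It does not use Leptonica, which provides fast implementations of these operations; the function names (`dilate`, `erode`, `close_`, `open_`, `connected_components`), the structuring element `H3` and the bitmap `IMG` are made up for illustration only.

```python
from collections import deque

def dilate(img, se):
    """Binary dilation: a pixel is set if any SE offset lands on a set pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in se:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w and img[yy][xx]:
                    out[y][x] = 1
                    break
    return out

def erode(img, se):
    """Binary erosion: a pixel survives only if every SE offset lands on a
    set pixel. Pixels outside the image count as background (assumption)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy, dx in se))
    return out

def close_(img, se):
    """Closing = dilation followed by erosion; bridges small gaps."""
    return erode(dilate(img, se), se)

def open_(img, se):
    """Opening = erosion followed by dilation; removes small specks."""
    return dilate(erode(img, se), se)

def connected_components(img):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected components,
    found with a breadth-first seed fill from each unvisited set pixel."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Horizontal 1x3 structuring element as (dy, dx) offsets around the origin.
H3 = [(0, -1), (0, 0), (0, 1)]

# Two small blobs separated by a one-pixel-wide gap.
IMG = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
```

On `IMG`, `connected_components(IMG)` finds two boxes, while `connected_components(close_(IMG, H3))` finds a single box because the horizontal closing bridges the gap. The same idea, with a wider structuring element, is a common way to merge letters into words or words into lines before looking at component boxes.
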
> The reason for only a few helpful answers might be that it takes a
> considerable amount of time, and I believe some people have internal
> frameworks where it can be done super easily but cannot share them.
>
> Furthermore, the current (lstm based) traineddata are very good but you
> will find (even simple) examples where they do not perform well and you
> have to either do image processing or retrain (or use an older version that
> relies on different properties). Have a look at these simple images:
> 1. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
> 2. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
> 3. they differ slightly in the value of one pixel (red dot in
>    https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png)
> 4. download Latin best and run OCR on both images, e.g.,
>    tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
>    and you should get `MMEA` vs `MEA`.
> Well, this might not be the best example but I hope it illustrates the
> point.
>
>
> Answer to original question
>
> In order to keep this message "short", I will stop here and point you to
> https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc
> and
> https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh
>
> The code uses leptonica. It prepares the image by scaling and deskewing
> it, binarizes it, and then (very) roughly tries to find possible letter
> descenders of latin text on a line (here you could traverse the lines by
> columns and look for black pixels above/below), finds lines and computes
> the result. It looks far from perfect but the result is usable.
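
As a rough illustration of the "binarize, then find lines" part of the description above, here is a small pure-Python sketch. It is not the code from the linked main.cc: it swaps in a fixed global threshold and a horizontal projection profile (counting ink pixels per row), and it assumes the page is already scaled and deskewed. The names `binarize`, `find_text_lines`, `threshold` and `min_ink` are hypothetical.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0..255 values) to 1 = ink, 0 = paper.
    A fixed global threshold is an assumption; real pages often need an
    adaptive method."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]

def find_text_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges where consecutive rows contain at
    least `min_ink` ink pixels, i.e. runs in the horizontal projection."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new line begins on this row
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # previous row closed the line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(binary) - 1))
    return lines

# Tiny 6x4 "page": ink on rows 1-2 and on row 4.
PAGE = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,  30, 255, 255],
    [255, 255, 255, 255],
    [255, 255,  10, 255],
    [255, 255, 255, 255],
]
```

Calling `find_text_lines(binarize(PAGE))` yields the two row ranges that contain ink. Each range could then be cropped and passed to tesseract separately; on skewed or noisy scans the profile will smear, which is why deskewing and cleanup come first.
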
>
> Kind Regards,
> Jozef
>
> On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com
> wrote:
>>
>> Hello--I am attempting to pull full text from a few hundred JPGs that
>> contain information on death row executions hosted by the Texas Department
>> of Criminal Justice (TDCJ).
>>
>> Here's one example:
>> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg;
>> another:
>> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>>
>> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a
>> fair amount of whitespace.
>>
>> Tesseract has been able to capture the field names quite well, but has
>> had trouble with the values/sequences corresponding to each field/key. For
>> example, on the jpg above, I get:
>>
>> *Co-Defendants'*
>> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
>> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
>> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
>> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
>> *0 Inn . I II I*
>>
>> What I have tried thus far:
>> - Increasing image size & dpi significantly.
>> - Pixel thresholding (from opencv
>>   <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
>> - Median blurring (from opencv
>>   <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>> - both through Python interface
>> - Went through the Improve Quality
>>   <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page,
>>   but it is clear I am flailing around helplessly.
>>
>> Appreciate any suggestions for next steps; based on the characteristics
>> of the jpgs, what transformations would be most or least useful?
>>
>> Thank you.
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.