On Friday, January 26, 2018 at 2:23:39 PM UTC+1, shree wrote:
>
> Jozef,
>
> Thank you for your detailed answer and sample.
>
> Do you have a sample which can handle an image with tables using Leptonica
> and Tesseract?
Dear Shree,

your request is simply too generic. First of all, if you identify a table, what next? Imagine invoice tables with multi-line rows and completely different column layouts; removing the horizontal/vertical lines does not help much (it can even make quality worse). You would then need to find the contents of each cell, which is again non-trivial with touching or even overlapping letters (as in a form). Furthermore, not all tables have horizontal/vertical lines, and there are many other specifics. However, to be at least somewhat helpful, I suggest starting by looking at (and modifying)
https://github.com/DanBloomberg/leptonica/blob/45f5dbb78e5ac742312b85b21a79dedc726bb23b/src/pageseg.c#L1585

Best,
Jozef

> ShreeDevi
> ____________________________________________________________
> Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
>
> On Thu, Jan 25, 2018 at 3:24 PM, <j...@mazoea.com> wrote:
>
>> I allow myself to elaborate in this thread on general image-processing
>> questions in this forum. On the other hand, I also include one example
>> solution at the end to justify this email.
>>
>> Personally, I do not think these questions should be posted in this
>> forum, because Tesseract already does a great job at segmentation when
>> you have no additional information about the input document set. Can it
>> be improved? Definitely, but the cost/benefit ratio is too high, and I
>> would rather see the authors/committers focus on other things than the
>> handling of very specific documents.
>>
>> That being said, if you really want higher precision, you simply have
>> to do image processing. I have seen references to OpenCV quite a lot,
>> but no matter how great that library is, for document image processing
>> my suggestion is to use Leptonica
>> (https://github.com/DanBloomberg/leptonica/) -- yes, the one Tesseract
>> uses internally. The library is very powerful and super fast, even
>> without CPU/GPU magic.
>> I have to admit that I do not understand why it is not much more
>> popular and more widely used by anyone who has to be at least a bit
>> serious about document image processing.
>>
>> The basic keywords you should understand before even trying any
>> processing are: connected components, basic morphological operations
>> (dilate, erode, open, close), structuring elements, and seed fills.
>> With their rather simple usage, many questions in this forum could be
>> answered (at least in a hardcoded way). The reason there are only a few
>> helpful answers might be that it takes a considerable amount of time;
>> I believe some people have internal frameworks where this can be done
>> very easily, but they cannot share them.
>>
>> Furthermore, the current (LSTM-based) traineddata are very good, but
>> you will find (even simple) examples where they do not perform well and
>> you have to either do image processing or retrain (or use an older
>> version that relies on different properties). Have a look at these
>> simple images:
>> 1. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
>> 2. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
>> 3. They differ in the value of a single pixel (the red dot in
>>    https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png).
>> 4. Download the Latin "best" traineddata and run OCR on both images, e.g.
>>    tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
>>    and you should get `MMEA` vs `MEA`.
>> Well, this might not be the best example, but I hope it illustrates the
>> point.
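As a language-neutral illustration of the dilate/erode/close vocabulary above, here is a minimal pure-Python sketch on a tiny binary grid. This is only the concept, not the Leptonica API: in Leptonica you would operate on a Pix with brick structuring elements via functions such as pixDilateBrick()/pixErodeBrick()/pixCloseBrick(), and border handling below is simplified (the structuring element is clipped at the image edge).

```python
# Sketch of binary dilation/erosion with a w x h "brick" structuring
# element. For simplicity the window is clipped at the borders, which is
# not identical to treating outside pixels as background.

def _window(img, r, c, w, h):
    """Yield all pixel values under the brick centered at (r, c), clipped."""
    rows, cols = len(img), len(img[0])
    for rr in range(max(0, r - h // 2), min(rows, r + h // 2 + 1)):
        for cc in range(max(0, c - w // 2), min(cols, c + w // 2 + 1)):
            yield img[rr][cc]

def dilate(img, w=3, h=3):
    """Set a pixel if ANY pixel under the brick is set (grows blobs)."""
    return [[int(any(_window(img, r, c, w, h)))
             for c in range(len(img[0]))] for r in range(len(img))]

def erode(img, w=3, h=3):
    """Keep a pixel only if ALL pixels under the brick are set (shrinks blobs)."""
    return [[int(all(_window(img, r, c, w, h)))
             for c in range(len(img[0]))] for r in range(len(img))]

def close_op(img, w=3, h=3):
    """Morphological closing = dilate then erode; fills small gaps."""
    return erode(dilate(img, w, h), w, h)
```

For example, closing a one-row image with a one-pixel gap, `close_op([[1, 1, 0, 1, 1]])`, bridges the gap into a solid run -- the same trick that joins broken characters or merges letters of a word into one connected component before component analysis.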
>>
>> Answer to the original question
>>
>> To keep this message "short", I will stop here and point you to
>> https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc
>> and
>> https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh
>>
>> The code uses Leptonica. It prepares the image by scaling, deskewing,
>> and binarizing it, and then (very) roughly tries to find possible
>> letter descenders of Latin text on a line (here you could traverse the
>> lines by columns and look for black pixels above/below), finds the
>> lines, and computes the result. It is far from perfect, but the result
>> is usable.
>>
>> Kind Regards,
>> Jozef
>>
>> On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com wrote:
>>>
>>> Hello -- I am attempting to pull full text from a few hundred JPGs
>>> that contain information on death row executions, hosted by the Texas
>>> Department of Criminal Justice (TDCJ).
>>>
>>> Here's one example:
>>> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another:
>>> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>>>
>>> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with
>>> a fair amount of whitespace.
>>>
>>> Tesseract has been able to capture the field names quite well, but has
>>> had trouble with the values/sequences corresponding to each field/key.
>>> For example, on the JPG above, I get:
>>>
>>> *Co-Defendants'*
>>> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
>>> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
>>> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
>>> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
>>> *0 Inn . I II I*
>>>
>>> What I have tried thus far:
>>> - Increasing image size & DPI significantly.
>>> - Pixel thresholding (from OpenCV:
>>>   https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html)
>>> - Median blurring (from OpenCV:
>>>   https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9)
>>>   (both through the Python interface)
>>> - Going through the Improve Quality page
>>>   (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality),
>>>   but it is clear I am flailing around helplessly.
>>>
>>> I would appreciate any suggestions for next steps; based on the
>>> characteristics of the JPGs, which transformations would be most or
>>> least useful?
>>>
>>> Thank you.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/16b82120-1d88-4df8-ba8e-1e4f38dd7221%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
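Since the original question mentions pixel thresholding: the binarization step in Jozef's pipeline is typically a global or adaptive threshold. Leptonica provides pixOtsuAdaptiveThreshold() and OpenCV exposes Otsu's method via the THRESH_OTSU flag of cv2.threshold; the pure-Python version below is only a self-contained sketch of the underlying idea, not either library's implementation.

```python
# Otsu's method: pick the threshold that maximizes the between-class
# variance of the background/foreground split of a grayscale histogram.

def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 8-bit grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    sum_bg = 0.0   # weighted sum of the background class so far
    w_bg = 0       # pixel count of the background class so far
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On these TDCJ scans a single global threshold may well be good enough (dark ink on light paper); for uneven illumination the adaptive (tiled) variants in Leptonica or OpenCV are the usual next step.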