Allow me to use this thread to elaborate on general image processing questions. To justify the email, I also include one example solution at the end.
Personally, I do not think these questions belong in this forum, because tesseract already does a great job at segmentation when you have no additional information about the input document set. Can it be improved? Definitely, but the price/performance ratio is too high, and I would rather see the authors/committers focus on other things than the handling of very specific documents.

That being said, if you really want high(er) precision, you simply have to do image processing. I have seen quite a lot of references to opencv, but no matter how great that library is, for document image processing my suggestion is to use Leptonica (https://github.com/DanBloomberg/leptonica/) - yes, the one tesseract uses internally. That library is very powerful and super fast even without cpu/gpu magic. I have to admit I do not understand why it is not much more popular and more widely used by anyone who is (or has to be) at least a bit serious about document image processing.

The basic keywords you should understand before even trying any processing are: connected components, basic morphological operations (dilate, erode, open, close), structuring elements, and seed fills. With their rather simple usage, many questions in this forum could be answered (at least in a hardcoded way). The reason there are only a few helpful answers might be that it takes a considerable amount of time; I believe some people have internal frameworks where it can be done super easily, but they cannot share them.

Furthermore, the current (lstm based) traineddata are very good, but you will find (even simple) examples where they do not perform well, and then you have to either do image processing or retrain (or use an older version that relies on different properties). Have a look at these two simple images:

1. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t1.png
2. https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/t2.png
They differ only in the value of one pixel (the red dot in https://github.com/mazoea/tesseract-samples/blob/master/bitchanges/diff.png). Download the best Latin traineddata and do OCR on both images, e.g. `tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout`, and you should get `MMEA` vs `MEA`. This might not be the best example, but I hope it illustrates the point.

Answer to the original question

To keep this message "short", I will stop here and point you to https://github.com/mazoea/tesseract-samples/blob/master/lines/main.cc and https://github.com/mazoea/tesseract-samples/blob/master/lines/test.sh. The code uses leptonica. It prepares the image by scaling, deskewing and binarizing it; then it (very) roughly tries to find possible letter descenders of latin text on a line (here you could traverse the lines column by column and look for black pixels above/below), finds the lines, and computes the result. It is far from perfect, but the result is usable.

Kind Regards,
Jozef

On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com wrote:
>
> Hello--I am attempting to pull full text from a few hundred JPGs that
> contain information on death row executions hosted by the Texas Department
> of Criminal Justice (TDCJ).
>
> Here's one example:
> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another:
> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>
> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a
> fair amount of whitespace.
>
> Tesseract has been able to capture the field names quite well, but has had
> trouble with the values/sequences corresponding to each field/key. For
> example, on the jpg above, I get:
>
> *Co-Defendants'*
> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
> *0 Inn .
> I II I*
>
> What I have tried thus far:
> - Increasing image size & dpi significantly.
> - Pixel thresholding (from opencv
>   <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
> - Median blurring (from opencv
>   <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>   - both through the Python interface
> - Went through the Improve Quality
>   <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page,
>   but it is clear I am flailing around helplessly.
>
> Appreciate any suggestions for next steps; based on the characteristics of
> the jpgs, what transformations would be most or least useful?
>
> Thank you.