Update: I provided a more detailed walkthrough of my process thus far here:
https://stackoverflow.com/questions/48327567/fixing-text-grainy-ness-with-opencv

On Thursday, January 18, 2018 at 7:49:22 AM UTC-5, brad.sol...@gmail.com wrote:
>
> Hello--I am attempting to pull full text from a few hundred JPGs that
> contain information on death row executions hosted by the Texas Department
> of Criminal Justice (TDCJ).
>
> Here's one example:
> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another:
> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>
> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a
> fair amount of whitespace.
>
> Tesseract has been able to capture the field names quite well, but has had
> trouble with the values/sequences corresponding to each field/key. For
> example, on the jpg above, I get:
>
> *Co-Defendants'*
> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
> *0 Inn . I II I*
>
> What I have tried thus far:
> - Increasing image size & dpi significantly.
> - Pixel thresholding (from opencv
> <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
> - Median blurring (from opencv
> <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>   - both through Python interface
> - Went through the Improve Quality
> <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page,
> but it is clear I am flailing around helplessly.
>
> Appreciate any suggestions for next steps; based on the characteristics of
> the jpgs, what transformations would be most or least useful?
>
> Thank you.
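
For anyone following along, here is a rough, dependency-free sketch of what the two OpenCV steps mentioned in the original message (median blurring, then pixel thresholding) compute on a grayscale image. It is only an illustration, not the code I actually ran: the function names `median_blur_3x3` and `binarize`, the 5x5 `patch`, and the threshold of 127 are all made up for the example, and the border handling is simplified compared with OpenCV. In practice `cv2.medianBlur` and `cv2.threshold` do this work on the real images.

```python
from statistics import median


def median_blur_3x3(img):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Border pixels use only the neighbours that exist; this is a
    simplification and differs from OpenCV's border handling.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = int(median(window))
    return out


def binarize(img, thresh=127):
    """Map pixels above `thresh` to 255 (white) and all others to 0 (black)."""
    return [[255 if p > thresh else 0 for p in row] for row in img]


# Hypothetical 5x5 patch: white paper (255) with a single dark speck (0).
patch = [[255] * 5 for _ in range(5)]
patch[2][2] = 0

# Blur first to suppress isolated specks, then binarize.
cleaned = binarize(median_blur_3x3(patch))
```

The order matters in this sketch: blurring before thresholding removes the isolated speck, whereas thresholding first would leave it in place as a solid black pixel. The same median filter will also erase thin strokes in small text, which may be part of why these steps alone have not helped on the scans.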