Hello, I am attempting to extract the full text from a few hundred JPGs, hosted by the Texas Department of Criminal Justice (TDCJ), that contain information on death row executions.
Here's one example: http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another: http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.

In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a fair amount of whitespace.

Tesseract has been able to capture the field names quite well, but has had trouble with the values corresponding to each field. For example, on the JPG above, I get:

*Co-Defendants'*
*U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
*II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
*II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
*{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
*0 Inn . I II I*

What I have tried thus far:

- Increasing image size and DPI significantly.
- Pixel thresholding (from OpenCV <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
- Median blurring (from OpenCV <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>), both through the Python interface
- Going through the Improve Quality <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page, but it is clear I am flailing around helplessly.

I would appreciate any suggestions for next steps. Based on the characteristics of the JPGs, which transformations would be most or least useful?

Thank you.
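In case it helps anyone reproduce the problem, the resize and thresholding steps can be chained roughly as in the sketch below. It assumes the opencv-python and pytesseract packages are installed. The helper names (`upscale_factor`, `preprocess`, `ocr`), the 30 px target letter height, the 4x cap, and the choice of Otsu thresholding and `--psm 6` are illustrative assumptions, not settings known to work on these images.

```python
def upscale_factor(cap_height_px, target_px=30.0, max_factor=4.0):
    """Resize factor that brings capital letters to roughly target_px tall.

    target_px and max_factor are assumed starting points to tune,
    not documented Tesseract constants. Never returns less than 1.0,
    so images are only ever enlarged.
    """
    if cap_height_px <= 0:
        raise ValueError("cap_height_px must be positive")
    return min(max(target_px / cap_height_px, 1.0), max_factor)


def preprocess(path, cap_height_px):
    """Load a scan as grayscale, enlarge it, and binarize it."""
    import cv2  # third-party: opencv-python

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(path)

    factor = upscale_factor(cap_height_px)
    enlarged = cv2.resize(gray, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_CUBIC)

    # Otsu picks the global threshold from the histogram, so the
    # threshold argument (0) is ignored.
    _, binary = cv2.threshold(enlarged, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr(binary, psm=6):
    """Run Tesseract on a preprocessed image.

    psm=6 (treat the page as a single uniform block of text) is an
    assumption; other page segmentation modes may suit this form better.
    """
    import pytesseract  # third-party wrapper around the tesseract CLI

    return pytesseract.image_to_string(binary, config=f"--psm {psm}")
```

A call would look like `ocr(preprocess("ruizroland.jpg", cap_height_px=12))`, where 12 is only a guess at the letter height and would need to be measured on the actual scans.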