Hello, I am trying to extract the full text from a few hundred JPGs, 
hosted by the Texas Department of Criminal Justice (TDCJ), that contain 
information on death row executions.

Here's one 
example: http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; 
another: http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.

In raw form, the images are mostly ~840x1100 pixels, 139 KB, grayscale, 
with a fair amount of whitespace.

Tesseract captures the field names quite well, but has trouble with the 
values corresponding to each field. For example, for one of the JPGs 
above, I get:

*Co-Defendants'*
*U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
*II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
*II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
*{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
*0 Inn . I II I*

What I have tried thus far:
- Increasing image size & dpi significantly.
- Pixel thresholding (OpenCV 
<https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
- Median blurring (OpenCV 
<https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>); 
both through the Python interface.
- Working through the Improve Quality page 
<https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality>, but 
it is clear I am flailing around helplessly.

I would appreciate any suggestions for next steps. Based on the 
characteristics of the JPGs, which transformations would be most or least 
useful?

Thank you.
