Update: I provided a more detailed walkthrough of my process thus far here:

https://stackoverflow.com/questions/48327567/fixing-text-grainy-ness-with-opencv

On Thursday, January 18, 2018 at 7:49:22 AM UTC-5, brad.sol...@gmail.com 
wrote:
>
> Hello--I am attempting to pull full text from a few hundred JPGs that 
> contain information on death row executions hosted by the Texas Department 
> of Criminal Justice (TDCJ).
>
> Here's one example: 
> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another: 
> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>
> In raw form, the images are mostly ~840x1100 pixels, ~139 KB, grayscale, 
> with a fair amount of whitespace.
>
> Tesseract has been able to capture the field names quite well, but has had 
> trouble with the values/sequences corresponding to each field/key.  For 
> example, on the jpg above, I get:
>
> *Co-Defendants'*
> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
> *0 Inn . I II I*
>
> What I have tried thus far:
> - Increasing image size & DPI significantly.
> - Pixel thresholding (from opencv 
> <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
> - Median blurring (from opencv 
> <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
> - Both of the above were done through the OpenCV Python interface.
> - Went through the Improve Quality 
> <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page, 
> but it is clear I am flailing around helplessly.
>
> Appreciate any suggestions for next steps; based on the characteristics of 
> the jpgs, what transformations would be most or least useful?
>
> Thank you.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
