In my experience Tesseract gives poor results when ruled lines run through 
the text. You can test this by manually whiting out the lines in a paint 
editor and rerunning Tesseract on the edited image. If the results improve, 
you will likely need to do this programmatically. That is not 
straightforward, since the lines touch the text, but you could remove at 
least some parts of them using OpenCV methods.
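
To make the idea concrete, here is a rough, dependency-free sketch of one way 
to white out ruled lines. It assumes the page has already been thresholded to 
pure black (0) and white (255) and loaded as a list of pixel rows. The names 
remove_long_runs and min_len are made up for illustration, and the run-length 
rule is my own simplification, not something Tesseract or OpenCV prescribes. 
With OpenCV you would more likely get a similar effect from a morphological 
opening with a long, thin kernel (cv2.morphologyEx with cv2.MORPH_OPEN).

```python
def remove_long_runs(binary, min_len):
    """Return a copy of `binary` with long straight ink runs whited out.

    `binary` is a list of rows holding 0 (ink) or 255 (background).
    Any horizontal or vertical run of ink at least `min_len` pixels long
    is treated as a ruled line. Runs are detected on the original image,
    so removing a horizontal line does not hide a vertical one.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    out = [row[:] for row in binary]

    # Horizontal pass: scan each row for long ink runs.
    for y in range(height):
        x = 0
        while x < width:
            if binary[y][x] != 0:
                x += 1
                continue
            start = x
            while x < width and binary[y][x] == 0:
                x += 1
            if x - start >= min_len:
                for i in range(start, x):
                    out[y][i] = 255

    # Vertical pass: scan each column for long ink runs.
    for x in range(width):
        y = 0
        while y < height:
            if binary[y][x] != 0:
                y += 1
                continue
            start = y
            while y < height and binary[y][x] == 0:
                y += 1
            if y - start >= min_len:
                for i in range(start, y):
                    out[i][x] = 255

    return out


# Tiny synthetic page: a 3x3 "glyph" whose descender touches a ruled line.
page = [[255] * 40 for _ in range(20)]
for x in range(40):
    page[10][x] = 0              # horizontal ruled line across the page
for y in range(4, 7):
    for x in range(5, 8):
        page[y][x] = 0           # small blob standing in for a character
for y in range(7, 10):
    page[y][6] = 0               # descender reaching down to the line

cleaned = remove_long_runs(page, min_len=15)
```

min_len has to be longer than the widest or tallest character stroke, 
otherwise parts of letters are erased too. Where a character crosses a line, 
the shared pixels are whited out with the line, so expect small gaps in 
those characters.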

On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com 
wrote:
>
> Hello--I am attempting to pull full text from a few hundred JPGs that 
> contain information on death row executions hosted by the Texas Department 
> of Criminal Justice (TDCJ).
>
> Here's one example: 
> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another: 
> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>
> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a 
> fair amount of whitespace.  
>
> Tesseract has been able to capture the field names quite well, but has had 
> trouble with the values/sequences corresponding to each field/key.  For 
> example, on the jpg above, I get:
>
> *Co-Defendants'*
> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
> *0 Inn . I II I*
>
> What I have tried thus far:
> - Increasing image size & dpi significantly.
> - Pixel thresholding (from opencv 
> <https://docs.opencv.org/3.3.1/d7/d4d/tutorial_py_thresholding.html>)
> - Median blurring (from opencv 
> <https://docs.opencv.org/3.1.0/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9>)
>  
> - both through Python interface
> - Went through the Improve Quality 
> <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality> page, 
> but it is clear I am flailing around helplessly.
>
> Appreciate any suggestions for next steps; based on the characteristics of 
> the jpgs, what transformations would be most or least useful?
>
> Thank you.
>
>
