Are you sure you attached the correct image? That looks more like a Rodin than a Rousseau.
The red circle and printing might be amenable to a color selection technique after which you could desaturate, lighten, replace with background color, etc. Of course if it's overlapping the text, that'll complicate things. The foxing is also going to cause problems due to its uneven nature, but on the plus side, you've got pretty good dark blacks to work with in the print.

The Rodin label looks like it's lifted on one side, warping the image. If that's common, you might want to consider a dewarping algorithm. Ditto for deskewing crooked labels.

Good luck! Looks like a fun project.

Tom

On Tuesday, January 30, 2018 at 1:44:45 AM UTC-5, Lauren Arnett wrote:
>
> I'm working on a slide digitization project for a collection of 35mm slides, all similar to the one attached. I'd like to improve my Tesseract output for these slides. My preprocessing techniques follow as so:
>
> 1) crop into top third and bottom third to remove center image
> 2) Apply Gaussian blur
> 3) Apply Otsu Thresholding with OpenCV
>
> I then run Tesseract on each chunk of the image with load_system_dawg and load_freq_dawg set to false to ignore the main Tesseract dictionary.
>
> I've had mixed success with the slides. I especially run into trouble as each slide is marked with a red circle that can overlay text and ruin the thresholding.
>
> The results I get on the attached image is:
>
> RDUSSEAU . H P32. R7 52
> WAR A
>
> DETAIL: center w/ girl .
> [IBQHI '
>
> Pgris: Mueee g'Orsay
> i; "v ‘ ..:M“W ‘1Pvt. Collection, Paris.
> Varnedoe Photo.
>
>
> What can I do to improve my preprocessing for Tesseract, or are there other specific parameters with Tesseract itself I can manipulate to improve output? How can I deal with separating text from the red circle overlays?
>
> Thank you very much for any suggestions!
>
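To make the color selection idea concrete, here is a minimal sketch of one way it could work. It is not a tested recipe for these slides. It assumes the scan has already been loaded as a uint8 array in OpenCV's BGR channel order (for example with cv2.imread), and it uses only NumPy. The function name suppress_red_overlay and the thresholds min_red and dominance are made up for illustration and would need tuning against real scans. Pixels where red clearly dominates green and blue are replaced with a crude background estimate (the per-channel median), and the result could then go through the existing blur and Otsu steps.

```python
import numpy as np

def suppress_red_overlay(bgr, min_red=120, dominance=60):
    """Replace strongly red pixels with an estimated background color.

    bgr: uint8 array of shape (H, W, 3) in BGR channel order.
    min_red, dominance: hypothetical thresholds, to be tuned per collection.
    Returns the cleaned image and the boolean mask of replaced pixels.
    """
    # Work in a signed type so channel differences cannot wrap around.
    img = bgr.astype(np.int16)
    b, g, r = img[..., 0], img[..., 1], img[..., 2]

    # A pixel counts as "red ink" if red is bright and clearly dominates.
    red_mask = (r >= min_red) & (r - g >= dominance) & (r - b >= dominance)

    # Crude background estimate: per-channel median over the whole image.
    # This assumes the label paper covers more than half of the crop.
    background = np.median(bgr.reshape(-1, 3), axis=0).astype(np.uint8)

    cleaned = bgr.copy()
    cleaned[red_mask] = background
    return cleaned, red_mask
```

Where the circle crosses a printed character, the pixel is usually dark rather than bright red, so it falls below min_red and is left alone. That is the hope, at least; uneven foxing may still call for a local rather than global background estimate.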

