Tesseract expects book-like text pages, i.e. black (dark) text on a white (light, plain) background. While Tesseract includes a few preprocessing features for removing lines (drawings), it is generally better to have everything that is not text removed beforehand.

You mention cropping: that is one way to approach this, but Tesseract works well when the "page" has a bit of a white border (background color). May I suggest you look into *masking* the non-text area of the image instead? The benefit of the masking approach is that the text (word, character) coordinates reported by Tesseract (hOCR, TSV formats) will then match the original image, as masking does not change the image dimensions. Hence the key is to mask any in-page image content and replace it with the background color (white). One tool that can do this for batches is ImageMagick. See https://legacy.imagemagick.org/Usage/masking/ for various ways to create and deal with masks there.
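To make the masking idea concrete, here is a minimal sketch in plain Python. It is not ImageMagick and not anything Tesseract provides: the function name `mask_outside`, the nested-list grayscale image, and the `(left, top, right, bottom)` box are all assumptions made for illustration. It also assumes you already know where the page sits in the photo; finding that box automatically is a separate problem. The point is only to show that everything outside the page becomes white while the image keeps its size, so pixel coordinates stay valid.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of grayscale values, 0 (black) .. 255 (white)


def mask_outside(
    pixels: Gray,
    box: Tuple[int, int, int, int],
    background: int = 255,
) -> Gray:
    """Return a copy of `pixels` with everything outside `box` set to `background`.

    `box` is (left, top, right, bottom) with right/bottom exclusive.
    This is a hypothetical helper for illustration, not a library API.
    """
    left, top, right, bottom = box
    return [
        [
            value if (top <= y < bottom and left <= x < right) else background
            for x, value in enumerate(row)
        ]
        for y, row in enumerate(pixels)
    ]


# Toy 6x5 "photo": dark surroundings (40), a light page (250), one dark "glyph" (0).
photo = [[40] * 6 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 5):
        photo[y][x] = 250
photo[2][3] = 0

masked = mask_outside(photo, (1, 1, 5, 4))

# Same width and height as before, so a glyph at (x=3, y=2) is still at (3, 2).
print(len(masked[0]), "x", len(masked))
```

With real photos you would apply the same idea to actual image files (for example with ImageMagick, as linked above) rather than to nested lists.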
Ciao,
Ger

On Tue, 20 Aug 2024, 08:31 Fred Eisele, <fredrick.eis...@gmail.com> wrote:

> Can tesseract crop the background before processing the text?
> Imagine you have jpeg images taken with a camera and there is
> a certain amount of photo around the document.
> When I run tesseract on the original photo it does not extract just the
> text.
> If I crop the jpeg into a png file then tesseract does a great job.
>
> Does tesseract need the images to be cropped?
> What is a good way to crop images automatically?
>
> p.s. I have a few thousand photos to process.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/d955cb58-484f-484f-bfd2-079642fa853dn%40googlegroups.com