Hi, all. I am OCRing 10k invoices for AI training and, as it turns out, Tesseract's -psm 4 with txt output is ideal for this: it returns each individual line item as one uninterrupted line of text across the full page width, including all columns.
Example:

Product Description Quantity Unit Price Total
1001 Boots 2 $ 100.00 $ 200.00

*The only drawback is that -psm 4 does not use OSD (Orientation and Script Detection)*, so it only works on invoices that are already correctly oriented. To work around this, I would first have to run -psm 0 to get an individual .osd file with the orientation of each file/page, and then run convert -rotate 90 on the .TIF files where the invoice orientation is not already correct.

*My question is*: Can I somehow create my own -psm 4 that combines the full-width text extraction with the Orientation (and Script Detection) from -psm 1? Or is there any other way to invoke OSD while still getting full-page-width text lines as with -psm 4?

Thanks.
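
For reference, the two-pass workaround described above could be sketched roughly like this in Python. This is only a sketch under a few assumptions: that the .osd file written by -psm 0 contains a "Rotate: N" line (as recent Tesseract versions print), that passing that value to convert -rotate turns the page upright (worth checking on a few sample pages), and that the helper names and the "_upright" file suffix are made up for illustration. Newer Tesseract releases spell the flag --psm, so it is left as a parameter.

```python
import re
import subprocess
from pathlib import Path


def parse_rotation(osd_text: str) -> int:
    """Return the 'Rotate:' value from OSD output, or 0 if the field is absent."""
    match = re.search(r"^Rotate:\s*(\d+)\s*$", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def build_commands(tif: Path, rotation: int, psm_flag: str = "-psm"):
    """Build (but do not run) the rotate and OCR commands for one page."""
    commands = []
    ocr_input = tif
    if rotation:
        # Hypothetical naming scheme: write the corrected copy next to the original.
        ocr_input = tif.with_name(tif.stem + "_upright" + tif.suffix)
        commands.append(
            ["convert", str(tif), "-rotate", str(rotation), str(ocr_input)]
        )
    # Tesseract appends .txt to the output base name by itself.
    out_base = tif.with_suffix("")
    commands.append(["tesseract", str(ocr_input), str(out_base), psm_flag, "4"])
    return commands


def process(tif: Path, psm_flag: str = "-psm") -> None:
    """Pass 1: OSD only. Pass 2: rotate if needed, then OCR with psm 4."""
    base = tif.with_suffix("")
    subprocess.run(["tesseract", str(tif), str(base), psm_flag, "0"], check=True)
    osd_text = Path(str(base) + ".osd").read_text()
    for command in build_commands(tif, parse_rotation(osd_text), psm_flag):
        subprocess.run(command, check=True)
```

Calling process(Path("invoice_0001.TIF")) for each file would run both passes. Pages that OSD reports as upright skip the convert step entirely.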