Hi, all. I am OCRing 10k invoices for AI training and, as it turns out, 
using Tesseract's -psm 4 exported as txt is ideal for this as it provides 
each individual line item as one uninterrupted line of text across the 
page, including all columns.

Example:

Product     Description        Quantity       Unit Price     Total
1001        Boots              2              $ 100.00       $ 200.00

*The only drawback is that -psm 4 does not use OSD (Orientation and Script 
Detection)* and will only accept invoices that are already correctly 
oriented. To solve this, I will first have to run -psm 0 to get an individual 
.osd file with the orientation of each file/page, and then run convert -rotate 
90 on the .TIF files where the invoice orientation is not already correct.
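
The two-pass workaround described above could be scripted roughly as follows. This is only a sketch under some assumptions: it assumes the .osd file contains a "Rotate: N" line (as written by recent Tesseract versions; older builds may format it differently), that N is the clockwise angle needed to bring the page upright, and that ImageMagick's convert is on the PATH. The function names parse_rotation and ocr_with_orientation, and the "_rotated" file suffix, are made up for illustration. Newer Tesseract releases spell the flag --psm, so it is left as a parameter.

```python
import re
import subprocess
from pathlib import Path


def parse_rotation(osd_text: str) -> int:
    """Return the rotation in degrees from OSD text, or 0 if no Rotate line exists."""
    # Assumption: the .osd file has a line such as "Rotate: 90".
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def ocr_with_orientation(tif: Path, psm_flag: str = "-psm") -> Path:
    """Run OSD (psm 0), rotate the image if needed, then OCR it with psm 4."""
    base = tif.with_suffix("")

    # Pass 1: orientation and script detection only; writes <base>.osd.
    subprocess.run(["tesseract", str(tif), str(base), psm_flag, "0"], check=True)
    rotation = parse_rotation(Path(f"{base}.osd").read_text())

    source = tif
    if rotation:
        # Write a rotated copy instead of overwriting the original scan.
        source = tif.with_name(f"{tif.stem}_rotated{tif.suffix}")
        subprocess.run(
            ["convert", str(tif), "-rotate", str(rotation), str(source)],
            check=True,
        )

    # Pass 2: single-column layout, one text line per invoice row; writes <base>.txt.
    subprocess.run(["tesseract", str(source), str(base), psm_flag, "4"], check=True)
    return Path(f"{base}.txt")
```

Reading the angle from the .osd file rather than always rotating by 90 would also cover pages that are upside down or rotated the other way. It is worth checking the result on a few known pages first, since the direction convention is an assumption here.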

*My question is*: Can I somehow create my own -psm 4, combining the full 
width text extraction with the Orientation (and Script Detection) from -psm 
1?

Or is there any other way to somehow invoke OSD or ensure full page width 
text as with -psm 4?

Thanks.
