The attached PDF was OCRed by ocrmypdf using tesseract 4.00.00alpha. System: Linux 4.13.0-32-generic #35~16.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
In some PDF viewers (Evince, Chrome, Opera) everything is fine, but in others (Firefox, Alfresco Share, pdfjs) the spaces between words are lost. The text "Test PDF from LibreOffice" comes out as one big word, "TestPDFfromLibreOffice", after copy/paste. You can load the PDF into the pdfjs demo here: https://mozilla.github.io/pdf.js/web/viewer.html

If I use some other commercial OCR engines on the source PDF, I get an OCRed PDF with normal spaces in all PDF viewers (including pdfjs).

So this is a two-sided problem: the tesseract devs say it's a pdfjs problem, and the pdfjs devs say it's a tesseract problem.

Is it possible to solve this "spaces" problem with some options for tesseract (ocrmypdf) that force space recognition (as in other OCRs)? Or can someone help identify the root cause, so that I can give the pdfjs devs more information?
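To make the symptom easy to report, here is a minimal Python sketch that checks whether text copied from a viewer differs from the expected text only by missing whitespace. The function name `lost_only_spaces` is hypothetical and not part of tesseract, ocrmypdf, or pdfjs. It only compares two strings that you paste in by hand, and it says nothing about why the spaces are lost.

```python
def lost_only_spaces(expected: str, extracted: str) -> bool:
    """Return True when `extracted` matches `expected` except for dropped whitespace.

    Hypothetical helper for illustration: both arguments are plain strings,
    e.g. the known page text and the text copied out of a PDF viewer.
    """
    def strip_ws(s: str) -> str:
        # Remove every whitespace character (spaces, tabs, newlines).
        return "".join(s.split())

    same_characters = strip_ws(expected) == strip_ws(extracted)
    # Fewer whitespace-separated tokens means at least one word boundary vanished.
    fewer_words = len(extracted.split()) < len(expected.split())
    return same_characters and fewer_words


expected = "Test PDF from LibreOffice"
print(lost_only_spaces(expected, "TestPDFfromLibreOffice"))     # copy from Firefox/pdfjs
print(lost_only_spaces(expected, "Test PDF from LibreOffice"))  # copy from Evince/Chrome
```

With the strings from this report, the first call reports the lost-spaces case and the second does not, which shows the recognized characters themselves are identical in both viewers.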
Attachment: Testpdfsandwich.pdf (Adobe PDF document)