Many thanks to George Chriss! (see above)

My workaround, based on his description: transform the hOCR file that
tesseract produces with an XSLT stylesheet (see below), then pass the result
to hocr2pdf 0.8.9. The text boxes are then placed (almost) correctly.

$ tesseract image.tif ocr_file hocr 
$ xsltproc -html -nonet -novalid -o ocr_fixed.hocr fix-hocr.xsl ocr_file.hocr
$ hocr2pdf -i image.tif -o searchable.pdf <ocr_fixed.hocr

See attached file fix-hocr.xsl.
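
For several pages, the three steps above could be wrapped in a small shell
function. This is only a sketch: the names ocr_to_pdf, run and DRY_RUN are made
up for illustration, and it assumes tesseract writes its output to
<basename>.hocr as in the commands above. With DRY_RUN=1 it only prints the
commands, so the file names can be checked before anything is run.

```shell
#!/bin/sh
# Sketch only: ocr_to_pdf, run and DRY_RUN are illustrative names,
# not part of tesseract, xsltproc or hocr2pdf.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

ocr_to_pdf() {
    image=$1                    # scanned page, e.g. image.tif
    xsl=${2:-fix-hocr.xsl}      # stylesheet from the attachment
    base=${image%.*}            # image.tif -> image

    # 1. OCR the image (assumption: output lands in $base.hocr)
    run tesseract "$image" "$base" hocr || return 1

    # 2. Apply the stylesheet to the hOCR output
    run xsltproc -html -nonet -novalid -o "${base}_fixed.hocr" \
        "$xsl" "$base.hocr" || return 1

    # 3. Merge image and corrected text layer; hocr2pdf reads hOCR on stdin
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "hocr2pdf -i $image -o $base.pdf < ${base}_fixed.hocr"
    else
        hocr2pdf -i "$image" -o "$base.pdf" < "${base}_fixed.hocr"
    fi
}
```

After sourcing the file, "ocr_to_pdf image.tif" would write image.pdf next to
the input; a second argument selects a different stylesheet path.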

** Attachment added: "use on hocr file to fix for hocr2pdf 0.8.9 textbox placement"
   https://bugs.launchpad.net/cuneiform-linux/+bug/623438/+attachment/4432658/+files/fix-hocr.xsl

https://bugs.launchpad.net/bugs/623438

Title:
  Font size not correct in merged sandvich PDF