Tesseract produces searchable PDF directly. If you really want to use HOCR as
an intermediate format, you can but you will need external software. There are
a couple of "hocr2pdf" programs floating around and "OCRMyPDF" does an
admirable job tying things together. That said, going direct should g
I use it as follows and it works. Please check that you are using the correct
paths for the files.
combine_lang_model \
--input_unicharset ./layersan/san.unicharset \
--script_dir ~/langdata \
--words ~/langdata/san/san.wordlist \
--numbers ~/langdata/san/san.numbers \
--puncs ~/langdata/san/san.punc
I think pdf creation adds a text layer only and there isn't an option to
add HOCR to it.
@jbreiden can confirm.
On Mon, Sep 17, 2018 at 6:10 PM, Monica wrote:
> I have tried this, but it shows the default behaviour. I think the
> default output is overlaid on the pdf instead of being written as hocr.
>
I have tried this, but it shows the default behaviour. I think the
default output is overlaid on the pdf instead of being written as hocr.
On Mon, Sep 17, 2018 at 5:47 PM Monica wrote:
> Thanks, Zdenko, for your response.
> Will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on
> pdf file
Thanks, Zdenko, for your response.
Will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on the
pdf file?
On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny wrote:
> Something like this?
>
> tesseract scannedFile.png scanned.pdf -l eng hocr pdf
>
> Zdenko
>
>
> On Mon, 17 Sep 2018 at 14:12, monica
Something like this?
tesseract scannedFile.png scanned.pdf -l eng hocr pdf
Zdenko
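
As a side note on the command above, here is a minimal sketch of the same call
with a bare output base. It assumes, per the Tesseract command-line usage
(tesseract imagename outputbase [options] [configfile...]), that the second
argument is a base name to which Tesseract appends the extension for each
config, so "scanned" should give scanned.hocr and scanned.pdf, while
"scanned.pdf" as the base would give names like scanned.pdf.pdf. The file names
come from the thread; the existence checks are an addition so the sketch is
safe to run anywhere.

```shell
#!/bin/sh
# Sketch only: file names are taken from the thread; adjust to your setup.
input=scannedFile.png
base=scanned   # output base without extension; Tesseract appends .hocr / .pdf

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # The trailing "hocr" and "pdf" are config names, one per output format.
    tesseract "$input" "$base" -l eng hocr pdf
    # List whatever was produced under the chosen base name.
    ls -l "$base".*
else
    echo "tesseract or $input not found; skipping" >&2
fi
```
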
On Mon, 17 Sep 2018 at 14:12, monica kumari wrote:
> for OCRing a scanned pdf,
> first it is converted to image format then OCRed and gives a temporary
> file of pdf/text format and overlays on original scanned pd
For OCRing a scanned pdf, it is first converted to an image format and then
OCRed, which gives a temporary file in pdf/text format that is overlaid on the
original scanned pdf.
I want the output format to be hocr. For this, I ran the command
"convert scannedFile.pdf scannedFile.png" and then "tesseract
scannedFi
I used combine_lang_model like this:
combine_lang_model --input_unicharset ../combinelangmodel/fas.lstm-unicharset \
--script_dir ../combinelangmodel/sdir \
--outputdir outputdir \
--lang fas \
--lang_is_rtl true \
--words ..\lists\fas.wordlist \
--puncs ..\lists\fa