Ryan,
I had copied text covering the extended range from Wikipedia etc. to create a
quick training set. It is recommended to train with 'actual' text - I think
Tesseract relies on language model data.
Please see the tutorial on Tesseract at
https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkp
I'm dealing with font subsets, and I generate an image per font, so there
is no reading order. Though I've seen Latin and CJK in the same font
subset. If OSD just gives reading order, orientation, and script, it is not
going to give me anything useful. Plus I have the font, so I could get some
o
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
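A quick sketch (mine, not from this thread) of how those page-segmentation modes get passed on the command line. Note the flag is spelled `-psm` in Tesseract 3.x but `--psm` in 4.x and later; the file names here are just placeholders.

```python
# Build a tesseract argv list for a given page-segmentation mode.
# Assumption: "-psm" for Tesseract 3.x, "--psm" for 4.x+.

def tesseract_cmd(image, outbase, psm, langs="eng", legacy=False):
    """Return the argv list for a tesseract run with an explicit PSM."""
    flag = "-psm" if legacy else "--psm"
    return ["tesseract", image, outbase, "-l", langs, flag, str(psm)]

# PSM 0 runs orientation and script detection only:
print(tesseract_cmd("page.png", "page", 0))
```

The same helper covers the multi-language case discussed below, e.g. `langs="eng+asc"`.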
See whether using OSD to detect the script helps you choose the correct
traineddata.
Thanks again.
> You may get better results using appropriate language data rather than just
> the ASCII range. Are the client documents sorted by language?
>
I'm not sure how they have them organised, I just know they want an
"automatic" solution...
>
> I am attaching the files used - I had just c
>
> asc traineddata does not have a wordlist or dictionary, so using eng will
> help with that.
You mean unpack the wordlist from eng and pack it into the asc one? Or run
tesseract with "eng+asc"? Currently I run each language in complete
isolation from the others, and figure out the results
asc traineddata does not have a wordlist or dictionary, so using eng will
help with that. Also, I just trained using a few fonts that support the
whole range. If you train with the font you are using, you will get better
results.
You can use the 'combine_tessdata' command with the -u (unpack) option t
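As a sketch of that unpack/repack round trip (file names illustrative, not from the thread): `combine_tessdata -u` splits a traineddata file into its component files, and `-o` overwrites components in an existing traineddata file. One caveat: the dawg files are tied to their unicharset, so a wordlist is normally rebuilt with `wordlist2dawg` against the target language's unicharset rather than copied across verbatim.

```python
# Hypothetical helpers that just assemble the combine_tessdata argv lists;
# run them via subprocess if you want to execute the commands.

def unpack_cmd(traineddata, prefix):
    """combine_tessdata -u: unpack a traineddata file into components."""
    return ["combine_tessdata", "-u", traineddata, prefix]

def overwrite_cmd(traineddata, *components):
    """combine_tessdata -o: overwrite components in a traineddata file."""
    return ["combine_tessdata", "-o", traineddata, *components]

# Unpack eng.traineddata into eng.* component files ...
print(unpack_cmd("eng.traineddata", "eng."))
# ... then pack a (rebuilt) word dawg into asc.traineddata:
print(overwrite_cmd("asc.traineddata", "asc.word-dawg"))
```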
Wow! Awesome.
That file definitely helps. It fixed a few issues, but introduced a few of
its own, so currently I am running "eng+asc" and that is giving great
output, and is running faster than "eng+deu".
Attached is an example image and output using asc. Note that asc is getting
the 'ü' as a
You can look at the unicharset of the traineddata to see the coverage.
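A small sketch of that coverage check (the sample entries below are invented to show the idea, not copied from a real file): in an unpacked unicharset, the first line is the entry count and the first whitespace-separated field of each following line is the symbol itself.

```python
# Parse a unicharset-style text and report which target characters
# are missing from it. Assumption: count on line 1, symbol is the
# first field of each subsequent line.

def covered_chars(unicharset_text):
    lines = unicharset_text.splitlines()
    return {line.split()[0] for line in lines[1:] if line.strip()}

# Illustrative sample, not a real unicharset:
sample = """4
NULL 0 NULL 0
a 3 0,255 Latin 1 0 1 a
ü 3 0,255 Latin 2 0 2 ü
A 5 0,255 Latin 3 0 3 A
"""
chars = covered_chars(sample)
missing = [c for c in "aAüß" if c not in chars]
print(missing)  # characters your documents need but the set lacks
```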
Try with eng+deu+iast.
iast is a traineddata that I generated for sanskrit transliteration in
roman/latin script.
https://code.google.com/r/shreeshrii-langdata/source/browse/iast.unicharset?name=iast
https://code.google.com/r/
For the project I am working on, I need to do OCR on documents with
characters covered by the ISO 8859-1 Extended ASCII range (0x20-0xFF):
http://www.ascii-code.com/
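For reference, a sketch of what that range actually contains: 0x20-0x7E are the printable ASCII characters, 0x7F-0x9F are control codes with no glyphs, and 0xA0-0xFF are the Latin-1 letters and symbols (0xA0 is a non-breaking space and 0xAD a soft hyphen, which you may want to drop from any training text).

```python
# Enumerate the ISO 8859-1 characters that have glyphs, e.g. to
# sanity-check a wordlist or synthetic training text for coverage.

latin1 = [chr(c) for c in range(0x20, 0x7F)] + \
         [chr(c) for c in range(0xA0, 0x100)]

# Drop the two glyphless entries (NBSP and soft hyphen) for display:
print(len(latin1), "".join(c for c in latin1 if c not in "\xa0\xad"))
```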
I was wondering, does anyone have traineddata files for this?
Or do they know which existing language traineddata files would