Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-21 Thread ShreeDevi Kumar
Ryan, I had copied text with the extended range from Wikipedia etc. to create a quick training set. It is recommended to train with 'actual' text - I think Tesseract relies on language model data. Please see the tutorial on tesseract from https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkp

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-19 Thread Ryan Dev
I'm dealing with font subsets, and I generate an image per font, so there is no reading order, though I've seen Latin and CJK in the same font subset. If OSD just gives reading, orientation, and text order, it is not going to give me anything useful. Plus I have the font, so I could get some o

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-18 Thread shree
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
See whether using OSD to detect the script helps you choose the correct traineddata. On
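A rough sketch of how one of the modes above could be passed on the command line, built as an argument list in Python. The helper name and file names are made up for illustration, and the `-psm` spelling is an assumption based on the Tesseract 3.x releases current at the time of this thread (later releases spell it `--psm`). The table only holds the four modes quoted above.

```python
# Descriptions of the four page segmentation modes quoted in the post.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
}

def tesseract_argv(image, outbase, lang=None, psm=None, psm_flag="-psm"):
    """Build a tesseract command line as a list (hypothetical helper).

    psm_flag defaults to the assumed 3.x spelling; pass "--psm" for newer builds.
    """
    argv = ["tesseract", image, outbase]
    if lang:
        argv += ["-l", lang]          # e.g. "eng" or "eng+asc"
    if psm is not None:
        argv += [psm_flag, str(psm)]  # page segmentation mode number
    return argv

# Example: an OSD-only pass over a hypothetical image file.
cmd = tesseract_argv("page.png", "osd_out", psm=0)
print(PSM_MODES[0], "->", " ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where tesseract is installed.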

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-18 Thread Ryan Dev
Thanks again.
> you may get better results using appropriate language data rather than just
> the ascii range. Are the client documents sorted by language?
I'm not sure how they have them organised, I just know they want an "automatic" solution...
> I am attaching files used - i had just c

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-14 Thread Ryan Dev
> asc traineddata does not have a wordlist or dictionary, so using eng will
> help with that.
You mean unpack the wordlist from eng and pack it into the asc one? Or run tesseract with "eng+asc"? Currently I run each language in complete isolation from each other, and figure out the results

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-13 Thread ShreeDevi Kumar
asc traineddata does not have a wordlist or dictionary, so using eng will help with that. Also, I just trained using a few fonts that support the whole range. If you train with the font you are using, you will get better results. You can use 'combine_tessdata' command with the -u (unpack) option t
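For reference, a small sketch of the unpack call mentioned above, again as a Python argument list. The helper name and the file names are placeholders; the only thing taken from the post is the `combine_tessdata` command and its `-u` option, which takes the traineddata file followed by an output path prefix.

```python
def unpack_argv(traineddata, out_prefix):
    """Build a combine_tessdata unpack command (hypothetical helper).

    out_prefix is the prefix for the extracted component files,
    e.g. "eng." to get files named eng.<component>.
    """
    return ["combine_tessdata", "-u", traineddata, out_prefix]

# Example with placeholder file names.
cmd = unpack_argv("eng.traineddata", "eng.")
print(" ".join(cmd))
# To run it where the tool is installed:
#   import subprocess; subprocess.run(cmd, check=True)
```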

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-13 Thread Ryan Dev
Wow! Awesome. That file definitely helps. It fixed a few issues, but introduced a few of its own, so currently I am running "eng+asc" and that is giving great output, and is running faster than "eng+deu". Attached is an example image and output using asc. Note that asc is getting the 'ü' as a

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-12 Thread ShreeDevi Kumar
You can look at the unicharset of the traineddata to see the coverage. Try with eng+deu+iast. iast is a traineddata that I generated for Sanskrit transliteration in Roman/Latin script. https://code.google.com/r/shreeshrii-langdata/source/browse/iast.unicharset?name=iast https://code.google.com/r/
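A minimal sketch of checking coverage from a unicharset file, assuming the usual layout: the first line is the number of entries, and each following line starts with the character, then a space and its properties. The function names are made up, and the sample text below is invented for illustration (its property columns are placeholders, not copied from a real file).

```python
def unicharset_chars(text):
    """Return the set of characters listed in a unicharset file's text."""
    lines = text.splitlines()
    count = int(lines[0].strip())        # first line: number of entries
    chars = set()
    for line in lines[1:1 + count]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]    # first field is the character
        # The space character is stored as the literal token NULL.
        chars.add(" " if token == "NULL" else token)
    return chars

def missing_chars(unicharset_text, required):
    """List the required characters that the unicharset does not cover."""
    have = unicharset_chars(unicharset_text)
    return sorted(c for c in required if c not in have)

# Invented three-entry sample; only the first field of each line matters here.
SAMPLE = "3\nNULL 0 NULL 0\na 3 0,255,0,255 NULL 0\nü 3 0,255,0,255 NULL 0\n"
print(missing_chars(SAMPLE, "aü£"))
```

With a real file, the text would come from something like `open("iast.unicharset", encoding="utf-8").read()` after unpacking the traineddata.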

[tesseract-ocr] Covering ASCII Extended range.

2014-11-12 Thread Ryan Dev
For the project I am working on, I need to do OCR on documents with characters that are covered by the ISO 8859-1 Extended ASCII range (0x20-0xFF) http://www.ascii-code.com/ I was wondering, does anyone have traineddata files for this? Or do they know which existing language traineddata files would
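To make the target character set concrete, here is a small sketch that lists the printable characters in that range, for example as a starting point for a coverage check or a training text. It assumes strict ISO 8859-1, where 0x7F and 0x80-0x9F are control codes with no glyphs; if the documents are really Windows-1252, those positions mostly hold printable characters and the skip below would need changing. The function name is made up.

```python
def latin1_printable():
    """Return the printable ISO 8859-1 characters in the range 0x20-0xFF."""
    chars = []
    for code in range(0x20, 0x100):
        # Skip DEL (0x7F) and the C1 control block (0x80-0x9F).
        if code == 0x7F or 0x80 <= code <= 0x9F:
            continue
        chars.append(bytes([code]).decode("latin-1"))
    return chars

chars = latin1_printable()
print(len(chars))
print("".join(chars))
```

Note that the result still includes the space (0x20), the no-break space (0xA0), and the soft hyphen (0xAD), which are not visibly distinct glyphs in an image.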