Re: [tesseract-ocr] Tesseract source directory: ./configure

2019-06-13 Thread Sivan Langer
That's the question. :) I'm talking about 4. But I discovered that to compile it successfully I had to browse through at least 10 other web pages. Documentation is lacking. The project needs someone to put all the documentation in order. Now I got it running on Mac Yosemite too, but I ca

Re: [tesseract-ocr] Tesseract source directory: ./configure

2019-06-13 Thread Polixia Incubator
Hi, can you provide the documentation links you followed? I'm doing the training on Mac (macOS High Sierra) too, but I'm getting "sh: unicharset_extractor: command not found" when running tesseract-trainer.py. Thanks in advance, BBDS On Thu, Jun 13, 2019 at 1:56 PM Sivan Langer wrote: > > That's the
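A "command not found" message from sh generally means the executable is not on PATH. In a source build the training tools are, as far as I know, built by separate make targets (`make training`, then `make training-install`), so they can be absent even when `tesseract` itself works. A minimal sketch for checking this, assuming a tool list that may not match what tesseract-trainer.py actually calls (`TRAINING_TOOLS` and `missing_tools` are names made up for illustration):

```python
import shutil

# Assumed list of training executables; adjust it to what your script invokes.
TRAINING_TOOLS = [
    "unicharset_extractor",
    "text2image",
    "combine_tessdata",
    "lstmtraining",
]


def missing_tools(names):
    """Return the names that cannot be found as executables on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TRAINING_TOOLS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All listed training tools were found on PATH.")
```

If a tool is reported missing, either it was never built or its install directory is not in the PATH of the shell that runs the script.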

[tesseract-ocr] Can you help me get text from the similar following images, with tesseract?

2019-06-13 Thread Nabi K.A.Z.
Can you help me get text from images similar to the following with tesseract? [image: exirbroker1.png] [image: exirbroker2.png] [image: exirbroker3.png]

[tesseract-ocr] Box file for testing data

2019-06-13 Thread Jennil Thiyam
Let's say I have a file "test.tiff" that I want to OCR. Can we get the box file for this data? I know we get a box file when creating training data, but what I want is to see how the model performs segmentation on my test data. I want to know this because I have some character w
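For context, a box file lists one symbol per line as `symbol left bottom right top page`, with the coordinate origin at the bottom-left of the image. Such a file is commonly produced for an arbitrary image with the `makebox` config (for example `tesseract test.tiff test makebox`); whether that config behaves the same in your build is an assumption to verify. The sketch below only parses box text that already exists, so the segmentation can be inspected; `Box`, `parse_box_text` and `box_size` are illustrative names, not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one 'symbol left bottom right top page' line."""
    # Split from the right so the symbol field is kept intact.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def parse_box_text(text: str) -> List[Box]:
    """Parse the full contents of a box file, skipping blank lines."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


def box_size(box: Box) -> Tuple[int, int]:
    """Return (width, height) of a box in pixels."""
    return (box.right - box.left, box.top - box.bottom)


if __name__ == "__main__":
    sample = "T 10 20 30 40 0\nh 32 20 45 41 0\n"
    for box in parse_box_text(sample):
        print(box.symbol, box_size(box))
```

Unusually wide or tall boxes in the output are a quick hint that two characters were merged or one was split.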

[tesseract-ocr] could not find fonts

2019-06-13 Thread Jingjing Lin
When I tried to fine-tune a few characters for chi_sim by typing: src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata --output_dir ~/tesstutorial/train I'm getting

[tesseract-ocr] Re: could not find fonts

2019-06-13 Thread Jingjing Lin
It turns out this is not actually a Tesseract problem but an operating system problem. We need to install the necessary fonts on our operating system (Ubuntu) via: sudo apt-get install *** A useful link is: https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-s

[tesseract-ocr] Re: Box file for testing data

2019-06-13 Thread Jingjing Lin
I think this link might be helpful, although I didn't succeed for some reason: https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging On Thursday, June 13, 2019 at 8:57:43 AM UTC-4, Jennil Thiyam wrote: > > Lets say I have a file "test.tiff" which i want to OCRed, can we get the > box file for this data. I k

Re: [tesseract-ocr] Re: Box file for testing data

2019-06-13 Thread Jennil Thiyam
Thanks, I will check it out. On Thu, Jun 13, 2019 at 9:46 PM Jingjing Lin wrote: > I think this link might be helpful although I didn't succeed for some > reason: > https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging > > On Thursday, June 13, 2019 at 8:57:43 AM UTC-4, Jennil Thiyam wrote: >> >> Lets say

Re: [tesseract-ocr] Re: could not find fonts

2019-06-13 Thread Shree Devi Kumar
FYI Font list used for LSTM training is at https://github.com/tesseract-ocr/langdata_lstm/blob/master/chi_sim/okfonts.txt https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh
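To see which fonts from such a list are absent on a machine, the list can be compared with what fontconfig reports. A rough sketch, assuming the required names are plain family names and that the installed names come from the output of `fc-list : family` (one line per font, with comma-separated aliases); `parse_fc_list_families` and `missing_fonts` are hypothetical helpers, and entries that carry a style suffix would need extra handling:

```python
from typing import Iterable, List, Set


def parse_fc_list_families(fc_list_output: str) -> Set[str]:
    """Collect lower-cased family names from `fc-list : family` style output."""
    families = set()
    for line in fc_list_output.splitlines():
        # A single line may hold several comma-separated aliases.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name.lower())
    return families


def missing_fonts(required: Iterable[str], fc_list_output: str) -> List[str]:
    """Return required family names that do not appear in the fc-list output."""
    installed = parse_fc_list_families(fc_list_output)
    return [font for font in required if font.strip().lower() not in installed]


if __name__ == "__main__":
    # Made-up sample output; on a real system, capture it from fc-list.
    sample = "DejaVu Sans\nWenQuanYi Zen Hei,WenQuanYi Zen Hei Sharp\n"
    print(missing_fonts(["WenQuanYi Zen Hei", "AR PL UKai CN"], sample))
```

This is an exact-name comparison only. `text2image --list_available_fonts --fonts_dir ...` is likely the more authoritative check, since it reports the fonts as the training tools themselves see them.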

Re: [tesseract-ocr] Trained data for E13B font

2019-06-13 Thread Shree Devi Kumar
see http://www.devscope.net/Content/ocrchecks.aspx https://github.com/BigPino67/Tesseract-MICR-OCR https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago wrote: > That'll be nice if there's traineddata out there but I didn't find any

[tesseract-ocr] unicharset of fine tuning a few characters

2019-06-13 Thread Jingjing Lin
When fine-tuning for a few characters (adding new characters that were not in the original unicharset), is the unicharset updated automatically, or do we need to change the original unicharset manually? I don't see any change to the unicharset after following the instructions below. Am I missing something? https://github.com

Re: [tesseract-ocr] Re: could not find fonts

2019-06-13 Thread Jingjing Lin
Thanks for the info. On Thursday, June 13, 2019 at 12:39:57 PM UTC-4, shree wrote: > > FYI > > Font list used for LSTM training is at > https://github.com/tesseract-ocr/langdata_lstm/blob/master/chi_sim/okfonts.txt > > > https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh > >

[tesseract-ocr] Install tesseract 3.04.01 in mac

2019-06-13 Thread Sarasi Lalithsena
Hi, I am trying to install tesseract 3.04.01 on Mac. Homebrew points to the latest version, which is tesseract 4.0. I listed the versions of tesseract available in Homebrew, but I do not see any other versions there. In this case, what is the easiest way to install the pre-built binary pa

Re: [tesseract-ocr] Changes in Tesseract 4.0 to 4.1 causing loss in precision

2019-06-13 Thread Beck Olson
Interesting. I'm getting the same results when using tesseract on the command line. When I run through the Tess4J API I get much worse results. Porting all the configuration parameters from the command line into the Tess4J API fixes this issue. Seems like the default configuration in Tess4J was

[tesseract-ocr] fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
When I tried to create new training data for fine-tuning a few characters using the command below: src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata --output_dir ~/tesstut

Re: [tesseract-ocr] Re: could not find fonts

2019-06-13 Thread Jingjing Lin
Why hasn't the list below been updated though? For chi_sim_fonts I only see the fonts used for base tesseract. Do we just need to manually add the fonts to language-specific.sh? https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh On Thursday, June 13, 2019 UTC-4, PM

[tesseract-ocr] Re: fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
Before the message src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault (core dumped) "${cmd}" "$@" 2>&1 20850 Done| tee -a ${LOG_FILE} it also shows: Error in pixCreateNoInit: pix_malloc fail for data Error in pixCreate: pixd not made On Thursday, June 13, 2019 UTC-4

[tesseract-ocr] Re: fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
I didn't have any problem when following the instructions to add '±' to eng.traineddata. Is it because Chinese has many more characters? On Thursday, June 13, 2019 at 4:04:45 PM UTC-4, Jingjing Lin wrote: > > before > > src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault > (core dumped

[tesseract-ocr] Re: fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
It turns out it is indeed because the chi_sim.training_text I was using was too large. I downloaded it from the langdata_lstm repository rather than the langdata repository, which appears to be the problem. (Sometimes it's bad to be too careful :) ) The .training_text from langdata is only 199 KB but is like 2
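The fix reported above was to use the smaller .training_text from langdata. If only the large file is at hand, one possible workaround is to trim it to a size budget on whole-line boundaries before passing it to the training script. This is only a sketch of that idea, not something the thread confirms gives good training results; `trim_training_text` and the byte limit are hypothetical.

```python
def trim_training_text(text: str, max_bytes: int) -> str:
    """Keep whole lines from the start while the UTF-8 size stays within max_bytes."""
    kept = []
    used = 0
    for line in text.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        if used + size > max_bytes:
            break  # stop at the first line that would exceed the budget
        kept.append(line)
        used += size
    return "".join(kept)


if __name__ == "__main__":
    # Arbitrary example budget of about 200 KB; pick what your machine handles.
    sample = "\u4e00\u4e8c\u4e09\nabc\nxyz\n"
    print(repr(trim_training_text(sample, 200 * 1024)))
```

Cutting on line boundaries avoids splitting a multi-byte character in the middle, which matters for Chinese text encoded as UTF-8.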

[tesseract-ocr] Re: unicharset of fine tuning a few characters

2019-06-13 Thread Jingjing Lin
Looks like the command generates the .unicharset and everything else itself. You just need to give it some text and it will make the .tif files etc. all by itself in the process. On Thursday, June 13, 2019 at 2:45:11 PM UTC-4, Jingjing Lin wrote: > > For fine tuning a few characters (adding new characters that was not in >