Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-24 Thread Shree Devi Kumar
Please see
https://github.com/tesseract-ocr/tessdata_fast#example---jpn-and--japanese
for Ray's comment regarding the 'script' traineddata.


preserve_interword_spaces 1

was added via jpn.config to the jpn.traineddata file, and to the other CJK
languages, to fix this issue - see
https://github.com/tesseract-ocr/tessdata_fast/pull/7

We probably did not make the same change for the script traineddata files.

You can test this by passing the config variable on the command line, adding

-c  preserve_interword_spaces 1


(Please check the syntax, it might need a = sign)
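
For reference, the follow-up in this thread passed the variable in name=value form. Below is a minimal Python sketch that assembles that command line. The helper name tesseract_command is hypothetical and not part of Tesseract; the image name and language are taken from the original report.

```python
import shlex


def tesseract_command(image, lang, config=None):
    """Return the argument list for a tesseract run that prints text to stdout."""
    cmd = ["tesseract", "-l", lang]
    # Each config variable is passed as its own "-c name=value" pair.
    for name, value in (config or {}).items():
        cmd += ["-c", f"{name}={value}"]
    cmd += [image, "stdout"]
    return cmd


cmd = tesseract_command(
    "test_jpn_04.jpg", "Japanese", {"preserve_interword_spaces": 1}
)
# Print the command as a single shell-style line for inspection.
print(shlex.join(cmd))
```

To actually run it, the list can be handed to subprocess.run(cmd), assuming tesseract is on the PATH.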

On Tue, Jul 24, 2018 at 10:40 AM Atsuyoshi Suzuki <
atuyosi.unloc...@gmail.com> wrote:

> Hi Shree.
>
> I use tessdata_fast.
>
>
> On Tuesday, July 24, 2018 at 13:44:40 UTC+9, shree wrote:
>>
>> Which tessdata repository are you using for your trained data files?
>>
>> tessdata
>> tessdata_best
>> tessdata_fast
>>
>>
>>
>> On Tue 24 Jul, 2018, 9:01 AM Atsuyoshi Suzuki wrote:
>>
>>> Hi.
>>>
>>> I tried the new tesseract and the traineddata for Japanese (both
>>> jpn.traineddata and Japanese.traineddata).
>>>
>>> The recognition result with jpn.traineddata is very good.
>>>
>>> Japanese.traineddata also gives a good result, but unnecessary spaces are
>>> inserted between words or characters.
>>>
>>> Is this behavior expected? In Japanese, there are no spaces between
>>> words.
>>>
>>> If this behavior is expected, what kind of usage is Japanese.traineddata
>>> intended for?
>>>
>>> jpn.traineddata (very good, and what I expected):
>>>
>>> --- start ---
>>> $ tesseract -l jpn  test_jpn_04.jpg stdout
>>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>>> Estimating resolution as 168
>>> OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが
>>> できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。
>>>
>>> --- end ---
>>>
>>>
>>> Japanese.traineddata:
>>>
>>> --- start ---
>>> $ tesseract -l Japanese  test_jpn_04.jpg stdout
>>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>>> Estimating resolution as 168
>>> OCR 機能 を 提供 する Web API は いく つか 存在 し ます が 、 用 途 に よっ て カス タマ イズ する こと が
>>> で きま せん 。Tesseract は 多数 の 言語 に 対応 し 、Linux、macOS、Windows で 動作 し ます 。
>>>
>>> --- end ---
>>>
>>>
>>> The result is the same on Ubuntu (beta.1) and macOS
>>> (4.0.0-beta.2-586-g607e).
>>>
>>>
>>>
>>> Thanks.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/ccfcb61b-3afa-4ecc-b6ac-ae3aebc55465%40googlegroups.com
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Lorenzo Bolzani
I got this error when I mixed "best" models with non-"best" models.

I would try running

combine_tessdata -e base_model/eng.traineddata base_model/eng.lstm

again to generate eng.lstm from the "_best" model (the models in
/usr/share/tessdata are not the "_best" models).

Then, if the error is still there, I would also recreate the lstmf files,
just to be sure (I do not really know whether it matters).


Lorenzo


2018-07-23 22:56 GMT+02:00 Emiliano Isaza Villamizar :

> Hello everyone,
>
>
> I'm trying to train tesseract to improve the detection of some prices such
> as: CN¥2,400.48. I got to a point where I keep getting this error:
>
> total=`cat data/all-lstmf | wc -l` \
>    no=`echo "$total * 0.90 / 1" | bc`; \
>    head -n "$no" data/all-lstmf > "data/list.train"
> total=`cat data/all-lstmf | wc -l` \
>    no=`echo "($total - $total * 0.90) / 1" | bc`; \
>    tail -n "+$no" data/all-lstmf > "data/list.eval"
> combine_lang_model \
>   --input_unicharset data/unicharset \
>   --script_dir /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master \
>   --words /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.wordlist \
>   --numbers /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.numbers \
>   --puncs /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.punc \
>   --output_dir data/ \
>   --lang eng
> Loaded unicharset of size 113 from file data/unicharset
> Setting unichar properties
> Other case É of é is not in unicharset
> Setting script properties
> Config file is optional, continuing...
> Failed to read data from: /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.config
> Null char=2
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> mkdir -p data/checkpoints
> lstmtraining \
>   --continue_from /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm \
>   --old_traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.traineddata \
>   --traineddata data/eng/eng.traineddata \
>   --model_output data/checkpoints/eng \
>   --debug_interval -1 \
>   --train_listfile data/list.train \
>   --eval_listfile data/list.eval \
>   --sequential_training \
>   --max_iterations 3000
> Loaded file /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Code range changed from 111 to 112!
> Num (Extended) outputs,weights in Series:
>   1,36,0,1:1, 0
> Num (Extended) outputs,weights in Series:
>   C3,3:9, 0
>   Ft16:16, 160
> Total weights = 160
>   [C3,3Ft16]:16, 160
>   Mp3,3:16, 0
>   Lfys64:64, 20736
>   Lfx96:96, 61824
>   Lrx96:96, 74112
>   Lfx512:512, 1247232
>   Fc112:112, 0
> Total weights = 1404064
> Previous null char=110 mapped to 111
> Continuing from /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm
> Loaded 1/1 pages (1-1) of document /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/72b.lstmf
> Loaded 1/1 pages (1-1) of document /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/67e.lstmf
> Loaded 1/1 pages (1-1) of document /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/75c.lstmf
> Loaded 1/1 pages (1-1) of document /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/48b.lstmf
> Iteration 0: ALIGNED TRUTH : CN¥2,400.48
> Iteration 0: BEST OCR TEXT : ₩₩₩N₩₩4₩0₩0₩4₩8
> File /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/72b.lstmf page 0 :
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
> make: *** [data/checkpoints/eng_checkpoint] Segmentation fault (core dumped)
>
> I already tried downloading the best/tessdata eng.traineddata and
> substituting it in the continue_from, but I haven't been able to get past
> this error. Any thoughts?
>

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-24 Thread Atsuyoshi Suzuki
Thank you, Shree.


With '-c preserve_interword_spaces=1', I get the same result from jpn and Japanese.

$ tesseract -l Japanese -c preserve_interword_spaces=1 test_jpn_04.jpg stdout

The unnecessary space problem is solved. Thanks.


On Tuesday, July 24, 2018 at 16:28:22 UTC+9, shree wrote:
>
> Please see 
> https://github.com/tesseract-ocr/tessdata_fast#example---jpn-and--japanese
> for Ray's comment regarding the 'script' traineddata.
>
>
>

Is it reasonable to assume that it targets the case where English and
Japanese sentences are mixed in one image?

When English words are included in Japanese sentences, there seems to be
little difference between jpn and Japanese.



 


[tesseract-ocr] How is the Mean rms calculated?

2018-07-24 Thread j . biros
I have been looking through the documentation but cannot seem to find 
anything that explains how the rms is calculated.  I am a bit new to this 
sort of work, so I am not quite sure where to look.  Can anyone point me in 
the right direction?



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Emiliano Isaza Villamizar
I'm using OCR-D, which uses 4.0.0-beta.1.

On Tuesday, July 24, 2018 at 12:05:22 AM UTC-5, shree wrote:
>
> Which version of tesseract are you using?
>
> Please post output of
>
> tesseract -v
>

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-24 Thread mahendrag gajera
I am using Japanese.traineddata, which gives good results.


Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Emiliano Isaza Villamizar
I'm using OCR-D. I compiled it again after changing the .traineddata in the
original file, but it hasn't worked; I still get the same error.

Iteration 0: ALIGNED TRUTH : Zhejiang Huamei Holding Co Ltd
Iteration 0: BEST OCR TEXT : ₩Z₩h₩e₩j₩i₩a₩n₩ ₩₩u₩a₩m₩e ₩₩o₩₩d₩i₩n₩ ₩C₩o ₩L₩₩d
File /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/44c.lstmf page 0 :
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault (core dumped)

I ran make clean and re-ran it to regenerate the lstmf files, but got the
same error.




[tesseract-ocr] Re: Read Bold fonts with Tesseract API - JAVA

2018-07-24 Thread Raed Kubaizi
Any luck, guys?

>
>  
>
> Thanks
>



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread shree

>
>   --continue_from /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm \
>   --old_traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.traineddata \
>
Use eng.traineddata from tessdata_best
https://github.com/tesseract-ocr/tessdata_best

and extract the lstm file from it. 
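
The extraction step was shown earlier in this thread as a combine_tessdata -e call. Below is a minimal Python sketch that builds that call for a given traineddata file. The helper name extract_lstm_command is hypothetical, and the base_model/ directory is only the example path used earlier in the thread.

```python
from pathlib import PurePosixPath


def extract_lstm_command(traineddata):
    """Return the combine_tessdata call that extracts the LSTM model.

    The output is written next to the input, with the .traineddata
    suffix replaced by .lstm (eng.traineddata -> eng.lstm).
    """
    # PurePosixPath keeps forward slashes regardless of the host OS.
    src = PurePosixPath(traineddata)
    dst = src.with_suffix(".lstm")
    return ["combine_tessdata", "-e", str(src), str(dst)]


print(" ".join(extract_lstm_command("base_model/eng.traineddata")))
```

The resulting .lstm file is what --continue_from should point to, with --old_traineddata pointing at the same tessdata_best file it was extracted from.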



[tesseract-ocr] Problems when training tesseract in Spanish language

2018-07-24 Thread ricardo valadez
It happens whenever a word contains this tilde: the word is not recognized
and comes out changed. The same happens with the letter "ñ".



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Emiliano Isaza Villamizar
It worked; maybe I was using another eng.traineddata. Thank you for your 
time, Shree and Lorenzo.

kind regards,
Emiliano 

On Tuesday, July 24, 2018 at 11:40:34 AM UTC-5, shree wrote:
>
>> --continue_from /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm \
>> --old_traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/ \
>>
> Use eng.traineddata from tessdata_best
> https://github.com/tesseract-ocr/tessdata_best
>
> and extract the lstm file from it. 
>



[tesseract-ocr] Re: Problems when training tesseract in Spanish language

2018-07-24 Thread 'John Lee Ward' via tesseract-ocr
This may be a silly question, but I assume that when you call tesseract 
you are using the  -l spa  option?
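
For reference, a minimal sketch of a call with the language option, wrapped in Python. The image file name is a placeholder and the helper name is hypothetical; it assumes tesseract is on PATH and that spa.traineddata is installed.

```python
import subprocess


def ocr_command(image_path, lang="spa"):
    """Build a tesseract call that prints recognized text to stdout."""
    # Without "-l", tesseract falls back to its default language (eng),
    # whose character set does not cover accented vowels or "ñ" well.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="spa"):
    """Run tesseract and return the recognized text as a unicode string."""
    return subprocess.check_output(ocr_command(image_path, lang)).decode("utf-8")


# Placeholder image name (assumption).
print(" ".join(ocr_command("pagina.png")))
```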



On Tuesday, July 24, 2018 at 12:20:11 PM UTC-5, ricardo valadez wrote:
>
> It happens to the moment in which a word contains this tilde, it is not 
> recognized and the word changes, the same case is for the letter "ñ"
>



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Emiliano Isaza Villamizar
If anyone is following this thread and is using OCR-D, I had to change the 
start of the .py file by adding these lines because I kept getting a 
Unicode error:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')


On Tuesday, July 24, 2018 at 4:41:45 PM UTC-5, Emiliano Isaza Villamizar 
wrote:
>
> It worked maybe I was using another *eng.traineddata. *Thank you for your 
> time Shree and Lorenzo 
>
> kind regards,
> Emiliano 
>
> On Tuesday, July 24, 2018 at 11:40:34 AM UTC-5, shree wrote:
>>
>>> --continue_from /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm \
>>> --old_traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/ \
>>>
>> Use eng.traineddata from tessdata_best
>> https://github.com/tesseract-ocr/tessdata_best
>>
>> and extract the lstm file from it. 
>>
>



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Emiliano Isaza Villamizar
If anyone is following this thread and is using OCR-D, I had to modify the 
.py file because I kept getting a Unicode error. Just add these lines to 
the file:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
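
Note that reload(sys) and sys.setdefaultencoding exist only in Python 2. As an alternative sketch (not what the OCR-D script itself does), text files can be opened with an explicit encoding, which works on both Python 2 and 3 and leaves the interpreter-wide default alone. The helper names and the file name below are made up for illustration.

```python
import io
import os
import tempfile


def write_text_utf8(path, text):
    """Write unicode text to a file as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text_utf8(path):
    """Read a file as UTF-8, independent of the default encoding."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Round-trip a line containing a non-ASCII character.
sample = u"ma\u00f1ana"
path = os.path.join(tempfile.mkdtemp(), "sample.gt.txt")  # hypothetical name
write_text_utf8(path, sample)
restored = read_text_utf8(path)
```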


On Tuesday, July 24, 2018 at 4:41:45 PM UTC-5, Emiliano Isaza Villamizar 
wrote:
>
> It worked maybe I was using another *eng.traineddata. *Thank you for your 
> time Shree and Lorenzo 
>
> kind regards,
> Emiliano 
>
> On Tuesday, July 24, 2018 at 11:40:34 AM UTC-5, shree wrote:
>>
>>> --continue_from /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm \
>>> --old_traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/ \
>>>
>> Use eng.traineddata from tessdata_best
>> https://github.com/tesseract-ocr/tessdata_best
>>
>> and extract the lstm file from it. 
>>
>



[tesseract-ocr] Re: How is the Mean rms calculated?

2018-07-24 Thread 'John Lee Ward' via tesseract-ocr
I am new to tesseract also. Where in the tesseract world does the rms value 
come up? As a general rule in engineering, the rms value is 0.707 times the 
peak value if one is working with amps or volts and dealing with sinusoids. 
If the waveform is not sinusoidal, you need to know what kind of waveform 
you are dealing with: integrate the square of the waveform over one period, 
divide by the period, and take the square root. The average power (or its 
equivalent in another branch of physics/engineering) is proportional to the 
square of the rms value. In short, rms stands for "root mean square" and its 
definition and explanation can be found in many basic engineering and 
physics text books. 
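
To make the generic definition concrete, here is a small sketch for sampled values. It illustrates "root mean square" only; which quantities tesseract feeds into the "Mean rms" it reports is not covered by this thread, and the function name is just for illustration.

```python
import math


def rms(values):
    """Root mean square: square each value, average them, take the root."""
    values = list(values)
    if not values:
        raise ValueError("rms of an empty sequence is undefined")
    return math.sqrt(sum(v * v for v in values) / len(values))


# One full period of a unit-amplitude sine wave, sampled evenly.
n = 1000
sine = [math.sin(2 * math.pi * k / n) for k in range(n)]
print(round(rms(sine), 3))  # prints 0.707, i.e. 1/sqrt(2) of the peak
```

The same function applies to any list of numbers, for example per-output errors, which is the usual sense of rms in a machine-learning setting.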

On Tuesday, July 24, 2018 at 7:55:13 AM UTC-5, j.b...@churadata.okinawa 
wrote:
>
> I have been looking through the documentation but cannot seem to find 
> anything that explains how the rms is calculated.  I am a bit new to this 
> sort of work, so I am not quite sure where to look.  Can anyone point me in 
> the right direction?
>



[tesseract-ocr] Re: Problems when training tesseract in Spanish language

2018-07-24 Thread ricardo valadez
Maybe it is silly, but I'm new to tesseract ... I'll call it that way, thank 
you

On Tuesday, July 24, 2018 at 4:42:55 PM UTC-5, John Lee Ward wrote:
>
> This may be a silly question, but I assume that when you call tesseract 
> that you are using the  -l spa  option?
>
>
>
> On Tuesday, July 24, 2018 at 12:20:11 PM UTC-5, ricardo valadez wrote:
>>
>> It happens to the moment in which a word contains this tilde, it is not 
>> recognized and the word changes, the same case is for the letter "ñ"
>>
>
