Re: [tesseract-ocr] tesseract-ocr

2018-06-19 Thread Shree Devi Kumar
Which version of tesseract?

How did you train the fonts? What was the accuracy level for training? How many
iterations?
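
A note on the test command quoted below: the tesseract command line takes the image first, then an output base name, then options, and the language is selected with -l. So in "tesseract out.tiff eng" the word eng is read as the output base name, not as a language. A minimal sketch of the intended form, assuming the custom model was saved as hand.traineddata in ./tessdata (both names are hypothetical):

```shell
# Sketch only: 'hand' and './tessdata' are hypothetical names; use your own.
build_ocr_cmd() {
    image=$1; outbase=$2; lang=$3; tessdata=$4
    # Syntax: tesseract IMAGE OUTPUTBASE [options]; -l selects the language.
    printf 'tesseract %s %s -l %s --tessdata-dir %s\n' \
        "$image" "$outbase" "$lang" "$tessdata"
}

# Print the command for inspection before running it.
build_ocr_cmd out.tiff result hand ./tessdata
```

With this form the recognized text would be written to result.txt.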

On Tue, Jun 19, 2018 at 3:00 PM Navaneetha Bitla 
wrote:

> Hi, this is Navaneetha
>
> I'm working on a handwritten character recognition project.
>
> I have trained 1300 different handwritten English fonts and moved the
> files into the tessdata directory.
>
> I tested tesseract using the commands below:
>
> $ convert -density 300 input.png -depth 8 -strip -background white -alpha off out.tiff
>
> $ tesseract out.tiff eng
>
> The input.png is in the Alanis Hand font, which I have trained, but I'm
> not getting even 40% accuracy.
>
> Can someone help me?
>
>
> Thanks in advance.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/24acbae0-13e3-4eac-a55a-802629665854%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-20 Thread Shree Devi Kumar
I have done training for Sanskrit in both Devanagari and IAST, but it does
not include the cedilla (ç) used for Sh.

I will add it and let you know.

On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:

> I have tried Google OCR for recognizing Sanskrit text in Roman with
> diacritics (IAST). It recognizes the macron above, but not the dots below
> or the combining grave and acute accents. Is there any traineddata available
> for tesseract that can do this with good accuracy? Attached is a sample page
> that I am interested in.


Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
You will have better control over training if you use tesstrain.sh, provided
with tesseract.

On Wed, Jun 20, 2018 at 8:52 PM Navaneetha Bitla 
wrote:

> http://www.1001fonts.com/handwritten-fonts.html.
>
> The link above has 1900+ fonts. I downloaded the ttf files from that
> site and converted them to tiff files online.
>
> Then I trained on the tiff files (fonts) using Serak trainer.
>
>
> If you get good accuracy, please forward the results so everyone can know
> and follow you.
>
> Thank you
>
> On Wed, Jun 20, 2018 at 3:13 PM, James Q 
> wrote:
>
>> I'm going to be using tesseract 4 and using the tesstrain.sh script. If I
>> come across things that improve accuracy though I will let you know.
>>
>> Where did you find 1300 handwriting fonts?
>>
>> On Tuesday, June 19, 2018 at 5:19:54 PM UTC+1, Navaneetha Bitla wrote:
>>>
>>> I am using Serak trainer to train tesseract 3.05.
>>>
>>>
>>>
>>> On Tue, Jun 19, 2018 at 9:29 PM, James Q  wrote:
>>>
>>>> Hi Navaneetha
>>>> I am also looking to start training tesseract using handwritten fonts
>>>> and am about to start setting up my training environment. Are you training
>>>> tesseract 4 by following the guide at
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>> ?
>>>>
>>>> If so are you fine tuning the existing english model, retraining just
>>>> the top layer(s) or training from scratch with your additional fonts?
>>>>
>>>> Thanks
>>>> Jim
>>>>



Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-20 Thread Shree Devi Kumar
I am attaching the OCRed text. Please correct it so that I can use it as
ground truth for further training and testing.

Çrīgaheçāya nam a ḥ.

1.
A thāto Gobhiloktānām anyeshālṁ caiva kārmāṇām
aspashṭānāṁ vidhiṁ samyag dārçayishye pradīpavat | 1. |
Trīvṛā īūirdāhvavṛtali, kāryaṁ tanṭutṭrayam adhovṛtam
trivṛt tac copaāvītalṁ syāt tasyaiko granthir ishyate | 2. |
Pṛṣhṭhavaṁçe ca nābhyāṁ ca dhṛtaṁ yad vindate kaṭim
tad dhāryam upavītaṁ syān nātolambali na cocchritam | 83. |
Sadopavītinā bhāvyaṁ sadā baddhaçikhena cā
viçikho vyupavītaç ca yat karoti na tat kṛtam | 4. |
Triḥ prāçyāpo dvir unmṛjya mukham etāny upaspṛçect
āsyanāsākṣhikarṇāṁç ca nābhivakṣhahcçiroṁsakān | 5. |
Aṅgushṭhena pradeçinyā ghrāṇaṁ caivam upaspṛçet /
aṅgushṭhānāmikābhyāṁ ca cakṣhuḥ çrotṛa punaḥ punaḥ | 6. |
Kanishṭhāṅgushṭhayor nābhiṁ hṛdayaṁ tu talena vai
sarvābhis tu giraḥ paçcād bāhū cāgreṇa saṁspṛçet | 7. |
Yatropadiçyāte karma kartur aṅgaṁ na tūcyate *
dakṣhiṇas tatra vijñeyaḥ karmaṇāṁ pāragaḥ karaḥ | 8. |
Yatra diṅniyamo na syāj japahomādikarmasu
tisras tatra diçaḥ proktā aindrīsaumyāparājitā]ḥ | 9. |
Tishṭhāann āsīnaḥ prahvo vā niyamo yatra nedṛçaḥ
tadāsīnena kartavyaṁ na prahveṇa na tishṭhatā | 10. |


Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05

https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh

I haven't trained with tesseract 3 for a while. I will post instructions for
tesseract 4 later.

On Wed, Jun 20, 2018 at 9:05 PM Navaneetha Bitla 
wrote:

> Can you help us by explaining how to train with tesstrain.sh?
>
> It will help all of us; we are thankful to you.

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
Attached is a BASH script for Finetune training for 'Impact' (refer to
Ray's tutorial in wiki for more details).
Use this when you want to finetune a model for a single new font.

You will need to change the paths for directories and filenames based on
your system.

The script assumes that you have tesseract 4.0.0-beta installed along with
the training tools. Refer to the wiki main page for info on how to download
the latest version of the code from a PPA etc.

Please read through the script first, change as needed, create the required
training texts and then run the script.

#!/bin/bash
#
# Script to finetune a language traineddata file for one new font
# for tesseract4.0.0-beta
# Modify directory paths and filenames as required for your setup.
#
# Choose which parts of script are to be run?
MakeData=yes
RunTraining=yes
RunEval=yes
#

# Language
Lang=eng

# downloaded directory with language data
langdata_dir=~/langdata

# Make about 150 lines of representative training text for finetuning
finetune_training_text=$langdata_dir/$Lang/$Lang.finetune.training_text

# Make about 150 lines of representative training text for evaluation
eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text

# fonts directory for this system
fonts_dir=~/.fonts

# Finetune training for IMPACT - ONE font ONLY
fonts_for_training=" \
'Alanis Hand'  \
"

# directory with the old 'best' language training set to continue from, e.g. ara, eng, san
bestdata_dir=~/tessdata_best

# tessdata-dir which has osd.traineddata, eng.traineddata, configs and tessconfigs folders and pdf.ttf
tessdata_dir=~/tessdata

# directory with training scripts - tesstrain.sh etc.
tesstrain_dir=~/tesseract/src/training

# output directories for this run
trained_output_dir=./$Lang-finetune-impact
eval_output_dir=./$Lang-finetune-impact-eval

if [ $MakeData = "yes" ]; then

echo "## MAKING EVAL DATA ##"
 rm -rf $eval_output_dir
 mkdir $eval_output_dir

echo " running tesstrain.sh for eval text "

eval bash $tesstrain_dir/tesstrain.sh \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_training \
--langdata_dir $langdata_dir \
--tessdata_dir  $tessdata_dir \
--training_text $eval_training_text \
--output_dir $eval_output_dir

echo "## MAKING TRAINING DATA ##"
 rm -rf $trained_output_dir
 mkdir $trained_output_dir

echo " running tesstrain.sh for training text "

eval bash $tesstrain_dir/tesstrain.sh \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_training \
--langdata_dir $langdata_dir \
--tessdata_dir  $tessdata_dir \
--training_text $finetune_training_text \
--output_dir $trained_output_dir

echo " running combine_tessdata to extract lstm model from 'tessdata_best' for $Lang "

combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm

fi

if [ $RunTraining = "yes" ]; then

echo "## LSTM TRAINING ##"

echo " running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #"

lstmtraining \
--continue_from  $bestdata_dir/$Lang.lstm \
--traineddata $bestdata_dir/$Lang.traineddata \
--max_iterations 1000 \
--debug_interval 0 \
--train_listfile $trained_output_dir/$Lang.training_files.txt \
--model_output  $trained_output_dir/finetune

echo "## BUILD FINETUNED MODEL ##"

echo " Building final trained file $Lang-finetune-$Lang.traineddata "

lstmtraining \
--stop_training \
--continue_from $trained_output_dir/finetune_checkpoint \
--traineddata $bestdata_dir/$Lang.traineddata \
--model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"

fi

if [ $RunEval = "yes" ]; then

echo "## EVAL ORIGINAL MODEL ##"

lstmeval \
--model  $bestdata_dir/$Lang.traineddata \
--eval_listfile $eval_output_dir/$Lang.training_files.txt \
--verbosity 0

echo "## EVAL FINETUNED MODEL ##"

lstmeval \
--model  $trained_output_dir/$Lang-finetune-$Lang.traineddata \
--eval_listfile $eval_output_dir/$Lang.training_files.txt \
--verbosity 0

fi
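
Before running the script above, it may help to confirm that the training tools it relies on are on the PATH. This is only a sketch: the helper name missing_tools is made up, and text2image is listed on the assumption that tesstrain.sh calls it to render the training text.

```shell
# Hypothetical helper: prints each named command that is not on the PATH
# and returns non-zero if any of them is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}

# Tools used by the finetune script (tesstrain.sh itself is found via
# $tesstrain_dir, so it is not checked here).
missing_tools combine_tessdata lstmtraining lstmeval text2image \
    || echo "install the tesseract training tools before continuing"
```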



Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
Here are the bash script files:

1. for finetune for impact training - add a font
2. for finetune plus-minus training - for adding a new character


Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
>
> Thank you very much sir


Ma'am, not Sir. I am Mrs. Kumar.

Let me know if you have any questions or need clarification regarding the
scripts. I will post them on the wiki after any needed changes.



Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
Tesseract4 LSTM training is line based.
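
In practice this means ground-truth text has to be available one text line at a time, each matched to an image of that line. A rough sketch of splitting a transcription into per-line files, assuming a plain-text file with one text line per row; the line_ prefix and the .gt.txt suffix are illustrative choices and are not taken from this thread:

```shell
# Split a text file into one file per non-empty line and print the count.
# Usage: split_lines INPUT_TEXT OUTPUT_DIR
split_lines() {
    input=$1; outdir=$2; n=0
    mkdir -p "$outdir"
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip lines that are empty or contain only whitespace.
        case $line in
            *[![:space:]]*) ;;
            *) continue ;;
        esac
        n=$((n + 1))
        printf '%s\n' "$line" > "$outdir/$(printf 'line_%04d' "$n").gt.txt"
    done < "$input"
    echo "$n"
}
```

Each resulting text file would still need a matching line image before it is of any use for training.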

On Thu 21 Jun, 2018, 12:25 PM chandra churh chatterjee, <
chandrachurh.chatterje...@gmail.com> wrote:

> Excuse me @Shree Devi Kumar, can you please tell me whether training data
> for tesseract 4.0 would be better as images of handwritten paragraphs
> or as single-character images, as follows?

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
I tried training with the handwriting font you mentioned in your first
message.

I think that font has the same shapes for capital and lowercase
letters.

So recognition rates will be lower for it.

On Thu 21 Jun, 2018, 1:49 PM Navaneetha Bitla, 
wrote:

> Yes, I tried to train with these images, but it gives errors about dpi etc.
>
> Then I moved to a ttf font, converted the ttf to tiff and finally trained
> on that data, but the output is very bad. I don't know whether the bad
> results come from the training process or the dataset.
>
> Still trying to make progress.

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
> Quite a few of these handwriting fonts are uppercase letters only (so
> lowercase comes out as uppercase when typed). What is the best type of
> [lang].training_text data to use for training these - is it uppercase only?

It would depend on the application the training is meant for.

If you want support for both upper case and lower case, then make a list of
the fonts that have only uppercase letters and create .lstmf files for them
using a training text that has only capitals. For the rest of the fonts, use
a normal training text with both upper and lower case. When running
lstmtraining, use both sets of .lstmf files.
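
A minimal sketch of that last step, assuming each tesstrain.sh run left a list
of its .lstmf files in a file such as eng.training_files.txt under its output
directory (the directory names in the comments are hypothetical). The lists
are concatenated into one file, which could then be handed to lstmtraining
with --train_listfile.

```shell
#!/bin/sh
# Merge several lists of .lstmf files into one list,
# dropping blank lines and duplicate entries.
# Usage: merge_lstmf_lists OUT_LIST IN_LIST...
merge_lstmf_lists() {
    out_list=$1
    shift
    # NF skips empty lines; seen[] keeps only the first copy of each path
    awk 'NF && !seen[$0]++' "$@" > "$out_list"
}

# Example with hypothetical output directories from two tesstrain.sh runs:
#   merge_lstmf_lists all.training_files.txt \
#       eng-upper/eng.training_files.txt \
#       eng-mixed/eng.training_files.txt
```

The merged list keeps the order in which the files were first seen, so the
uppercase-only set and the mixed-case set stay grouped.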



Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
You can use ALL fonts at once. However, I have had errors with box files
not being created for some fonts, and the tesstrain_utils.sh script only dies
at the end, when it checks whether the files are readable. In that case
you have to restart the whole process.
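
If one failing font forces a full restart, it may be less painful to generate
the .lstmf files in batches, so that only the batch containing the bad font
has to be rerun. A rough sketch, assuming a plain text file with one font name
per line (the file name, prefix and batch size are hypothetical); each batch
file could then feed one tesstrain.sh run through --fontlist.

```shell
#!/bin/sh
# Split a font list (one font name per line) into numbered batch files.
# Usage: split_font_list FONT_FILE BATCH_SIZE OUT_PREFIX
split_font_list() {
    font_file=$1
    batch_size=$2
    out_prefix=$3
    awk -v n="$batch_size" -v p="$out_prefix" '
        NF {
            # open a new batch file every n non-empty lines
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s%03d.txt", p, ++batch)
            }
            print > out
            count++
        }' "$font_file"
}

# Example: split_font_list fonts.txt 50 batch_
# writes batch_001.txt, batch_002.txt, ... with up to 50 fonts each
```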

On Thu, Jun 21, 2018 at 8:28 PM James Q  wrote:

> Hi Shree, I'm trying out the script you posted earlier which is great so
> thank you! I was wondering how many fonts I can specify at once in the
> 'fonts_for_training' list. I have run it with 9 fonts at once and that
> seems fine but I would like to do 100s or even 1000s if I can. Is this the
> best way or would I be better off creating the lstmf files in batches first?
>


Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
 # Make about 150 lines of representative training text for finetuning
finetune_training_text=$langdata_dir/$Lang/$Lang.finetune.training_text

# Make about 150 lines of representative training text for evaluation
eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text




On Thu, Jun 21, 2018 at 10:03 PM  wrote:

> @Shree
>
> Thanks for providing the two bash scripts
> I want to ask you about tesstrain.sh and tesstrain_utils.sh, Is there
> something that must be edited before running lstmtrain_finetune_impact.sh ?
>
> On Wednesday, June 20, 2018 at 11:56:27 PM UTC+3, shree wrote:
>>
>> Here are the bash script files:
>>
>> 1. for finetune for impact training - add a font
>> 2. for finetune plus-minus training - for adding a new character
>>
>> On Thu, Jun 21, 2018 at 1:40 AM Shree Devi Kumar 
>> wrote:
>>
>>> Attached is a BASH script for Finetune training for 'Impact' (refer to
>>> Ray's tutorial in wiki for more details).
>>> Use this when you want to finetune a model for a single new font.
>>>
>>> You will need to change the paths for directories and filenames based on
>>> your system.
>>>
>>> The script assumes that you have tesseract 4.0.0-beta installed
>>> along with the training tools. Refer to the wiki main page for info on how
>>> to download the latest version of the code from the PPA etc.
>>>
>>> Please read through the script first, change as needed, create the
>>> required training texts and then run the script.
>>>
>>> #!/bin/bash
>>> #
>>> # Script to finetune a language traineddata file for one new font
>>> # for tesseract4.0.0-beta
>>> # Modify directory paths and filenames as required for your setup.
>>> #
>>> # Choose which parts of script are to be run?
>>> MakeData=yes
>>> RunTraining=yes
>>> RunEval=yes
>>> #
>>>
>>> # Language
>>> Lang=eng
>>>
>>> # downloaded directory with language data
>>> langdata_dir=~/langdata
>>>
>>> # Make about 150 lines of representative training text for finetuning
>>> finetune_training_text=$langdata_dir/$Lang/$Lang.finetune.training_text
>>>
>>> # Make about 150 lines of representative training text for evaluation
>>> eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text
>>>
>>> # fonts directory for this system
>>> fonts_dir=~/.fonts
>>>
>>> # Finetune training for IMPACT - ONE font ONLY
>>> fonts_for_training=" \
>>> 'Alanis Hand'  \
>>> "
>>>
>>> # directory with the old 'best' language training set to continue from,
>>> # e.g. ara, eng, san
>>> bestdata_dir=~/tessdata_best
>>>
>>> # tessdata-dir which has osd.traineddata, eng.traineddata, configs and
>>> # tessconfigs folders and pdf.ttf
>>> tessdata_dir=~/tessdata
>>>
>>> # directory with training scripts - tesstrain.sh etc.
>>> tesstrain_dir=~/tesseract/src/training
>>>
>>> # output directories for this run
>>> trained_output_dir=./$Lang-finetune-impact
>>> eval_output_dir=./$Lang-finetune-impact-eval
>>>
>>> if [ $MakeData = "yes" ]; then
>>>
>>> echo "## MAKING EVAL DATA ##"
>>>  rm -rf $eval_output_dir
>>>  mkdir $eval_output_dir
>>>
>>> echo " running tesstrain.sh for eval text "
>>>
>>> eval bash $tesstrain_dir/tesstrain.sh \
>>> --lang $Lang \
>>> --linedata_only \
>>> --noextract_font_properties \
>>> --exposures "0" \
>>> --fonts_dir $fonts_dir \
>>> --fontlist $fonts_for_training \
>>> --langdata_dir $langdata_dir \
>>> --tessdata_dir  $tessdata_dir \
>>> --training_text $eval_training_text \
>>> --output_dir $eval_output_dir
>>>
>>> echo "## MAKING TRAINING DATA ##"
>>>  rm -rf $trained_output_dir
>>>  mkdir $trained_output_dir
>>>
>>> echo " running tesstrain.sh for training text "
>>>
>>> eval bash $tesstrain_dir/tesstrain.sh \
>>> --lang $Lang \
>>> --linedata_only \
>>> --noextract_font_properties \
>>> --exposures "0" \
>>> --fonts_dir $fonts_dir \
>>> --fontlist $fonts_for_tra

Re: [tesseract-ocr] Getting error while creating .lstm files

2018-06-21 Thread Shree Devi Kumar
Look at src/training/language_specific.sh

The list of default fonts for English is being picked up from there and you
probably don't have them installed.

Use fonts that are installed on your system, and pass them to tesstrain.sh
with --fontlist.
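
One way to act on this is to compare the fonts you intend to train with
against what is actually installed before starting a run. The sketch below
assumes two plain text files with one font name per line; how the "available"
list is produced is left open (for instance from fc-list, or from
text2image --list_available_fonts, whose output would need trimming down to
bare names first).

```shell
#!/bin/sh
# Print the wanted fonts that do NOT appear in the available list.
# Matching is literal and whole-line, so names must be spelled identically.
# Usage: missing_fonts WANTED_FILE AVAILABLE_FILE
missing_fonts() {
    # grep exits with 1 when nothing is missing; treat that as success
    grep -F -x -v -f "$2" "$1" || [ $? -eq 1 ]
}

# Example (hypothetical file names):
#   missing_fonts wanted_fonts.txt available_fonts.txt
```

An empty result means every wanted font is present in the available list.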

On Fri, Jun 22, 2018 at 9:20 AM Harathi Surya 
wrote:

> Hi,
>
> I am trying to create .lstm files to finetune tesseract4.0.0 for new
> characters. I want to fine tune tesseract to recognize new characters like
> ±.
> What i tried:
> I added text that consists of the plus or minus symbol to the
> eng.training_text in langdata.
> Then I tried to run the following command
>
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng
> --linedata_only --noextract_font_properties --langdata_dir ../langdata
>  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus
>
> I am getting the following error:
> ERROR: /tmp/tmp.3qWucNlYrH/eng/eng.Arial.exp0.box does not exist or is not
> readable
>
> The error repeated for all the font types.
>
> Can you please give some suggestions why this error occurs and how to
> solve this?
>
> Thanks in advance
> Harathi
>


Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Shree Devi Kumar
Please try with a different psm and see if you get better results. If you
share a sample image we can test and respond.
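
Whichever psm works best, the word coordinates can then be pulled out of the
resulting hOCR file. A small sketch, assuming the hOCR output has one
ocrx_word span per line with single-quoted attributes and no nested markup
inside the span (the function name is made up for illustration):

```shell
#!/bin/sh
# Print "x0 y0 x1 y1 word" for every ocrx_word span in hOCR input.
# Reads the named files, or standard input when no file is given.
hocr_words() {
    sed -n "s/.*class='ocrx_word'[^>]*title='bbox \([0-9]*\) \([0-9]*\) \([0-9]*\) \([0-9]*\)[^']*'[^>]*>\([^<]*\)<.*/\1 \2 \3 \4 \5/p" "$@"
}

# Example, after something like: tesseract file.png out --psm 7 hocr
#   hocr_words out.hocr
```

Words wrapped in extra tags such as <strong> or <em> would print with an empty
word field under these assumptions, so a proper HTML parser may be the safer
choice for anything beyond a quick check.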

On Fri, Jun 22, 2018 at 5:29 PM  wrote:

> Could someone please try to give me an answer for my language.
>
> On Friday, June 15, 2018 at 2:42:00 PM UTC+2, ahka.an...@gmail.com wrote:
>>
>> Dear All,
>>
>> In the project that I am currently working on, I have a pure text line
>> cropped from a document image.
>>
>> As a next step, I need to recognize the text and, at the same time,
>> get the coordinates of the words.
>>
>> To get the coordinates, I pass the hocr config on the command
>> line and set the page segmentation mode to 7 (single line):
>>
>> tesseract file.png out.txt --psm 7 hocr
>>
>> However, the output is really bad: with these parameters
>> the line is treated as a page, and some words are missing from
>> the output.
>>
>> Is there another way to get the word coordinates for that line?
>>


Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Shree Devi Kumar
Try adding a slight white border to images and see if that helps.
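
One possible way to do this is with ImageMagick, which was already used
earlier on this list for preprocessing. The wrapper below is only a sketch:
the function name, the DRY_RUN switch and the default width of 10 pixels are
assumptions, not something tesseract requires.

```shell
#!/bin/sh
# Pad an image with a white border using ImageMagick's convert.
# Usage: add_white_border IN OUT [BORDER_PX]
# With DRY_RUN=1 the command is only printed, not executed.
add_white_border() {
    src=$1
    dst=$2
    px=${3:-10}   # border width in pixels; 10 is an arbitrary starting point
    set -- convert "$src" -bordercolor white -border "${px}x${px}" "$dst"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example: add_white_border line.png line_padded.png 20
```

The padded image can then be passed to tesseract in place of the original
cropped line.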

On Fri, Jun 22, 2018 at 7:35 PM  wrote:

> Thanks for the reply
> Those are two line examples.
>


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-22 Thread Shree Devi Kumar
Please try with iast.traineddata model for tesseract.4.0.0-beta posted at
https://github.com/Shreeshrii/tessdata_sanskrit

On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:

> one more correction.
>
>
> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>
>> done
>>
>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>
>>> I am attaching the OCRed text. Please correct it so that  I can use as
>>> groundtruth for further training and testing.
>>>
>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar 
>>> wrote:
>>>
>>>> I had done a training for sanskrit for both devanagari and IAST but it
>>>> does not include cedilla for Sh
>>>>
>>>> I will add it and let you know.
>>>>
>>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>>
>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman script
>>>>> with diacritics (IAST). It recognizes the macron above but not the dots
>>>>> below, nor the combining grave and acute accents. Is there any
>>>>> traineddata available for tesseract that can do this with good
>>>>> accuracy? I have attached a sample page of the kind I am interested in.
>>>>>


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-22 Thread Shree Devi Kumar
Sorry, there seems to be some regression in the file posted on github. I
will upload again later.

On Fri, Jun 22, 2018 at 7:56 PM Shree Devi Kumar 
wrote:

> Please try with iast.traineddata model for tesseract.4.0.0-beta posted at
> https://github.com/Shreeshrii/tessdata_sanskrit


Re: [tesseract-ocr] Re: Not getting correct output even after finetuning tesseract with new character

2018-06-22 Thread Shree Devi Kumar
Did you run the eval as given in the following section?

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

Did you stop training and create a new traineddata file?

Are you using the new traineddata file for testing?
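
For the last two questions, a quick sanity check may help. The sketch below
assumes that a traineddata file written by lstmtraining --stop_training after
training ended should be newer than the checkpoint it was built from; the
function name and the paths in the comments are hypothetical.

```shell
#!/bin/sh
# Succeed only if MODEL exists and is newer than CHECKPOINT.
# Usage: is_fresh_model MODEL CHECKPOINT
is_fresh_model() {
    model=$1
    checkpoint=$2
    [ -f "$model" ] && [ -f "$checkpoint" ] && [ "$model" -nt "$checkpoint" ]
}

# Example (hypothetical paths):
#   is_fresh_model out/eng.traineddata out/plusminus_checkpoint \
#       || echo "traineddata is missing or older than the checkpoint"
```

If the check passes, pointing tesseract at that directory explicitly, for
example with --tessdata-dir together with -l eng, is one way to make sure the
new file rather than the system-wide one is used for testing.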

On Sat, Jun 23, 2018 at 12:36 AM Harathi Surya 
wrote:

> Hi,
>
> I am facing one more problem here.
> I have successfully trained tesseract for the new character.
>
> But when I tried to test it with the following command:
>
> 'tesseract test.png out -l eng'
>
> the output is not satisfactory. I trained tesseract for the '±' character,
> but the output is the same before and after finetuning. I trained
> the model for 3600 iterations. The final loss is 0.0013.
>
> Please find attached the input image and the output text I
> am getting.
> Can anyone please help me with this?
>
> Thanks,
> Harathi
>


Re: [tesseract-ocr] Re: Getting error while creating .lstm files

2018-06-22 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

On Sat, Jun 23, 2018 at 12:55 AM Harathi Surya 
wrote:

> Hi Shree,
>
> Thanks for your reply.
> I replaced the fontlist argument 'Impact Condensed' with 'DejaVu Sans' to
> create the evalplusminus folder.
>
> Then I ran the lstmeval command and got this output:
> At iteration 0, stage 0, Eval Char error rate=0.024610566, Word error
> rate=0.086171938
>
> Do I need to create a new traineddata file?
>
> I have traineddata files in '/local/share/tessdata' for the old data and in
> 'tesstutorial/trainplusminus/eng', which was created for the new data.
>
> Do I need to set TESSDATA_PREFIX='tesstutorial/trainplusminus/eng'
> instead of '/local/share/tessdata'?
>
> Please guide me
>
> Thanks in advance,
> Harathi
>
>
>


Re: [tesseract-ocr] Re: Getting error while creating .lstm files

2018-06-22 Thread Shree Devi Kumar
The tutorial has been written by Ray Smith. I haven't tested the plus-minus
as given.

Check whether the fonts you are using have the plus-minus sign.

Training with a single font applies to the Impact tutorial, which uses 400
iterations.

For plus-minus you need to use the larger list of fonts.
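
Alongside the font check, it may be worth confirming that the training text
really contains the new character often enough to be learned. A small sketch,
assuming a UTF-8 training text; the function name and the example path are
hypothetical.

```shell
#!/bin/sh
# Count the lines of a training text that contain a given literal string.
# Usage: count_lines_with TEXT FILE
count_lines_with() {
    # grep -c prints 0 and exits with 1 when nothing matches;
    # treat that as a normal result rather than an error
    grep -c -F -- "$1" "$2" || [ $? -eq 1 ]
}

# Example: count_lines_with '±' ../langdata/eng/eng.training_text
```

A count of zero would mean no generated training line can contain the
character, regardless of which fonts are used.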

On Sat, Jun 23, 2018 at 1:13 AM Harathi Surya 
wrote:

> Sorry by mistake uploaded the wrong file. Please find the attached file
> for the output i got.
>
> Thanks,
> Harathi
>
> On Friday, June 22, 2018 at 12:41:25 PM UTC-7, Harathi Surya wrote:
>>
>> Thanks Shree,
>>
>> I followed the instructions and ran the following command:
>>
>> src/training/lstmtraining --stop_training   --continue_from
>> ~/tesstutorial/trainplusminus/plusminus_checkpoint   --traineddata
>> ~/tesstutorial/trainplusminus/eng/eng.traineddata   --model_output
>> ~/tesstutorial/trainplusminus/eng.traineddata
>>
>> Then I changed TESSDATA_PREFIX to '/tesstutorial/trainplusminus' and
>> tested the model with the image I attached in the previous email.
>> The output changed a little, but not as expected: the '±' symbol is
>> replaced by a '+' symbol. Please find the attached output file.
>> Might training for more epochs improve this?
>>
>> Thanks,
>> Harathi
>>


Re: [tesseract-ocr] why my hocr file look like this

2018-06-23 Thread Shree Devi Kumar
 tesseract test.png result horc

You used the wrong config file name. It should be hocr, not horc.
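
A typo like this can be caught before tesseract is even started. The sketch below checks the config name against a short list and then builds the command line. The list is an assumption (hocr, pdf and tsv are the output configs one would expect under tessdata/configs for 3.05, so check your own install), and build_command is a hypothetical helper, not part of tesseract.

```python
import difflib

# Assumed subset of output config names; look in tessdata/configs
# of your installation for the real list.
KNOWN_OUTPUT_CONFIGS = ["hocr", "pdf", "tsv"]


def build_command(image, output_base, config):
    """Return the argv list for `tesseract IMAGE OUTPUTBASE CONFIG`.

    Raises ValueError for an unknown config name and, where possible,
    suggests the closest known name.
    """
    if config not in KNOWN_OUTPUT_CONFIGS:
        hint = difflib.get_close_matches(config, KNOWN_OUTPUT_CONFIGS, n=1)
        suggestion = f" (did you mean '{hint[0]}'?)" if hint else ""
        raise ValueError(f"unknown config '{config}'{suggestion}")
    return ["tesseract", image, output_base, config]


if __name__ == "__main__":
    print(build_command("test.png", "result", "hocr"))
    try:
        build_command("test.png", "result", "horc")
    except ValueError as err:
        print(err)
```

The returned list can be passed to subprocess.run as is; with the hocr config the output should land in result.hocr.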

On Sat, Jun 23, 2018 at 12:23 PM Ben Zhang  wrote:

> Hi, All,
> I used tesseract 3.05, and type 'tesseract test.png result horc' in
> command line, get result.horc, in this file it has:
>
> *Provider* *Networks* Precertification 808.791.7505 direct 888.941.4622
> x302toll-free 808.535.8398 fax
>
> Medical *&* Dental *-* *Hawaii* Medical *-* Mainland 888.941 .HMAA (4622)
> *V* Cigna PPO *‘* hmaa.com/providers *HWMG* cigna.com 4F Clgna Submit
> claims directly to HWMG: Submit claims directly to Cigna: PO Box 32580 PO
> Box 188061 Honolulu, HI 96803-2580 Chattanooga, TN 37422-8061 Payer ID
> 48330 Payer ID 62308 8 Drug *-* *Hawaii* *&* Mainland *Vision* *-*
> *Hawaii* *&* Mainland 855.785.6960 .._ Vision Choice {.3
> Express—Scripts.com fl; amass *SCRIPTSE* 800.877.7195 VS V" VS p *I* CO m
> 9; *care* for Me Submit claims directly to Express Scripts Submit claims
> directly to VSP.
>
>
> \ or call 800.922.1557 for pharmacy help.
>
> Why no info like
>
> LibTesseract.simple_read(config_line_with_hocr, 'phrase.png')
>   
>
> 
>  the  class='ocrx_word' id='word_1_2' title='bbox 53 13 97 25; x_wconf 84' 
> lang='eng' dir='ltr'>book  id='word_1_3' title='bbox 111 13 129 25; x_wconf 79' lang='eng' 
> dir='ltr'>is  title='bbox 143 17 164 25; x_wconf 83' lang='eng' dir='ltr'>on  class='ocrx_word' id='word_1_5' title='bbox 178 14 209 25; x_wconf 75' 
> lang='eng' dir='ltr'>the  id='word_1_6' title='bbox 223 14 276 25; x_wconf 76' lang='eng' 
> dir='ltr'>table
>  
> 
>
>   
>
> I am new to tesseract. Thanks for your help
>


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-23 Thread Shree Devi Kumar
Please test with traineddata file from
https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1

I need to check that it is not overfitted.

Please share a couple more images which I can use for testing.


On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:

> one more correction.
>
>
> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>
>> done
>>
>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>
>>> I am attaching the OCRed text. Please correct it so that  I can use as
>>> groundtruth for further training and testing.
>>>
>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar 
>>> wrote:
>>>
>>>> I had done a training for sanskrit for both devanagari and IAST but it
>>>> does not include cedilla for Sh
>>>>
>>>> I will add it and let you know.
>>>>
>>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>>
>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman with
>>>>> diacritics (IAST). It recognizes above macron but not dots below also
>>>>> joining grave and accent. Is there any traineddata available for tesseract
>>>>> that can do this with good accuracy ? Attached a sample page that I am
>>>>> interested in.
>>>>>


Re: [tesseract-ocr] "read_params_file: parameter not found: " for hindi

2018-06-25 Thread Shree Devi Kumar
It looks like you are using the wrong version of the traineddata file, i.e. a
3.0x hin.traineddata with code for tesseract 4.0.0.

On Mon, Jun 25, 2018 at 1:01 PM Kiran Sonar  wrote:

> Hi,
> I am trying to get Hindi text from attached image. But when i set language
> to "hin" i get "read_params_file: parameter not found: "
> I am using Tess4J 3.8.4.
> Anyone knows what is the issue here.
>


Re: [tesseract-ocr] Re: "read_params_file: parameter not found: " for hindi

2018-06-25 Thread Shree Devi Kumar
I am not familiar with Tess4J 3.8.4.

Have you tried it directly from the command line?

It is also possible that you are not using the correct syntax for the command
and the language name is being used as the output file name.

Try the following:

tesseract input.png output -l hin
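
To see why the argument order matters, here is a simplified model of how the command line is read: the first positional argument is the image, the second is the output base name, and the language is only set by -l (otherwise eng is used). This is a sketch for illustration, not tesseract's actual parser, and interpret is a hypothetical name.

```python
def interpret(argv):
    """Simplified model of how the tesseract CLI assigns its arguments."""
    roles = {"image": None, "output_base": None, "lang": "eng", "configs": []}
    args = list(argv[1:])  # drop the program name
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-l" and i + 1 < len(args):
            roles["lang"] = args[i + 1]  # language given explicitly
            i += 2
            continue
        if roles["image"] is None:
            roles["image"] = arg  # first positional: input image
        elif roles["output_base"] is None:
            roles["output_base"] = arg  # second positional: output base name
        else:
            roles["configs"].append(arg)  # anything after that: config names
        i += 1
    return roles


if __name__ == "__main__":
    # Language name without -l ends up as the output file name.
    print(interpret(["tesseract", "input.png", "hin"]))
    # Intended form.
    print(interpret(["tesseract", "input.png", "output", "-l", "hin"]))
```

In the first call the model reports output_base "hin" and lang "eng", which is the situation described above.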


On Mon, Jun 25, 2018 at 2:34 PM Kiran Sonar  wrote:

> Using this trainedata has significantly improved my accuracy for English.
> I expected same for hindi. I tried adding Configs files, Langdata for hindi
> but that doesnt work.
>


Re: [tesseract-ocr] java.lang.UnsatisfiedLinkError: The specified module could not be found.

2018-06-26 Thread Shree Devi Kumar
Please post in https://github.com/nguyenq/tess4j/issues

On Tue, Jun 26, 2018 at 1:30 PM Kiran Sonar  wrote:

> I moved to tess4j_4.00 from tess4J_3.8.4 which is neccesary to use new
> trainedData- best files. I am getting this error
> Exception in thread "main" java.lang.UnsatisfiedLinkError: The specified
> module could not be found.
>
> at com.sun.jna.Native.open(Native Method)
> at com.sun.jna.Native.open(Native.java:1759)
> at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:260)
> at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
> at com.sun.jna.Native.register(Native.java:1396)
> at com.sun.jna.Native.register(Native.java:1156)
> at net.sourceforge.tess4j.TessAPI1.(TessAPI1.java:41)
> at OCR.confidenceWord(OCR.java:106)
> at OCR.processImg(OCR.java:381)
> at test.main(test.java:10)
>
> I've reinstalled VC++ 2015 redistributable. I looked up in
> dependencyWalker but no dll seems to be missing. I enabled JNA logs and get
> this
> Looking in classpath from sun.misc.Launcher$AppClassLoader@73d16e93 for
> /com/sun/jna/win32-x86-64/jnidispatch.dll
> Found library resource at
> jar:file:/D:/Kiran/Update_27/Test/lib/jna-4.1.0.jar!/com/sun/jna/win32-x86-64/jnidispatch.dll
> Looking for library 'libtesseract400'
> Adding paths from jna.library.path:
> D://KITE_DATA//ExternalLib//TextLib//win32-x86-64;C:\Users\kirans13\AppData\Local\Temp\tess4j\win32-x86-64
> Trying D:\KITE_DATA\ExternalLib\TextLib\win32-x86-64\libtesseract400.dll
> Adding system paths: []
> Trying D:\KITE_DATA\ExternalLib\TextLib\win32-x86-64\libtesseract400.dll
> Looking for lib- prefix
> Trying liblibtesseract400.dll
> Looking in classpath from sun.misc.Launcher$AppClassLoader@73d16e93 for
> libtesseract400
> Found library resource at
> jar:file:/D:/Kiran/Update_27/Test/lib/tess4j-4.0.0.jar!/win32-x86-64/libtesseract400.dll
>
> Does anyone know how to resolve this issue?
>


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-26 Thread Shree Devi Kumar
I used ghostview to convert the PDF to tif or png.

You can OCR the PDF directly with gImageReader using the traineddata file I
sent.

See the links to the new Windows binaries in the message below.


At last, here are some fresh builds:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe

I'd be also interested in testing of the tessdata manager, which should now
also properly handle script tessdatas

On Tue 26 Jun, 2018, 10:59 PM yajva,  wrote:

> The doc is diff ver of the same text. Here's the doc used for the first.
> png. This is slightly darker, but the one sent earlier is cleaner. Let me
> know which is more amenable for OCRing. I use PDF Shaper to extract images
> and convert to png using xnview.
>
> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>
>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>
>> How did you create the test png from the pdf? I am not getting as good
>> quality, tried various settings with irfanview.
>>
>>
>>
>> On Tue, Jun 26, 2018 at 4:58 PM yajva  wrote:
>>
>>> Sorry for the delay, my system was down.
>>>
>>> I am getting "Page not Found" for the link given. Can you pl re-check?
>>>
>>> Here's the doc I am trying to OCR
>>>
>>>
>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>
>>>> Please test with traineddata file from
>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>
>>>> Need to check that is it not overfitted.
>>>>
>>>> Please share a couple more images which I can use for testing.
>>>>
>>>>
>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:
>>>>
>>>>> one more correction.
>>>>>
>>>>>
>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>
>>>>>> done
>>>>>>
>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>
>>>>>>> I am attaching the OCRed text. Please correct it so that  I can use
>>>>>>> as groundtruth for further training and testing.
>>>>>>>
>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I had done a training for sanskrit for both devanagari and IAST but
>>>>>>>> it does not include cedilla for Sh
>>>>>>>>
>>>>>>>> I will add it and let you know.
>>>>>>>>
>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>>>>>>
>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman
>>>>>>>>> with diacritics (IAST). It recognizes above macron but not dots below 
>>>>>>>>> also
>>>>>>>>> joining grave and accent. Is there any traineddata available for 
>>>>>>>>> tesseract
>>>>>>>>> that can do this with good accuracy ? Attached a sample page that I am
>>>>>>>>> interested in.
>>>>>>>>>

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-27 Thread Shree Devi Kumar
ok. I will take a look.

On Wed, Jun 27, 2018 at 5:04 PM yajva  wrote:

> Checked with both light & dark pdfs. The results are very good. Thanks.
>
> A few concerns. E is consistently missed in both. J is missed consistently
> in darker image but recognized as T in dark image. ṝ is recognized as ṛ
> consistently. Can these be addressed ?
> I am using tesseract 4 alpha windows build from command line.
>
> Are the dev files in repos ?
>
>
> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>
>> I had used ghostview to convert PDF to tif or png.
>>
>> You can ocr PDF directly with gimagereader using the traineddata file I
>> sent.
>>
>> See links for new windows binaries in msg below.
>>
>>
>> At last, here are some fresh builds:
>>
>>
>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>
>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>
>> I'd be also interested in testing of the tessdata manager, which should
>> now also properly handle script tessdatas
>>
>> On Tue 26 Jun, 2018, 10:59 PM yajva,  wrote:
>>
>>> The doc is diff ver of the same text. Here's the doc used for the first.
>>> png. This is slightly darker, but the one sent earlier is cleaner. Let me
>>> know which is more amenable for OCRing. I use PDF Shaper to extract images
>>> and convert to png using xnview.
>>>
>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>
>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>
>>>> How did you create the test png from the pdf? I am not getting as good
>>>> quality, tried various settings with irfanview.
>>>>
>>>>
>>>>
>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva  wrote:
>>>>
>>>>> Sorry for the delay, my system was down.
>>>>>
>>>>> I am getting "Page not Found" for the link given. Can you pl re-check?
>>>>>
>>>>> Here's the doc I am trying to OCR
>>>>>
>>>>>
>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>
>>>>>> Please test with traineddata file from
>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>
>>>>>> Need to check that is it not overfitted.
>>>>>>
>>>>>> Please share a couple more images which I can use for testing.
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:
>>>>>>
>>>>>>> one more correction.
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>>
>>>>>>>> done
>>>>>>>>
>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>>>
>>>>>>>>> I am attaching the OCRed text. Please correct it so that  I can
>>>>>>>>> use as groundtruth for further training and testing.
>>>>>>>>>
>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>>> shree...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I had done a training for sanskrit for both devanagari and IAST
>>>>>>>>>> but it does not include cedilla for Sh
>>>>>>>>>>
>>>>>>>>>> I will add it and let you know.
>>>>>>>>>>
>>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>>>>>>>>
>>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman
>>>>>>>>>>> with diacritics (IAST). It recognizes above macron but not dots 
>>>>>>>>>>> below also
>>>>>>>>>>> joining grave and accent. Is there any traineddata available for 
>>>>>>>>>>> tess

Re: [tesseract-ocr] java.lang.UnsatisfiedLinkError: The specified module could not be found.

2018-06-28 Thread Shree Devi Kumar
Training was done by Ray Smith at Google.

Available documentation is in the wiki.

See
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#documentation

The info you are looking for may be in the PowerPoint files.

On Thu, Jun 28, 2018 at 1:11 PM chandra churh chatterjee <
chandrachurh.chatterje...@gmail.com> wrote:

> @Shree Devi Kumar ,
> Can I get a complete detailed description of the Neural Network
> Architecture of the Tesseract 4 with diagram relating to what the net_spec
> command line of lstm training specifies.
>
> On Tue, Jun 26, 2018 at 1:42 PM Shree Devi Kumar 
> wrote:
>
>> Please post in https://github.com/nguyenq/tess4j/issues
>>
>> On Tue, Jun 26, 2018 at 1:30 PM Kiran Sonar 
>> wrote:
>>
>>> I moved to tess4j_4.00 from tess4J_3.8.4 which is neccesary to use new
>>> trainedData- best files. I am getting this error
>>> Exception in thread "main" java.lang.UnsatisfiedLinkError: The specified
>>> module could not be found.
>>>
>>> at com.sun.jna.Native.open(Native Method)
>>> at com.sun.jna.Native.open(Native.java:1759)
>>> at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:260)
>>> at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
>>> at com.sun.jna.Native.register(Native.java:1396)
>>> at com.sun.jna.Native.register(Native.java:1156)
>>> at net.sourceforge.tess4j.TessAPI1.(TessAPI1.java:41)
>>> at OCR.confidenceWord(OCR.java:106)
>>> at OCR.processImg(OCR.java:381)
>>> at test.main(test.java:10)
>>>
>>> I've reinstalled VC++ 2015 redistributable. I looked up in
>>> dependencyWalker but no dll seems to be missing. I enabled JNA logs and get
>>> this
>>> Looking in classpath from sun.misc.Launcher$AppClassLoader@73d16e93 for
>>> /com/sun/jna/win32-x86-64/jnidispatch.dll
>>> Found library resource at
>>> jar:file:/D:/Kiran/Update_27/Test/lib/jna-4.1.0.jar!/com/sun/jna/win32-x86-64/jnidispatch.dll
>>> Looking for library 'libtesseract400'
>>> Adding paths from jna.library.path:
>>> D://KITE_DATA//ExternalLib//TextLib//win32-x86-64;C:\Users\kirans13\AppData\Local\Temp\tess4j\win32-x86-64
>>> Trying D:\KITE_DATA\ExternalLib\TextLib\win32-x86-64\libtesseract400.dll
>>> Adding system paths: []
>>> Trying D:\KITE_DATA\ExternalLib\TextLib\win32-x86-64\libtesseract400.dll
>>> Looking for lib- prefix
>>> Trying liblibtesseract400.dll
>>> Looking in classpath from sun.misc.Launcher$AppClassLoader@73d16e93 for
>>> libtesseract400
>>> Found library resource at
>>> jar:file:/D:/Kiran/Update_27/Test/lib/tess4j-4.0.0.jar!/win32-x86-64/libtesseract400.dll
>>>
>>> Does anyone know how to resolve this issue?
>>>

Re: [tesseract-ocr] How come tesseract 4.0 misses, what am I missing here?

2018-06-28 Thread Shree Devi Kumar
Rotate your photo to the correct orientation and try again.
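
If it is not obvious how far the photo is turned, orientation and script detection can report it. The sketch below only parses such a report. It assumes tesseract 4 with osd.traineddata installed and that `tesseract photo.jpg stdout --psm 0` prints text in the shape of SAMPLE; the values in SAMPLE are made up, and rotation_from_osd is a hypothetical helper.

```python
import re

# Assumed shape of an OSD report; the numbers are invented for illustration.
SAMPLE = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 5.61
Script: Latin
Script confidence: 2.50
"""


def rotation_from_osd(osd_text):
    """Return the value of the 'Rotate:' line as an int, or None if absent."""
    match = re.search(r"^Rotate:\s*(\d+)\s*$", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(rotation_from_osd(SAMPLE))
```

The reported angle can then be applied in any image editor or library before running the normal recognition pass.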

On 6/28/18, cohengil...@gmail.com  wrote:
> I'm  quite new to tesseract and would like to use it in a project for OCR
> purposes,
> I found a tutorial on the web with photos, so I have executed tesseract
> (tesseract 4.0.0-beta.2) on it,
> and noticed it has *successfully retrieved every single word*, wow
> IMPRESSIVE!!
>
> so I took my smartphone and took a crystal clear photo (no blurry), and
> hoped it would work for me too.
> but *NOTHING it failed miserably* (every word miss :/ bummer)
>
> I read this too:
> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
>
> I tried to figure out what's i'm doing wrong by comparing the metedata EXIF
>
> of each photo,
> but apparently the photo's metadata from the web tutorial has been stripped
>
> :/
>
> Can someone explain to me. what am i missing here??
> I'm attaching the two photos.
>
>
> Thank you in advance :)
>
>


Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
I modified the makefile for ocrd-train to do fine-tuning.  It is pasted
below:

export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA =  $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = frk

# Name of the model to continue from
CONTINUE_FROM = frk

# Normalization Mode - see src/training/language_specific.sh for details
NORM_MODE = 2

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Train directory
TRAIN := data/train

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "+$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"

$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$*   --psm 6 lstm.train

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@

# Clean all generated files
clean:
find data/train -name '*.box' -delete
find data/train -name '*.lstmf' -delete
rm -rf data/all-*
rm -rf data/list.*
rm -rf data/$(MODEL_NAME)
rm -rf data/unicharset
rm -rf data/checkpoints
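
The split done by the list.train and list.eval rules can also be tried outside of make. The following is only a sketch: the function name split_list and the integer percentage argument are assumptions made for this example (the makefile itself uses RATIO_TRAIN with bc). It relies on `head -n N` returning the first N lines and `tail -n N` returning the last N lines; `tail -n +N` would instead start printing at line N.

```sh
# Split a list of .lstmf paths into a training part and an evaluation part.
# split_list and the integer percentage are illustrative assumptions.
split_list() {
    list=$1        # file with one .lstmf path per line
    percent=$2     # share of lines used for training, e.g. 90
    total=$(wc -l < "$list")
    n_train=$(( total * percent / 100 ))
    n_eval=$(( total - n_train ))
    head -n "$n_train" "$list" > "$list.train"   # first n_train lines
    tail -n "$n_eval" "$list" > "$list.eval"     # last n_eval lines
}
```

Called as `split_list some-list-file 90`, it writes some-list-file.train and some-list-file.eval next to the input, and the two parts never overlap.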

On Fri, Jun 29, 2018 at 5:31 PM Lorenzo Bolzani  wrote:

> Hi,
> I'm trying to do fine tuning of an existing model using line images and
> text labels. I'm running this version:
>
> tesseract 4.0.0-beta.3-56-g5fda
>  leptonica-1.76.0
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
> libti

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
You should be able to use the new makefile after changing all the
directory locations to match your setup.

Change the language from frk to eng. However, the sample training text seems
to be non-English, in which case it is better to use the appropriate
language traineddata, e.g. tessdata_best/deu.traineddata for
German.

On Fri, Jun 29, 2018 at 9:03 PM Lorenzo Bolzani  wrote:

> Hi Shree, thanks for your answer.
>
> I tried the script setting:
>
> TESSDATA=extracted # here I have the eng.lstm and eng.traineddata
> LANGDATA=langdata-master # all langdata downloaded by OCR-D
>
> MODEL_NAME = eng
> CONTINUE_FROM = eng
>
>
> First I run the old Makefile to create the boxes.
>
> $ make training MODEL_NAME=eng
>
>
> I stop it as soon as the training starts:
>
> At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char
> train=100.827%, word train=100%, skip ratio=0%,  New worst char error =
> 100.827 wrote checkpoint.
>
>
> At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char
> train=100.662%, word train=100%, skip ratio=0%,  New worst char error =
> 100.662 wrote checkpoint.
>
> ^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
> Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
> make: *** [data/checkpoints/eng_checkpoint] Interrupt
>
> Notice that the data/checkpoints/eng_checkpoint file is deleted, I do not
> know if it is relevant or not.
>
>
> then I switch to the new one and I get this:
>
> $ make training
>
> mkdir -p data/checkpoints
> lstmtraining \
>   --continue_from   extracted/eng.lstm \
>   --old_traineddata extracted/eng.traineddata \
>   --traineddata data/eng/eng.traineddata \
>   --model_output data/checkpoints/eng \
>   --debug_interval -1 \
>   --train_listfile data/list.train \
>   --eval_listfile data/list.eval \
>   --sequential_training \
>   --max_iterations 3000
> Loaded file extracted/eng.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Code range changed from 111 to 76!
> Num (Extended) outputs,weights in Series:
>   1,36,0,1:1, 0
> Num (Extended) outputs,weights in Series:
>   C3,3:9, 0
>   Ft16:16, 160
> Total weights = 160
>   [C3,3Ft16]:16, 160
>   Mp3,3:16, 0
>   Lfys64:64, 20736
>   Lfx96:96, 61824
>   Lrx96:96, 74112
>   Lfx512:512, 1247232
>   Fc76:76, 0
> Total weights = 1404064
> Previous null char=110 mapped to 75
> Continuing from extracted/eng.lstm
> Loaded 1/1 pages (1-1) of document
> data/train/mueller_waldhornist_1821_0130_010.lstmf
> Loaded 1/1 pages (1-1) of document
> data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
> Loaded 1/1 pages (1-1) of document
> data/train/spyri_heidi_1880_0062_005.lstmf
> Loaded 1/1 pages (1-1) of document
> data/train/novalis_ofterdingen_1802_0210_001.lstmf
> Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
> Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs D
> t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
> File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
> make: *** [data/checkpoints/eng_checkpoint] Segmentation fault
>
>
> What am I doing wrong?
>
>
>
> Lorenzo
>
> 2018-06-29 14:08 GMT+02:00 Shree Devi Kumar :
>
>> I modified the makefile for ocrd-train to do fine-tuning.  It is pasted
>> below:
>>
>> export
>>
>> SHELL := /bin/bash
>> LOCAL := $(PWD)/usr
>> PATH := $(LOCAL)/bin:$(PATH)
>> HOME := /home/ubuntu
>> TESSDATA =  $(HOME)/tessdata_best
>> LANGDATA = $(HOME)/langdata
>>
>> # Name of the model to be built
>> MODEL_NAME = frk
>>
>> # Name of the model to continue from
>> CONTINUE_FROM = frk
>>
>> # Normalization Mode - see src/training/language_specific.sh for details
>> NORM_MODE = 2
>>
>> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
>> TESSDATA_REPO = _best
>>
>> # Train directory
>> TRAIN := data/train
>>
>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>
>> help:
>> @echo ""
>> @echo "  Targets"
>> @echo ""
>> @echo "unicharset   Create unicharset"
>> @echo "lists        Create lists of lstmf filenames for training and eval"
>> @echo "training Start 

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
> The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files.
> Now I can run your script directly.

Oh, I remember now. I had changed that for ease in renaming files for some
reason.

> In this way can I train a model that, for example, only recognizes
> uppercase characters, or numbers, simply by providing only uppercase
> training data? Or is there something else to configure?

You could try fine-tuning from English. Remove the line that merges the
unicharsets from my makefile (use the command from the original script). 300
iterations should be enough, as you are not adding any characters. Try to
use a training text that resembles the kind of words you expect to
OCR.
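
For the file naming difference mentioned above, one option is to rename the ground truth files instead of editing the makefile. A minimal sketch, assuming the files sit in one directory and end in ".gt.txt"; the helper name rename_gt_suffix is made up for this example and is not part of ocrd-train:

```sh
# Rename every "<name>.gt.txt" in a directory to "<name>-gt.txt".
# rename_gt_suffix is an illustrative helper, not part of ocrd-train.
rename_gt_suffix() {
    dir=$1
    for f in "$dir"/*.gt.txt; do
        [ -e "$f" ] || continue              # no match: the glob stays literal
        mv -- "$f" "${f%.gt.txt}-gt.txt"     # swap only the suffix
    done
}
```

For example, `rename_gt_suffix data/train` would turn data/train/page1.gt.txt into data/train/page1-gt.txt and leave the .tif files untouched.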

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUpE8TeQXqto-Ahb7Mm%3DR4C5qOavthm0Y30ZbnvdrWr6w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training
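
The "Failure bytes" in the log quoted below can be turned back into readable text. This is a rough sketch under one assumption: each token is a sign-extended byte, so only its last two hex digits matter (ffc2 ffa9 would then be the UTF-8 bytes C2 A9). The helper name decode_failure_bytes is made up for this example.

```sh
# Decode tokens such as "ffc2 ffa9 20" into the UTF-8 text they stand for.
# Assumption: only the last two hex digits of each token are significant.
decode_failure_bytes() {
    for tok in $1; do
        # Keep the low byte, e.g. "ffc2" -> "c2", "20" -> "20".
        byte=$(printf '%s' "$tok" | sed 's/.*\(..\)$/\1/')
        # Emit the byte through an octal escape, which POSIX printf accepts.
        printf "\\$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}

decode_failure_bytes "ffc2 ffa9 20 ffd8 ffa8"
```

For these first tokens of the log the result is the copyright sign, a space and the Arabic letter beh, which matches the characters around "2006©" in the transcription that could not be encoded.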

On Sat, Jun 30, 2018 at 3:23 PM john  wrote:

> Encoding of string failed! Failure bytes: ffc2 ffa9 20 ffd8
> ffa8 ffd8 ffa7 ffd8 ffae ffd8 ffaa ffd9
> ff86 ffd8 ffa7 20 ffd9 ff84 ffd8 ffa7 ffd8
> ffa4 ffd8 ffb3 20 ffdb ff8c ffd9 ff86 ffd8
> ffa7 ffd8 ffb1 ffdb ff8c ffd8 ffa7 20 ffd8
> ffa7 ffd8 ffa8 20 ffd8 ffaa ffd8 ffa8 ffd8
> ffab ffd9 ff87 20 ffd8 ffaf ffd8 ffa7 ffd9
> ff81 ffd8 ffaa ffd8 ffb3 ffd8 ffa7 20 ffd9
> ff86 ffdb ff8c ffd9 ff86 ffda ff86 ffd9
> ff85 ffd9 ff87 20 ffd9 ff82 ffd9 ff84 ffd8
> ffb7 ffd9 ff85
> Can't encode transcription: '۱۹ 2006© باختنا لاؤس یناریا اب تبثه دافتسا
> نینچمه قلطم' in language ''
> ^C
>
> When I fine-tune the network for the fas language, I see the error above.
> What is wrong with the training?
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
Then there must be a mismatch between the unicharset you are using and the
training text. For example, check whether the copyright symbol is in your
unicharset.
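
The unicharset check suggested above can be sketched as follows. It assumes the usual unicharset layout (line 1 holds the entry count, and on every later line the first space-separated field is the character itself) and a UTF-8 locale; the helper name missing_from_unicharset is made up for this example.

```sh
# List characters that occur in a training text but not in a unicharset.
# missing_from_unicharset is an illustrative helper; run it in a UTF-8
# locale so that grep treats each character as one unit.
missing_from_unicharset() {
    text_file=$1
    unicharset_file=$2
    symbols=$(mktemp)
    # Skip the count on line 1 and keep column 1 of every other line.
    tail -n +2 "$unicharset_file" | cut -d ' ' -f 1 > "$symbols"
    # One character per line, de-duplicated, skipping the plain space,
    # then keep only those without an exact match among the symbols.
    grep -o . "$text_file" | sort -u | grep -v '^ $' | grep -vxF -f "$symbols"
    rm -f "$symbols"
}
```

The unicharset to test against would be the one unpacked from the traineddata you continue from (the .lstm-unicharset file written by combine_tessdata -u). Any character printed by the helper is a candidate for the "Can't encode transcription" error.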

On Sat, Jun 30, 2018 at 4:48 PM john  wrote:

> I saw that link. This error occurred many times. How can I prevent it?


Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
Also check that there is no tab or other unprintable character in your
training text.

Which version of tesseract are you using? Show the output of

tesseract -v
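
The check for tabs and other unprintable characters mentioned above can be sketched with grep and the POSIX character class [[:cntrl:]]. The helper name find_control_chars is made up for this example. Note that zero-width and other invisible format characters are not control characters, so this would not flag them.

```sh
# Print "line-number:line" for every line that contains a tab or another
# control character. find_control_chars is an illustrative helper.
find_control_chars() {
    grep -n '[[:cntrl:]]' "$1"
}
```

Run against the training text, e.g. `find_control_chars fas.training_text` (file name assumed). No output means no control characters were found. Carriage returns from Windows line endings are reported too.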




Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-30 Thread Shree Devi Kumar
I have uploaded a new version of the traineddata file at
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata

Attached is the OCRed output for pages 13-24 of the dark PDF, produced with it.

I am still training a different variation.



On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar 
wrote:

> ok. I will take a look.
>
> On Wed, Jun 27, 2018 at 5:04 PM yajva  wrote:
>
>> Checked with both light & dark pdfs. The results are very good. Thanks.
>>
>> A few concerns. E is consistently missed in both. J is missed
>> consistently in darker image but recognized as T in dark image. ṝ is
>> recognized as ṛ consistently. Can these be addressed ?
>> I am using tesseract 4 alpha windows build from command line.
>>
>> Are the dev files in repos ?
>>
>>
>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>>
>>> I had used ghostview to convert PDF to tif or png.
>>>
>>> You can ocr PDF directly with gimagereader using the traineddata file I
>>> sent.
>>>
>>> See links for new windows binaries in msg below.
>>>
>>>
>>> At last, here are some fresh builds:
>>>
>>>
>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>
>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>
>>> I'd be also interested in testing of the tessdata manager, which should
>>> now also properly handle script tessdatas
>>>
>>> On Tue 26 Jun, 2018, 10:59 PM yajva,  wrote:
>>>
>>>> The doc is a different version of the same text. Here's the doc used for
>>>> the first png. This is slightly darker, but the one sent earlier is cleaner.
>>>> Let me know which is more amenable to OCR. I use PDF Shaper to extract
>>>> images and convert to png using xnview.
>>>>
>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>
>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>
>>>>> How did you create the test png from the pdf? I am not getting as good
>>>>> quality, tried various settings with irfanview.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva  wrote:
>>>>>
>>>>>> Sorry for the delay, my system was down.
>>>>>>
>>>>>> I am getting "Page not Found" for the link given. Can you please re-check?
>>>>>>
>>>>>> Here's the doc I am trying to OCR
>>>>>>
>>>>>>
>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>
>>>>>>> Please test with traineddata file from
>>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>>
>>>>>>> Need to check that it is not overfitted.
>>>>>>>
>>>>>>> Please share a couple more images which I can use for testing.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:
>>>>>>>
>>>>>>>> one more correction.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>>>
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>>>>
>>>>>>>>>> I am attaching the OCRed text. Please correct it so that  I can
>>>>>>>>>> use as groundtruth for further training and testing.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>>>> shree...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I had done a training for sanskrit for both devanagari and IAST
>>>>>>>>>>> but it does not include cedilla for Sh
>>>>

Re: [tesseract-ocr] parameter not found: tessedit_ocr_psm_mode

2018-07-01 Thread Shree Devi Kumar
What is the output of the following commands?

tesseract -v

which tesseract

which tesstrain.sh

On Sun, Jul 1, 2018 at 8:39 PM Zohreh Khosrobeygi 
wrote:

> Hi,
> When I use tesstrain.sh, I get this error, which is about
> my fas.config. My config file is:
>
> tessedit_ocr_engine_mode 1
> tessedit_ocr_psm_mode 6
>
> The error is:
>
> read_params_file: parameter not found: tessedit_ocr_psm_mode
> + [[ 0 -gt 0 ]]
> + export TESSDATA_PREFIX=
> + TESSDATA_PREFIX=
> + for img_file in '${img_files}'
> + check_file_readable /tmp/tmp.AjJgcthbHl/fas/fas.B_Nazanin.exp0.lstmf
> + for file in '$@'
> + [[ ! -r /tmp/tmp.AjJgcthbHl/fas/fas.B_Nazanin.exp0.lstmf ]]
> + err_exit '/tmp/tmp.AjJgcthbHl/fas/fas.B_Nazanin.exp0.lstmf does not
> exist or is not readable'
> + echo -e 'ERROR: /tmp/tmp.AjJgcthbHl/fas/fas.B_Nazanin.exp0.lstmf' does
> not exist or is not readable
> + tee -a /tmp/tmp.AjJgcthbHl/fas/tesstrain.log
> ERROR: /tmp/tmp.AjJgcthbHl/fas/fas.B_Nazanin.exp0.lstmf does not exist or
> is not readable
> + exit 1
>
> Could you please help me?
>
>
>
>




Re: [tesseract-ocr] parameter not found: tessedit_ocr_psm_mode

2018-07-01 Thread Shree Devi Kumar
The correct variable is

tessedit_pageseg_mode
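
With that name, the two-line config from the original post can be corrected mechanically. A minimal sketch; the helper name fix_psm_param is made up for this example, and it prints the corrected config to standard output rather than editing the file in place:

```sh
# Print a config file with the unknown parameter name replaced by
# tessedit_pageseg_mode. fix_psm_param is an illustrative helper.
fix_psm_param() {
    sed 's/^tessedit_ocr_psm_mode /tessedit_pageseg_mode /' "$1"
}
```

For example, `fix_psm_param fas.config > fas.config.new` keeps the tessedit_ocr_engine_mode line as it is and rewrites only the page segmentation line.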



Re: [tesseract-ocr] Train 2 language together

2018-07-01 Thread Shree Devi Kumar
The font being used does not support English.

On Sun, Jul 1, 2018 at 10:06 PM Zohreh Khosrobeygi 
wrote:

> Hi,
> I have been training the text:
>
> 272-135031- BECAUSE YOU WERE SLEEPING INSTEAD OWHILE POOR SHAGGY SITS
> THERE A COOING DOVE
> فیلم و و , منابع سال آگهی آخرين آخرین بود. ساخت و کنی
>
> That is, the text contains both Persian and English. But when the TIFF file
> is created, all the English text is removed. The TIFF file contains
> this:
>
> 272-135031-
> فیلم و و , منابع سال آگهی آخرين آخرین بود. ساخت و کنی
>
> But for Persian we need to train both languages together.
> How can I solve the problem? How can I train two languages together?
> Thanks a lot.
>




Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread Shree Devi Kumar
You can use find_fonts with your training_text to locate the fonts to use.

Modify the following command to match your directory setup and try

echo "## FIND FONTS ##"
# Find fonts which can render your training_text. Run `fc-cache -vf` to refresh cache.
# You can change the minimum coverage % as needed.
# This process can take a while if you have a number of installed fonts.
# Review the generated fontlist and modify, if needed.
# 2000 fonts found. Use a smaller set

nice text2image --find_fonts \
--fonts_dir $fonts_dir \
--text $langdata_dir/$Lang/$Lang.training_text \
--min_coverage 0.999  \
--render_per_font=false \
--outputbase $langdata_dir/$Lang/$Lang \
|& grep raw \
 | sed -e 's/ :.*/@ \\/g' \
 | sed -e "s/^/ '/" \
 | sed -e "s/@/'/g" > $langdata_dir/$Lang/$Lang.fontslist.txt
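
To see what the grep/sed tail of that command does, it can be run on a single line of text. This is only a sketch: the sample line imitates the general shape of the text2image output (font name, then " : ", then coverage figures containing "raw") and is not copied from a real run, and the function name format_fontlist is made up for this example.

```sh
# Same filter chain as above, reading from standard input:
# keep lines containing "raw", cut everything from " :" onward,
# and wrap the remaining font name in single quotes followed by " \".
format_fontlist() {
    grep raw \
      | sed -e 's/ :.*/@ \\/g' \
      | sed -e "s/^/ '/" \
      | sed -e "s/@/'/g"
}

# Hypothetical input line, for illustration only.
printf '%s\n' 'Example Font : 100 hits = 100.00%, raw = 100 = 100.00%' | format_fontlist
```

Each matching line becomes a quoted font name with a trailing backslash, so the resulting fontslist.txt can be pasted into a shell list of fonts. The `|&` in the command above is a bash shorthand for piping standard error along with standard output.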

On Mon, Jul 2, 2018 at 12:06 PM ran go  wrote:

> In my opinion the error depends on the font: for some fonts there is no
> error, but for others there is.
>
> On Mon, Jul 2, 2018 at 9:15 AM, john  wrote:
>
>> I use tesseract 4.0.0-beta.1, downloaded from this link (UB Mannheim)
>> <https://github.com/UB-Mannheim/tesseract/tree/v4.0.0-beta.1.20180414>
>>

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread Shree Devi Kumar
also see https://github.com/tesseract-ocr/tesseract/issues/549



On Mon, Jul 2, 2018 at 7:45 PM Shree Devi Kumar 
wrote:

> You can use find_fonts with your training_text to locate the fonts to use.
>
> Modify the following command to match your directory setup and try
>
> echo "## FIND FONTS ##"
> # Find fonts which can render your training_text. Run `fc-cache -vf` to
> refresh cache.
> # You can change the minimum coverage % as needed.
> # This process can take a while if you have a number of installed fonts.
> # Review the generated fontlist and modify, if needed.
> # 2000 fonts found. Use a smaller set
>
> nice text2image --find_fonts \
> --fonts_dir $fonts_dir \
> --text $langdata_dir/$Lang/$Lang.training_text \
> --min_coverage 0.999  \
> --render_per_font=false \
> --outputbase $langdata_dir/$Lang/$Lang \
> |& grep raw \
>  | sed -e 's/ :.*/@ \\/g' \
>  | sed -e "s/^/ '/" \
>  | sed -e "s/@/'/g" > $langdata_dir/$Lang/$Lang.fontslist.txt
>
> On Mon, Jul 2, 2018 at 12:06 PM ran go  wrote:
>
>> in my opinion error is for font-type, for some font there is no error but
>> for some other fonts there is error
>>
>> On Mon, Jul 2, 2018 at 9:15 AM, john  wrote:
>>
>>> I use tesseract 4.0.0-beta.1. downloaded from this link (UB mannheim)
>>> <https://github.com/UB-Mannheim/tesseract/tree/v4.0.0-beta.1.20180414>
>>>
>>> On Saturday, June 30, 2018 at 7:13:30 PM UTC+4:30, shree wrote:
>>>>
>>>> Also check that there is no tab or other unprintable character in your
>>>> training text.
>>>>
>>>> Which version of tesseract are you using? Show the output of
>>>>
>>>> tesseract -v
>>>>
>>>>
>>>> On Sat, Jun 30, 2018 at 8:04 PM Shree Devi Kumar 
>>>> wrote:
>>>>
>>>>> Then there must be a mismatch between the unicharset you are using and
>>>>> the training text. eg. check whether the copyright symbol is in your
>>>>> unicharset.
>>>>>
>>>>> On Sat, Jun 30, 2018 at 4:48 PM john  wrote:
>>>>>
>>>>>> I saw that link. This error occurred many times; how can I prevent it?
>>>>>>
>>>>>> On Saturday, June 30, 2018 at 3:17:26 PM UTC+4:30, shree wrote:
>>>>>>>
>>>>>>> see
>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 3:23 PM john  wrote:
>>>>>>>
>>>>>>>> Encoding of string failed! Failure bytes: ffc2 ffa9 20
>>>>>>>> ffd8 ffa8 ffd8 ffa7 ffd8 ffae ffd8 ffaa
>>>>>>>> ffd9 ff86 ffd8 ffa7 20 ffd9 ff84 ffd8 
>>>>>>>> ffa7
>>>>>>>> ffd8 ffa4 ffd8 ffb3 20 ffdb ff8c ffd9 
>>>>>>>> ff86
>>>>>>>> ffd8 ffa7 ffd8 ffb1 ffdb ff8c ffd8 
>>>>>>>> ffa7 20
>>>>>>>> ffd8 ffa7 ffd8 ffa8 20 ffd8 ffaa ffd8 
>>>>>>>> ffa8
>>>>>>>> ffd8 ffab ffd9 ff87 20 ffd8 ffaf ffd8 
>>>>>>>> ffa7
>>>>>>>> ffd9 ff81 ffd8 ffaa ffd8 ffb3 ffd8 
>>>>>>>> ffa7 20
>>>>>>>> ffd9 ff86 ffdb ff8c ffd9 ff86 ffda ff86
>>>>>>>> ffd9 ff85 ffd9 ff87 20 ffd9 ff82 ffd9 
>>>>>>>> ff84
>>>>>>>> ffd8 ffb7 ffd9 ff85
>>>>>>>> Can't encode transcription: '۱۹ 2006© باختنا لاؤس یناریا اب تبثه
>>>>>>>> دافتسا نینچمه قلطم' in language ''
>>>>>>>> ^C
>>>>>>>>
>>>>>>>> When I fine-tune the network for the fas language I see the error above.
>>>>>>>> What is wrong with the training?
>>>>>>>>

Re: [tesseract-ocr] A friendly suggestion for the "tesseract-ocr" group members (Concern to all members)

2018-07-03 Thread Shree Devi Kumar
I have added a wiki page at
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-different-versions
and updated for 3.04 and 4.0alpha.

You can update for older versions.

On Tue, Jul 3, 2018 at 1:30 AM  wrote:

> With all the languages and revisions, people (including me) tend to search
> the group a lot for answers.
> So I have a suggestion:
> Could the group administrator pin a message with a spreadsheet that shows the
> state of each revision for each language? That way everything would be
> organized in a single table, and people could update it from time to time.
>
> for example:
>
> *TrainingTesseract status*
>
>              Revisions
> Language   | 3.00 | 3.01 | 3.02 | 3.03 | 3.04   | 3.05 | 4.0
> -----------+------+------+------+------+--------+------+--------
> English    |      |      |      |      | worked |      | worked
> Hebrew     |      |      |      |      |        |      |
> Hindi      |      |      |      |      |        |      |
> Arabic     |      |      |      |      |        |      |
> German     |      |      |      |      |        |      |
> Chinese    |      |      |      |      |        |      |
> Russian    |      |      |      |      |        |      |
> Vietnamese |      |      |      |      |        |      |
> Polish     |      |      |      |      |        |      |
> ...        |      |      |      |      |        |      |
>
> *Please let me know if this is a hassle; if so, I'll do my best to assist with
> this chore.*
>
> Thank you all,
> *Gil*
>


Re: [tesseract-ocr] how to improve dot-matrix digits recognize accuracy

2018-07-06 Thread Shree Devi Kumar
You could try finetuning for the dotmatrix font.

On Fri, Jul 6, 2018 at 3:43 PM Wenjie Chen  wrote:

> Hi folks,
>
> Below is the dot-matrix digits picture; *tesseract* recognizes it
> incorrectly without any pre-processing.
>
> 
>
> After erode processing via OpenCV, the digit 1 is recognized correctly, but
> the digit 0 still fails.
>
> 
>
> For the special 0 in this font, do you have any suggestion for how to make
> *tesseract* recognize it as the digit 0?
>
> Should I apply more image processing to the image, or should I train
> *tesseract* on the special 0?
>
>


Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Shree Devi Kumar
try --psm 6
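
For reference, in `tesseract --help-psm` mode 1 is automatic page segmentation with OSD, while mode 6 assumes a single uniform block of text, which may suit a plain column of numbers better. A minimal sketch of the same invocation as in the quoted report with only the page segmentation mode changed; `ocr_block` is a hypothetical wrapper name:

```shell
#!/bin/sh
# Hypothetical wrapper: the command from the report, but with
# --psm 6 (assume a single uniform block of text) instead of --psm 1.
ocr_block() {
    image=$1        # e.g. test_image2.png
    outputbase=$2   # tesseract writes outputbase.txt
    tesseract "$image" "$outputbase" --oem 1 --psm 6
}

# Usage (requires tesseract on PATH):
#   ocr_block test_image2.png outputbase && cat outputbase.txt
```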

On Fri, Jul 6, 2018 at 2:23 PM Alberto Andreotti 
wrote:

> Hello,
>
> I'm having problems with the simplest image possible.
> It's a screenshot from GEdit(Ubuntu's text editor), with numbers and
> points. This is what I get,
>
> 23.78
> 15
> 1.6
> 17.6
> 25
> 225
> 2235
> 0.5
>
> Alberto
>
> version: tesseract 4.0.0-beta.1-285-g8d3f
> run from command line like this, tesseract test_image2.png  outputbase
> --oem 1 --psm 1
>


Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
See the following comment by Ray regarding the building of training data:

https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
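
The manual procedure James describes in the quoted message below (split on spaces, sort, drop duplicates, randomize, rewrap at about 100 characters) can be sketched with standard tools. This is only an illustration under those assumptions: the function names are made up, it does not add the punctuation distribution he mentions, and `fold` counts bytes in some implementations, so lines of non-ASCII text may come out shorter than 100 characters.

```shell
#!/bin/sh
# One word per line, duplicates dropped.
unique_words() {
    tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# Shuffle lines without GNU shuf: tag each line with a random key,
# sort by the key, then strip the key again.
shuffle_lines() {
    awk 'BEGIN { srand() } { printf "%.8f\t%s\n", rand(), $0 }' |
        sort -k1,1n | cut -f2-
}

# Join the shuffled words with spaces and rewrap at about 100 columns.
make_training_text() {
    unique_words | shuffle_lines | paste -s -d ' ' - | fold -s -w 100
}

# Usage: make_training_text < massive_text.txt > lang.training_text
```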

On Fri 6 Jul, 2018, 10:38 PM James Q,  wrote:

> No tool I can think of. What I would do is edit the file in a large text
> file editor (such as EmEditor) to remove duplicate words. You could do this
> by replacing all spaces with newlines, then sorting and removing duplicates.
> After that you can randomize the unique list of words, add an appropriate
> distribution of punctuation characters and re-edit to create a block of
> text wrapped at say 100 characters. There are online tools to do the
> randomizing and wrapping.
>
> Having said this I don't know how valuable it is to have training text
> containing specific words. I have been struggling myself to train on
> specific word lists without much success. I think training text is just
> about a representative distribution of characters. Please let me know if
> you have any insights on the wordlists in langdata as I'm a bit hazy there.
>
> Thanks
> James
>
>
>
> On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>>
>> Hello guys.
>>
>>
>> I want to add a new language script to Tesseract OCR and am researching
>> training data.
>>
>>
>> I would like to know the following things.
>>
>>    1. Is there any automatic tool that makes the langdata training_text and
>>    wordlist files from a massive text?
>>    2. Is there any documentation about preparing the text data, with an
>>    explanation of the text data files? I just looked at the directory
>>    langdata/jpn/ and there are some files, but I have no idea what these
>>    files are or how to create files like them. What rules should I use to
>>    create langdata files?
>>


Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
Also see a community contributed perl script for generating langdata in
https://github.com/tesseract-ocr/tesseract/tree/master/contrib



Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-11 Thread Shree Devi Kumar
What about ocr with

eng+iast



On Wed 11 Jul, 2018, 7:44 PM yajva,  wrote:

> shree
> namaste
>
> I am trying to OCR the attached image, but the results are not so good, even
> for text which is apparently clear. E.g. in the first line, B is recognized
> as H; the under-dot for 't' in 'most' (4th line), etc. The image has warping,
> but best/Latin and Google OCR still produce better results. Is it possible to
> add diacritics to Latin? Can you help in any way?
>
> regards
> Venkatesh
>
>
> On Monday, July 2, 2018 at 2:05:47 PM UTC+5:30, yajva wrote:
>>
>> Many thanks. Downloaded and using.
>> Will wait for next ver.
>>
>>
>> On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>>>
>>> I have uploaded a new version of traineddata file at
>>>
>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>>>
>>> Attached is the OCRed output for pages 13-24 of dark pdf with it.
>>>
>>> I am still training a different variation.
>>>
>>>
>>>
>>> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar 
>>> wrote:
>>>
>>>> ok. I will take a look.
>>>>
>>>> On Wed, Jun 27, 2018 at 5:04 PM yajva  wrote:
>>>>
>>>>> Checked with both light & dark pdfs. The results are very good. Thanks.
>>>>>
>>>>> A few concerns. E is consistently missed in both. J is missed
>>>>> consistently in darker image but recognized as T in dark image. ṝ is
>>>>> recognized as ṛ consistently. Can these be addressed ?
>>>>> I am using tesseract 4 alpha windows build from command line.
>>>>>
>>>>> Are the dev files in repos ?
>>>>>
>>>>>
>>>>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>>>>>
>>>>>> I had used ghostview to convert PDF to tif or png.
>>>>>>
>>>>>> You can ocr PDF directly with gimagereader using the traineddata file
>>>>>> I sent.
>>>>>>
>>>>>> See links for new windows binaries in msg below.
>>>>>>
>>>>>>
>>>>>> At last, here are some fresh builds:
>>>>>>
>>>>>>
>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>>>>
>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>>>>
>>>>>> I'd be also interested in testing of the tessdata manager, which
>>>>>> should now also properly handle script tessdatas
>>>>>>
>>>>>> On Tue 26 Jun, 2018, 10:59 PM yajva,  wrote:
>>>>>>
>>>>>>> The doc is a different version of the same text. Here's the doc used
>>>>>>> for the first png. This is slightly darker, but the one sent earlier
>>>>>>> is cleaner. Let me know which is more amenable to OCRing. I use PDF
>>>>>>> Shaper to extract images and convert them to png using xnview.
>>>>>>>
>>>>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>>>>
>>>>>>>> How did you create the test png from the pdf? I am not getting as
>>>>>>>> good quality, tried various settings with irfanview.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva  wrote:
>>>>>>>>
>>>>>>>>> Sorry for the delay, my system was down.
>>>>>>>>>
>>>>>>>>> I am getting "Page not Found" for the link given. Can you pl
>>>>>>>>> re-check?
>>>>>>>>>
>>>>>>>>> Here's the doc I am trying to OCR
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>>>>
>>>>>>>>>> Please test with traineddata file from
>>>>>>>>>> https://github.co

Re: [tesseract-ocr] why tesseract gives junk value for japanese language?

2018-07-12 Thread Shree Devi Kumar
Try traineddata from tessdata_best and tessdata_fast

On Thu 12 Jul, 2018, 6:45 PM mahendrag gajera, 
wrote:

> Hello all
>
> I am trying to OCR Japanese images with the code below, but it gives junk
> characters. My tesseract version is 4.0.
>
> Please let me know what is missing here.
>
> #include <cstdio>
> #include <cstdlib>
> #include <tesseract/baseapi.h>
> #include <leptonica/allheaders.h>
>
> void Test(char* imagePath)
> {
> char *outText;
>
> tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
> // Initialize tesseract-ocr with Japanese, using an explicit tessdata path
> if (api->Init("D:\\tessdata", "jpn",
> tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
> {
> fprintf(stderr, "Could not initialize tesseract.\n");
> exit(1);
> }
>
> // Open input image with leptonica library
> Pix *image = pixRead(imagePath);
> api->SetImage(image);
> // Get OCR result (the returned text is UTF-8 encoded)
> outText = api->GetUTF8Text();
> printf("OCR output:\n%s", outText);
>
> // Destroy used objects and release memory
> api->End();
> delete api;
> delete[] outText;
> pixDestroy(&image);
> }
>
> Using train data from here
>
> https://github.com/tesseract-ocr/tessdata
>
> Test data image
>
>
> 
>
> Thanks,
>


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-12 Thread Shree Devi Kumar
Thank you for your feedback of eng+

I will try training for this and get back.


On Thu, Jul 12, 2018 at 2:18 PM yajva  wrote:

> eng+iast-plus-3600 => no diacritics at all
> Latin+iast-plus-3600 => only macrons none other
>
>
>

Re: [tesseract-ocr] How to use tesseract 4 engineMode 2 ( Legacy + LSTM engines)?

2018-07-12 Thread Shree Devi Kumar
The traineddata files can hold both types of models. The OCR Engine mode
chooses which ones get used.

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#format-of-traineddata-files
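
As a hedged illustration: `combine_tessdata -d` lists the components stored in a traineddata file, which shows whether it carries legacy components, an LSTM model, or both. The files in the `tessdata` repository bundle both, while `tessdata_best` and `tessdata_fast` are LSTM-only. The wrapper names and the default language below are assumptions for the sketch.

```shell
#!/bin/sh
# List the components inside a traineddata file (legacy and/or LSTM).
list_components() {
    combine_tessdata -d "$1"
}

# Run both engines together; this needs a traineddata file that
# contains legacy components as well as an LSTM model.
ocr_combined() {
    tesseract "$1" "$2" --oem 2 -l "${3:-eng}"
}

# Usage (requires the tesseract tools on PATH):
#   list_components /path/to/tessdata/eng.traineddata
#   ocr_combined page.png out eng
```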

On Fri, Jul 13, 2018 at 9:31 AM 于洋  wrote:

> Tesseract 4 introduced a new LSTM engine. The LSTM engine needs LSTM trained
> data, and the legacy engine needs the old trained data. The two types of
> trained data are incompatible with each other.
>
> When I set the OCR engine mode to 2, it uses both the legacy and LSTM
> engines. But how can I provide both types (LSTM and legacy) of trained data
> to tesseract?
>
> OCR Engine modes:
>   0    Legacy engine only.
>   1    Neural nets LSTM engine only.
>   2    Legacy + LSTM engines.
>   3    Default, based on what is available.
>


Re: [tesseract-ocr] using tesseract4 works fine but with oem 0 "couldn't load any languages"

2018-07-14 Thread Shree Devi Kumar
See
https://github.com/tesseract-ocr/tesseract/blob/master/unittest/osd_test.cc
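
From the command line, `--psm 0` runs orientation and script detection only and needs `osd.traineddata` in the tessdata directory. Below is a small sketch for pulling the suggested rotation out of that report. It assumes the report contains a line of the form `Rotate: 90`, which may differ between builds, and `rotation_needed` is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the suggested rotation from an OSD report read on stdin.
# Assumption: the report contains a line of the form "Rotate: 90".
rotation_needed() {
    awk -F': *' '$1 == "Rotate" { print $2; exit }'
}

# Usage (requires tesseract and osd.traineddata):
#   tesseract page.png stdout --psm 0 | rotation_needed
```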

On Sat 14 Jul, 2018, 8:23 PM simon mackenzie,  wrote:

> I am using tesseract4 and everything works fine with English. However,
> tesseract4 cannot detect page orientation, so I want to use tesseract3 for
> this.
>
> I thought I just had to run tesseract --oem 0, but now it says
> "couldn't load any languages".
>
> Is there a way to use tesseract3 while tesseract4 is installed? If not,
> is there another way to detect page orientation?
>


Re: [tesseract-ocr] "lstmtraining" stopped but not finished?

2018-07-15 Thread Shree Devi Kumar
Did you figure out what was causing this?


On Thu, Jul 12, 2018 at 8:15 AM Dd U  wrote:

> Hello guys please help me.
>
>
> I'm trying to train to improve the Japanese language model, and I have run
> into a problem.
>
>
> lstmtraining has stopped but has not finished. It is no longer using any CPU,
> and nothing has happened for a few hours (see the attached screenshot).
>
>
> What should I do? Wait or kill it?
>
>
> Thank you.
>
>
>
>
>
> 
>
>
> 
>


Re: [tesseract-ocr] multiple problem with fine tuning

2018-07-16 Thread Shree Devi Kumar
> first of all in some words in tiff files the characters are not joined.

Make sure to include ZWNJ and ZWJ in your unicharset.

>  box file generated is from left to right but it should be RTL

According to Ray that is intentional.

>  is using lstmtraining.exe the next and final step

Yes. tesstrain.sh process only creates a 'starter traineddata' (unlike for
tesseract3).
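
A quick way to check the first point is to search the unicharset for the two joiner characters. This is a minimal sketch that only tests whether ZWNJ (U+200C) and ZWJ (U+200D) occur anywhere in the file; `has_joiners` is a hypothetical helper name and the file name in the usage line is an assumption.

```shell
#!/bin/sh
# Succeed only if the file contains both ZWNJ (U+200C) and ZWJ (U+200D).
# The characters are built from their UTF-8 bytes so the script itself
# stays plain ASCII.
has_joiners() {
    zwnj=$(printf '\342\200\214')
    zwj=$(printf '\342\200\215')
    grep -q "$zwnj" "$1" && grep -q "$zwj" "$1"
}

# Usage: has_joiners fas.unicharset && echo "ZWNJ and ZWJ present"
```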

On Mon, Jul 16, 2018 at 2:12 PM Hosein Khoshdel 
wrote:

> Hi. Before asking my question I want to thank shree, whose comments are very
> helpful both here and in the GitHub repo of tesseract.
>
> I want to fine-tune fas.traineddata to support some new fonts. The first
> problem arises when I use the following command:
>
> tesstrain.sh --fonts_dir /c/folder/fonts/ --lang fas \
>   --noextract_font_properties --linedata_only --exposures "0" \
>   --langdata_dir ../langdata --tessdata_dir ../tessdata \
>   --fontlist "b nazanin" --output_dir ../../tessdata/fas/
>
> I put fas.traineddata, which I downloaded from the tessdata_best repo, in the
> ../tessdata folder, but it gives an error saying that it cannot find
> eng.traineddata. The problem goes away when I put eng.traineddata in
> ../tessdata, but why should it want eng when I specify that the lang is fas?
>
> Anyway, for now I copied in eng.traineddata and moved on. The second problem
> is with the tiff/box pairs generated by the above command. First of all, in
> some words in the tiff files the characters are not joined. For example,
> there is:
>
>
> but it should be
>
> Another problem is that the box file generated is from left to right, but
> it should be RTL. This problem is addressed here
>  but I did not
> understand whether there is a solution for it or not.
>
> Lastly, I am confused by the fine-tuning process. Is tesstrain.sh only
> for generating tiff/box pairs? What are the next steps? Is using
> lstmtraining.exe the next and final step?
>
> btw i'm using:
>
> tesseract 4.0.0-beta.3
>  leptonica-1.76.0 (Jul 10 2018, 21:36:38) [MSC v.1900 LIB Debug x64]
>   libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
> : libwebp 0.6.1 : libopenjp2 2.3.0
>  Found AVX
>  Found SSE
>
> which I built with VS2015; I'm also using Windows 8.1.
>


Re: [tesseract-ocr] The exposure option for tesstrain.sh

2018-07-16 Thread Shree Devi Kumar
Simply put, the exposure setting is similar to a scanner's exposure setting:
negative values (-1, -2) make the text lighter, and positive values (1, 2, 3,
etc.) make the text darker and thicker.



On Tue 17 Jul, 2018, 6:37 AM 'John Lee Ward' via tesseract-ocr, <
tesseract-ocr@googlegroups.com> wrote:

> Does anyone know of a document or can someone explain what the exposure
> option is all about when running tesstrain.sh ?
>
>



Re: [tesseract-ocr] Questions about training korean language in tesseract 4.0

2018-07-19 Thread Shree Devi Kumar
Run tesstrain.sh with the Korean training text. The box files it generates
show the expected format.



On Thu, Jul 19, 2018 at 12:06 PM Soumik Ranjan Dasgupta <
srd1...@cse.jgec.ac.in> wrote:

> 2) For checking the fonts used in generating the traineddata for your
> language, you can see training/language-specific.sh and
> langdata/font_properties under your respective language code.
>
> If I'm not wrong, the language code for korean is "kor".
>
> Check out langdata/kor directory.
>
> On Thu, Jul 19, 2018, 11:59 AM nampyo hong  wrote:
>
>> Hello,
>>
>> I have two questions about training tesseract 4.0
>>
>> 1.
>> In case of English, I can find box file and how to training
>> such as
>> T 112 4663 140 4696 0
>> e 140 4662 160 4686 0
>> s 163 4662 179 4686 0
>> s 182 4661 198 4686 0
>> e 200 4661 220 4685 0
>> r 221 4662 238 4685 0
>> a 239 4661 260 4685 0
>> c 261 4661 281 4685 0
>> t 281 4661 296 4691 0
>>
>> But for Korean, I cannot find a training example, and I'm unsure whether to
>> label by consonant and vowel or by "one" letter:
>>
>> 1) 가 10 20 30 40 0
>>
>> 2) ㄱ 10 20 30 40 0
>> ㅏ 10 20 30 40 0
>>
>> Which is the right way?
>>
>> 2. Is there a way to find out which fonts were already trained from a
>> .traineddata file?
>>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] What is the purpose of trained data files present under tessdata/script folder

2018-07-19 Thread Shree Devi Kumar
Files in tessdata are for a particular language, e.g. Hindi, Sanskrit,
Marathi, Nepali.

Files in tessdata/script are for a particular script used for writing those
languages, e.g. Devanagari.

Also note that most script files also include support for English.

So, if you have a document with Hindi+English+Sanskrit, you can use
Devanagari.traineddata.

In some cases you may find it works better than the language-specific data.

On Thu, Jul 19, 2018 at 11:59 PM Vikas Goel 
wrote:

> After installing tesseract, there are trained data files present under
> "C:\Program Files (x86)\Tesseract-OCR\tessdata" as well as "C:\Program
> Files (x86)\Tesseract-OCR\tessdata\script". As per my understanding,
> tesseract engine uses the files present under "C:\Program Files
> (x86)\Tesseract-OCR\tessdata". Please confirm and let me know the purpose
> of trained data files present under "C:\Program Files
> (x86)\Tesseract-OCR\tessdata\script"
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] How to train by tesseract 4.00

2018-07-20 Thread Shree Devi Kumar
Please ask at https://github.com/OCR-D/ocrd-train/issues

for ocr-d related questions.

On Fri, Jul 20, 2018 at 11:36 AM Emiliano Isaza Villamizar 
wrote:

> Hi Shree,
>
> I've been trying to use this repo but I keep getting this error when I run
> any target with OCR-D.
>
> On Sunday, June 3, 2018 at 7:46:11 AM UTC-5, shree wrote:
>>
>> If you want to train using fonts, use tesstrain.sh. See the wiki pages
>> regarding training:
>>
>
>  make training
> combine_tessdata -u
> /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata
> /foo.traineddata
> /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.
> Failed to read
> /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata
> Makefile:97: recipe for target 'data/unicharset' failed
> make: *** [data/unicharset] Error 1
>
> thank you!
>
>
>
>>
>>
>
>> If you want to use scanned images, then see
>> https://github.com/OCR-D/ocrd-train for using line images and their
>> ground truth transcriptions to create box files, lstmf files and training.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sun, Jun 3, 2018 at 3:59 PM,  wrote:
>>
>>> I have read that on the version of 4.00, the box file can be used  only
>>> need to cover a textline instead of individual characters.
>>>
>>> So I make a box file like this
>>>
>>> 若存在,试求出实数λ的值; 0 0 256 48 0
>>>
>>> Then I want to ask how to train it.
>>>
>>> Or is it the same version 3?   【tesseract chi_my.font.exp0.tif
>>> chi_my.font.exp0 nobatch box.train】
>>>
>>> or there is other better method.
>>>
>>> Thanks!
>>>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-21 Thread Shree Devi Kumar
--linedata_only\

You need a space before the continuation mark \
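
To see why the space matters, here is a minimal shell sketch (not part of tesstrain.sh; count_args is a hypothetical helper used only for illustration). The shell deletes a backslash-newline pair outright, so without a space the two options fuse into one word, which matches the error reported below:

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() {
  echo "$#"
}

# No space before the backslash: the line break is simply removed, so the
# shell sees the single word --linedata_only--noextract_font_properties.
joined=$(count_args --linedata_only\
--noextract_font_properties)

# Space before the backslash: the options stay separate words.
separate=$(count_args --linedata_only \
--noextract_font_properties)

echo "without space: $joined argument(s)"
echo "with space: $separate argument(s)"
```

The same applies to every continued line in the command, including the --langdata_dir and --output_dir lines.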

On Sat 21 Jul, 2018, 10:00 PM ,  wrote:

> can u please point out the place where to put the space
>
> thank you
>
> On Saturday, July 21, 2018 at 12:12:22 PM UTC-4, thiyam...@gmail.com
> wrote:
>>
>> My command is
>>
>>
>> usr/share/tesseract-ocr/./tesstrain.sh \
>>
>> --fonts_dir /usr/share/fonts \
>>
>> --lang ben \
>>
>> --linedata_only\
>>
>> --noextract_font_properties \
>>
>> --langdata_dir /home/jennil/Desktop/pro/langdata-master/ben\
>>
>> --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata –output_dir
>> /home/jennil/Desktop/pro/output/ben_output\
>>
>> --fontlist “Lohit Bengali”
>>
>>
>>
>> and here is the error
>>
>>
>>
>> *ERROR: Unrecognized argument --linedata_only--noextract_font_properties*
>>



Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Shree Devi Kumar
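--output_dir needs two plain ASCII dashes. Your command has a hyphen followed
by an en dash (-–), which only looks like two dashes.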
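
A quick way to catch look-alike characters before running a long command is to count the non-ASCII bytes in each option. This is a minimal sketch assuming a POSIX shell and UTF-8 text; non_ascii_bytes is a hypothetical helper name:

```shell
# Hypothetical helper: counts the bytes in its argument that fall outside
# the 7-bit ASCII range. Typographic dashes and curly quotes are multi-byte
# in UTF-8, so any count above zero is worth a second look.
non_ascii_bytes() {
  printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

good=$(non_ascii_bytes '--output_dir')   # two ASCII hyphens
bad=$(non_ascii_bytes '-–output_dir')    # hyphen followed by an en dash

echo "ascii dashes: $good non-ASCII byte(s)"
echo "en dash: $bad non-ASCII byte(s)"
```

The same check flags the curly quotes around the font name, which the shell does not treat as quoting characters.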

On Sun, Jul 22, 2018 at 12:29 PM  wrote:

> hello again, i modified the error in the way you said and there is no
> error. but now the same error of unrecognised is occured in output_dir.
> the error is
> ERROR: Unrecognized argument -–output_dir
>
> my command is
>
> /usr/share/tesseract-ocr/./tesstrain.sh \
>
> --fonts_dir /usr/share/fonts \
>
> --lang ben \
>
> --linedata_only \
>
> --noextract_font_properties \
>
> --langdata_dir /home/jennil/Desktop/pro/langdata-master/ben \
>
> --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata \
>
> -–output_dir /home/jennil/Desktop/pro/output/ben_output \
>
> --fontlist “Lohit Bengali”
>
>
> please do help
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Fonts
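
As a hedged sketch of how to check which font names are usable: text2image can list the fonts it finds in a directory, and the name passed to --fontlist should match one of those entries, written with plain ASCII quotes. The directory below is the one shown in the quoted log. The guard and the skip message are illustrative additions:

```shell
fonts_dir=/usr/share/fonts/truetype   # the directory shown in the quoted log

if command -v text2image >/dev/null 2>&1; then
  # Print the font names exactly as text2image (via Pango) sees them.
  text2image --list_available_fonts --fonts_dir="$fonts_dir"
else
  echo "skipped: text2image not found"
fi
```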

On Sun 22 Jul, 2018, 8:20 PM Jennil Thiyam,  wrote:

> you guys help me...now there is no error, but i don't know about the
> fonts, i try to train the bengali in "lohit-bengali" font thinking its
> already in the FONTS folder, but i got
>
> === Starting training for language 'ben'
> [Sun Jul 22 10:48:33 EDT 2018] /usr/bin/text2image
> --fonts_dir=/usr/share/fonts/truetype --font=“lohit-bengali”
> --outputbase=/tmp/font_tmp.z6y7AIvqyI/sample_text.txt
> --text=/tmp/font_tmp.z6y7AIvqyI/sample_text.txt
> --fontconfig_tmpdir=/tmp/font_tmp.z6y7AIvqyI
> Could not find font named “lohit-bengali”.
> Pango suggested font FreeMono.
> Please correct --font arg.
>
> === Phase I: Generating training images ===
> Rendering using “lohit-bengali”
> [Sun Jul 22 10:48:34 EDT 2018] /usr/bin/text2image
> --fontconfig_tmpdir=/tmp/font_tmp.z6y7AIvqyI
> --fonts_dir=/usr/share/fonts/truetype --strip_unrenderable_words
> --leading=32 --char_spacing=0.0 --exposure=0
> --outputbase=/tmp/tmp.pBWa4wRHmt/ben/ben.“lohit-bengali”.exp0 --max_pages=3
> --font=“lohit-bengali”
> --text=/home/jennil/Desktop/pro/langdata-master/ben/ben.training_text
> Could not find font named “lohit-bengali”.
> Pango suggested font FreeMono.
> Please correct --font arg.
> ERROR: /tmp/tmp.pBWa4wRHmt/ben/ben.“lohit-bengali”.exp0.box does not exist
> or is not readable
> ERROR: /tmp/tmp.pBWa4wRHmt/ben/ben.“lohit-bengali”.exp0.box does not exist
> or is not readable
>
> So, please tell me: are all the fonts in this FONTS folder
> already installed for tesseract or not?
>
>
> On Sun, Jul 22, 2018 at 7:15 AM, Jennil Thiyam 
> wrote:
>
>> Oh sorry for the mistake...I put two dashes, still it says unrecognised..
>>

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-23 Thread Shree Devi Kumar
Which tessdata repository are you using for your trained data files?

tessdata
tessdata_best
tessdata_fast



On Tue 24 Jul, 2018, 9:01 AM Atsuyoshi Suzuki, 
wrote:

> Hi.
>
> I tried new tesseract and  traineddata for Japanese (both jpn.traineddata
> and Japanese.traineddata).
>
> It's very good recognition result with jpn.traineddata.
>
> Japanese.traineddata provide good result  but unnecessary space is
> inserted in words or characters.
>
>
>
> Is this behavior expected? In Japanese, there is no space between each
> words.
>
> If this behavior is expected, what kind of usage is assumed for
> Japanese.traineddata?
>
>
>
> jpn.traineddata (very good, and I expected):
>
> --- start ---
> $ tesseract -l jpn  test_jpn_04.jpg stdout
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 168
> OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが
> できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。
>
> --- end ---
>
>
> Japanese.traineddata:
>
> --- start ---
> $ tesseract -l Japanese  test_jpn_04.jpg stdout
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 168
> OCR 機能 を 提供 する Web API は いく つか 存在 し ます が 、 用 途 に よっ て カス タマ イズ する こと が
> で きま せん 。Tesseract は 多数 の 言語 に 対応 し 、Linux、macOS、Windows で 動作 し ます 。
>
> --- end ---
>
>
> This result is same between Ubuntu (beta.1) and macOS
> (4.0.0-beta.2-586-g607e).
>
>
>
> Thanks.
>



Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-23 Thread Shree Devi Kumar
Which version of tesseract are you using?

Please post output of

tesseract -v

On Tue 24 Jul, 2018, 2:26 AM Emiliano Isaza Villamizar, 
wrote:

> Hello everyone,
>
>
> I'm trying to train tesseract to improve the detection of some prices such
> as: CN¥2,400.48. I got to a point where I keep getting this error:
>
> *total=`cat data/all-lstmf | wc -l` \*
> *   no=`echo "$total * 0.90 / 1" | bc`; \*
> *   head -n "$no" data/all-lstmf > "data/list.train"*
> *total=`cat data/all-lstmf | wc -l` \*
> *   no=`echo "($total - $total * 0.90) / 1" | bc`; \*
> *   tail -n "+$no" data/all-lstmf > "data/list.eval"*
> *combine_lang_model \*
> *  --input_unicharset data/unicharset \*
> *  --script_dir
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master
> \*
> *  --words
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.wordlist
> \*
> *  --numbers
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.numbers
> \*
> *  --puncs
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.punc
> \*
> *  --output_dir data/ \*
> *  --lang eng*
> *Loaded unicharset of size 113 from file data/unicharset*
> *Setting unichar properties*
> *Other case É of é is not in unicharset*
> *Setting script properties*
> *Config file is optional, continuing...*
> *Failed to read data from:
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/langdata-master/eng/eng.config*
> *Null char=2*
> *Reducing Trie to SquishedDawg*
> *Reducing Trie to SquishedDawg*
> *Reducing Trie to SquishedDawg*
> *mkdir -p data/checkpoints*
> *lstmtraining \*
> *  --continue_from
>  
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm
> \*
> *  --old_traineddata
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.traineddata
> \*
> *  --traineddata data/eng/eng.traineddata \*
> *  --model_output data/checkpoints/eng \*
> *  --debug_interval -1 \*
> *  --train_listfile data/list.train \*
> *  --eval_listfile data/list.eval \*
> *  --sequential_training \*
> *  --max_iterations 3000*
> *Loaded file
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm,
> unpacking...*
> *Warning: LSTMTrainer deserialized an LSTMRecognizer!*
> *Code range changed from 111 to 112!*
> *Num (Extended) outputs,weights in Series:*
> *  1,36,0,1:1, 0*
> *Num (Extended) outputs,weights in Series:*
> *  C3,3:9, 0*
> *  Ft16:16, 160*
> *Total weights = 160*
> *  [C3,3Ft16]:16, 160*
> *  Mp3,3:16, 0*
> *  Lfys64:64, 20736*
> *  Lfx96:96, 61824*
> *  Lrx96:96, 74112*
> *  Lfx512:512, 1247232*
> *  Fc112:112, 0*
> *Total weights = 1404064*
> *Previous null char=110 mapped to 111*
> *Continuing from
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm*
> *Loaded 1/1 pages (1-1) of document
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/72b.lstmf*
> *Loaded 1/1 pages (1-1) of document
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/67e.lstmf*
> *Loaded 1/1 pages (1-1) of document
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/75c.lstmf*
> *Loaded 1/1 pages (1-1) of document
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/48b.lstmf*
> *Iteration 0: ALIGNED TRUTH : CN¥2,400.48*
> *Iteration 0: BEST OCR TEXT : ₩₩₩N₩₩4₩0₩0₩4₩8*
> *File
> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/data/train/72b.lstmf
> page 0 :*
> *!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244*
> *!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244*
> *!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244*
> *Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed*
> *make: *** [data/checkpoints/eng_checkpoint] Segmentation fault (core
> dumped)*
>
> I already tried downloading the best/tessdata eng.traineddata and
> substituting it in continue_from, but I haven't been able to get past this
> error. Any thoughts?
>


Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-24 Thread Shree Devi Kumar
Please see
https://github.com/tesseract-ocr/tessdata_fast#example---jpn-and--japanese
for Ray's comment regarding the 'script' traineddata.


preserve_interword_spaces 1

  was added via  jpn.config to jpn.traineddata file and other CJK languages
to fix this issue - see
https://github.com/tesseract-ocr/tessdata_fast/pull/7

We probably did not make the same change for the script traineddata files.

You can test this by setting the config variable on the command line:

-c preserve_interword_spaces=1

(-c takes name=value, so the = sign is needed.)
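
A sketch of the full invocation with the variable set on the command line, reusing the image name and language from the report quoted below, to test whether the extra spaces go away. The guard around the call and the skip message are illustrative additions, not part of Tesseract:

```shell
img=test_jpn_04.jpg   # image name taken from the quoted report

if command -v tesseract >/dev/null 2>&1 && [ -f "$img" ]; then
  # -c sets a config variable for this run only.
  tesseract "$img" stdout -l Japanese -c preserve_interword_spaces=1
else
  echo "skipped: tesseract or $img not found"
fi
```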

On Tue, Jul 24, 2018 at 10:40 AM Atsuyoshi Suzuki <
atuyosi.unloc...@gmail.com> wrote:

> Hi Shree.
>
> I use tessdata_fast.
>
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2018-07-26 Thread Shree Devi Kumar
There is no official traineddata for san_latn or IAST. I have created some
experimental versions, but the output is not fully accurate.



On Fri 27 Jul, 2018, 12:21 AM John Muccigrosso,  wrote:

> You're telling tesseract that your text is in Latin. You need the
> traineddata for san-lat.
>



Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2018-07-26 Thread Shree Devi Kumar
You can try IAST ones from
https://github.com/Shreeshrii/tessdata_shreetest?files=1

On Fri 27 Jul, 2018, 8:27 AM Shree Devi Kumar,  wrote:

> There is no official traineddata for san_latn or IAST. I have created some
> experimental versions, but the output is not fully accurate.
>
>
>
> On Fri 27 Jul, 2018, 12:21 AM John Muccigrosso, 
> wrote:
>
>> You're telling tesseract that your text is in Latin. You need the
>> traineddata for san-lat.
>>
>



Re: [tesseract-ocr] Can't symlink into tessdata anymore?

2018-07-27 Thread Shree Devi Kumar
@zdenko podobny 

Please see https://github.com/tesseract-ocr/tessdata/issues/18
ita.special-words missing #18
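
To tell a broken link apart from a file that was never present in the data
set, it can help to list the symlinks whose targets do not exist. This is a
minimal sketch using only standard find and test; the function name
list_dangling_links is hypothetical, and the directory in the usage line is
the one from the error message quoted below.

```shell
#!/bin/sh
# Print every symlink under the given directory whose target is missing.
# -H makes find follow the directory argument itself if it is a symlink.
list_dangling_links() {
  find -H "$1" -type l ! -exec test -e {} \; -print
}

# Usage:
#   list_dangling_links /usr/local/Cellar/tesseract/3.05.02/share/tessdata
```

An empty result suggests the links are fine and the named file is simply
absent from the linked data.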

On Fri, Jul 27, 2018 at 11:55 AM Zdenko Podobny  wrote:

> A symlink is a filesystem feature, and tesseract uses standard C++ functions
> for reading and writing files, so there is no reason why this would be a
> bug in tesseract.
>
> But it seems that you are doing something non-standard, because
> ita.special-words is not a file that tesseract would open if you just
> specified "-l ita".
>
> Zdenko
>
>
> št 26. 7. 2018 o 20:49 John Muccigrosso  napísal(a):
>
>> Relevant earlier discussion here
>> 
>> .
>>
>> I install tesseract via Homebrew and have been symlinking tessdata into
>> the appropriate directory. When I did this this morning, I got complaints
>> from tesseract that the language files were not there:
>>
>> Error: failed to load /usr/local/Cellar/tesseract/3.05.02/share/tessdata/ita.special-words
>>
>> if I try "-l ita". This problem goes away when I copy the relevant files
>> instead of symlinking them.
>>
>> This seems like a bug to me.
>>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] tesseract-4.0.0-beta.3 - testing problem

2018-07-28 Thread Shree Devi Kumar
Test-related files have been moved to a new repo under tesseract-ocr:
https://github.com/tesseract-ocr/test

You need to update that submodule (as with googletest) for all the files to
be available.

The wiki may not have been updated for this change; you can add the
appropriate instructions to it.
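
The submodule step can be sketched as follows. This assumes a git checkout
of tesseract in which googletest and the test repo are registered as
submodules, and that "make check" is run from a configured build tree; the
function name run_checks and the DRY_RUN switch are hypothetical helpers
added so the commands can be previewed first.

```shell
#!/bin/sh
# Fetch all submodules, then run the unit tests.
# With DRY_RUN set (e.g. DRY_RUN=1) the commands are printed, not executed.
run_checks() {
  run=${DRY_RUN:+echo}
  $run git submodule update --init --recursive &&
  $run make check
}

# Usage:
#   DRY_RUN=1 run_checks   # preview
#   run_checks             # actually run
```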

On Sat, Jul 28, 2018 at 1:32 PM Marco Atzeri  wrote:

> With cygwin 64bit
>
> 1) I see an excess of "ln" during testing
>
> make[2]: 'libgmock_main.la' is up to date.
> mkdir -p ../test/testing
> ln -s
> /cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/src/tesseract-4.0.0-beta.3/test/testing/phototest.tif
>
> ../test/testing/phototest.tif
> mkdir -p ../test/testing
> ln -s
> /cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/src/tesseract-4.0.0-beta.3/test/testing/phototest.txt
>
> ../test/testing/phototest.txt
> ...
> make[2]: 'tesseracttests.exe' is up to date.
> make[2]: Leaving directory
>
> '/cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/build/unittest'
> make  check-TESTS
> make[2]: Entering directory
>
> '/cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/build/unittest'
> make[3]: Entering directory
>
> '/cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/build/unittest'
> mkdir -p ../test/testing
> ln -s
> /cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/src/tesseract-4.0.0-beta.3/test/testing/phototest.tif
>
> ../test/testing/phototest.tif
> ln: failed to create symbolic link '../test/testing/phototest.tif': File
> exists
> make[3]: *** [Makefile:1318:
> /cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/build/test/testing/phototest.tif]
>
> Error 1
> mkdir -p ../test/testing
> ln -s
> /cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/src/tesseract-4.0.0-beta.3/test/testing/phototest.txt
>
> ../test/testing/phototest.txt
> ln: failed to create symbolic link '../test/testing/phototest.txt': File
> exists
> make[3]: *** [Makefile:1322:
> /cygdrive/d/cyg_pub/devel/tesseract/prove2/tesseract-ocr-4.0.0-0.3.x86_64/build/test/testing/phototest.txt]
>
> Error 1
>
>
> 2) Question: is any additional download needed for testing
> tesseract, other than https://github.com/google/googletest ?
> The instructions are missing any detail on "make check".
>
> FAIL: apiexample_test.exe
> FAIL: progress_test.exe
> PASS: intsimdmatrix_test.exe
> PASS: matrix_test.exe
> FAIL: osd_test.exe
> FAIL: loadlang_test.exe
> PASS: tesseracttests.exe
>
> Regards
> Marco
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] combine_tessdata. Failed to read /usr/share/tesseract-ocr/tessdata/foo.traineddata

2018-07-29 Thread Shree Devi Kumar
CONTINUE_FROM should be used when you want to train a new language based on
an existing language, or to add some characters to an existing language.

There is no existing language called 'foo'. You should replace it with the
lang code for the language you are training.
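
A quick pre-flight check can catch this before combine_tessdata fails. The
sketch below only tests that a readable <lang>.traineddata exists in the
tessdata directory; have_start_model is a hypothetical helper, and "eng" in
the usage line is just an example of an existing language code. MODEL_NAME,
CONTINUE_FROM and the training target are the names used in the Makefile
quoted below.

```shell
#!/bin/sh
# Succeed only if <tessdata>/<lang>.traineddata exists and is readable.
have_start_model() {
  tessdata=$1
  lang=$2
  [ -r "$tessdata/$lang.traineddata" ]
}

# Usage:
#   have_start_model /usr/share/tesseract-ocr/tessdata eng &&
#     make training MODEL_NAME=foo CONTINUE_FROM=eng
```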

On Sun, Jul 29, 2018 at 9:44 PM  wrote:

> I duplicated the tessdata and am still getting this error:
>
> combine_tessdata -u /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.traineddata /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.
> Failed to read /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.traineddata
> Makefile:97: recipe for target 'data/unicharset' failed
>
> I can't find foo.traineddata in this folder.
>
>
>
>
> On Sunday, July 29, 2018 at 5:19:05 PM UTC+2, chandra churh chatterjee
> wrote:
>>
>> keep the foo.traineddata inside the tessdata folder and then run the
>> command.
>>
>> On Sun, Jul 29, 2018 at 5:00 AM  wrote:
>>
>>> I am using a script to train an LSTM model. I have the images and box
>>> files.
>>>
>>>
>>> My problem is that an error is returned when the combine_tessdata command
>>> is executed. I have also checked, and no file called foo.traineddata is
>>> created.
>>>
>>>
>>> Here is the code (a Makefile):
>>> export
>>>
>>>
>>> SHELL := /bin/bash
>>> LOCAL := $(PWD)/usr
>>> PATH := $(LOCAL)/bin:$(PATH)
>>> TESSDATA =  /usr/share/tesseract-ocr/tessdata
>>> LANGDATA = $(PWD)/langdata
>>>
>>>
>>> # Name of the model to be built. Default: $(MODEL_NAME)
>>> MODEL_NAME = foo
>>>
>>>
>>> # Name of the model to continue from. Default: $(CONTINUE_FROM)
>>> CONTINUE_FROM = $(MODEL_NAME)
>>>
>>>
>>> # No of cores to use for compiling leptonica/tesseract. Default: $(CORES)
>>> CORES = 4
>>>
>>>
>>> # Leptonica version. Default: $(LEPTONICA_VERSION)
>>> LEPTONICA_VERSION := 1.75.3
>>>
>>>
>>> # Tesseract commit. Default: $(TESSERACT_VERSION)
>>> TESSERACT_VERSION := 9ae97508aed1e5508458f1181b08501f984bf4e2
>>>
>>>
>>> # Tesseract langdata version. Default: $(LANGDATA_VERSION)
>>> LANGDATA_VERSION := master
>>>
>>>
>>> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
>>> TESSDATA_REPO = _fast
>>>
>>>
>>> # Train directory. Default: $(TRAIN)
>>> TRAIN := data/train
>>>
>>>
>>> # Normalization Mode - see src/training/language_specific.sh for
>>> details. Default: $(NORM_MODE)
>>> NORM_MODE = 2
>>>
>>>
>>> # Page segmentation mode. Default: $(PSM)
>>> PSM = 6
>>>
>>>
>>> # Ratio of train / eval training data. Default: $(RATIO_TRAIN)
>>> RATIO_TRAIN := 0.90
>>>
>>>
>>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>>
>>>
>>> help:
>>>  @echo ""
>>>  @echo "  Targets"
>>>  @echo ""
>>>  @echo "    unicharset       Create unicharset"
>>>  @echo "    lists            Create lists of lstmf filenames for training and eval"
>>>  @echo "    training         Start training"
>>>  @echo "    proto-model      Build the proto model"
>>>  @echo "    leptonica        Build leptonica"
>>>  @echo "    tesseract        Build tesseract"
>>>  @echo "    tesseract-langs  Download tesseract-langs"
>>>  @echo "    langdata         Download langdata"
>>>  @echo "    clean            Clean all generated files"
>>>  @echo ""
>>>  @echo "  Variables"
>>>  @echo ""
>>>  @echo "    MODEL_NAME         Name of the model to be built. Default: $(MODEL_NAME)"
>>>  @echo "    CONTINUE_FROM      Name of the model to continue from. Default: $(CONTINUE_FROM)"
>>>  @echo "    CORES              No of cores to use for compiling leptonica/tesseract. Default: $(CORES)"
>>>  @echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
>>>  @echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
>>>  @echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
>>>  @echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
>>>  @echo "    TRAIN              Train directory. Default: $(TRAIN)"
>>>  @echo "    NORM_MODE          Normalization Mode - see src/training/language_specific.sh for details. Default: $(NORM_MODE)"
>>>  @echo "    PSM                Page segmentation mode. Default: $(PSM)"
>>>  @echo "    RATIO_TRAIN        Ratio of train / eval training data. Default: $(RATIO_TRAIN)"
>>>
>>>
>>> # END-EVAL
>>>
>>>
>>> ALL_BOXES = data/all-boxes
>>> ALL_LSTMF = data/all-lstmf
>>>
>>>
>>> # Create unicharset
>>> unicharset: data/unicharset
>>>
>>>
>>> # Create lists of lstmf filenames for training and eval
>>> lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>
>>>
>>> data/list.train: $(ALL_LSTMF)
>>>  total=`cat $(ALL_LSTMF) | wc -l` \
>>> no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>>> head -n "$$no" $(ALL_LSTMF) > "$@"
>>>
>>>
>>> data/list.eval: $(ALL_LSTMF)
>>>  total=`cat $(ALL_LSTMF) | wc -l` \
>>> no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>>> tail -n "+$$no" $(ALL_LSTMF) > "$@"
>>>
>>>
>>> # Start training
>>> training:

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-02 Thread Shree Devi Kumar
Please use latest scripts from https://github.com/OCR-D/ocrd-train

On Fri, Aug 3, 2018 at 4:41 AM May  wrote:

>
> 
>
>
>
> 
>
>
>
> Here are attached photos
>
>
> On Thursday, August 2, 2018 at 4:08:11 PM UTC-7, May wrote:
>>
>> Hey all,
>>
>> I am following Shree's script for OCR-d in the google groups for
>> ocrd-training (
>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ). I
>> managed to pass the combine tessdata stage but got stuck at the
>> unicharset stage:
>>
>>
>>
>> I have edited the script to direct it to my path:
>>
>> I do find a unicharset file named "unicharset" but not as
>> "my.unicharset". Changing the script by removing "my." also did not solve
>> the problem. Do you know what's causing the issue?
>>
>> Best
>> May
>>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Error on combine_lang_model script; Null char=2 Invalid format in radical table at line 4: 3400 1.4 Creation of encoded unicharset failed!! Error writing recoder!!

2018-08-05 Thread Shree Devi Kumar
You are using an old version of tesseract. Please use the latest version
from GitHub.

Make sure you remove/uninstall the old version.

Your error is related to the radical-stroke file in langdata. Make sure you
use the latest version of the langdata repo.

>Invalid format in radical table at line 4: 34001.4
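
One way to confirm that no old install is still being picked up is to list
every tesseract executable on PATH. A minimal sketch in plain sh; the
function name list_tesseract_binaries is hypothetical, and the same loop
could be applied to combine_lang_model if the training tools are installed
separately.

```shell
#!/bin/sh
# Print each executable named "tesseract" found on PATH, in search order.
# More than one line of output means more than one install is visible.
list_tesseract_binaries() {
  old_ifs=$IFS
  IFS=:
  for dir in $PATH; do
    # An empty PATH entry means the current directory.
    [ -x "${dir:-.}/tesseract" ] && printf '%s\n' "${dir:-.}/tesseract"
  done
  IFS=$old_ifs
}

# Usage:
#   list_tesseract_binaries
#   tesseract -v
```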

On Mon 6 Aug, 2018, 9:41 AM Shandigutt,  wrote:

> Hi,
>
> I am trying to train Tesseract for Sinhala language. I was following training
> guidelines
> 
> mentioned in Github wiki. I get an error with reference to the 4th step
> which is "Creating Starter Traineddata". Please find the below command I
> executed,
>
> training/combine_lang_model --input_unicharset
> ../training/sin/sin.unicharset --script_dir ../langdata --words
> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers
> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin
> --version_str 1.0 --lang sin
>
> I get the following output,
>
> Loaded unicharset of size 94 from file ../training/sin/sin.unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 4 = ී
> Warning: properties incomplete for index 6 = ි
> Warning: properties incomplete for index 11 = ු
> Warning: properties incomplete for index 15 = ්‌
> Warning: properties incomplete for index 33 = ූ
> Warning: properties incomplete for index 52 = ්‍ර
> Warning: properties incomplete for index 56 = ්‍ය
> Warning: properties incomplete for index 87 = ක්‍
> Warning: properties incomplete for index 93 = ර්‍
> Config file is optional, continuing...
> Null char=2
> Invalid format in radical table at line 4: 34001.4
> Creation of encoded unicharset failed!!
> Error writing recoder!!
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
>
> For more information I have attached my sin.unicharset file and sin.config
> files.
>
> I use below Tesseract version,
>
> tesseract -v
> tesseract 4.00.00dev-696-geba0ae3
>  leptonica-1.74.4
>   libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib
> 1.2.8
>
>  Found SSE
>
> I use below OS,
>
> uname -a
> Linux shandigutt-laptop-ubuntu 4.4.0-130-generic #156-Ubuntu SMP Thu Jun
> 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> I would appreciate it if somebody could help me with this.
>
> Thanks
>



Re: [tesseract-ocr] tesseract-4.0.0-beta.3 - testing problem

2018-08-06 Thread Shree Devi Kumar
One of the tests is for developers to verify that all traineddata files are
valid and load OK, so it needs the complete tessdata_fast and tessdata_best
repos.

The tests have not been set up for users.




On Mon 6 Aug, 2018, 1:44 PM Marco Atzeri,  wrote:

> Am 28.07.2018 um 10:08 schrieb Shree Devi Kumar:
> > Test related info has been moved to a new repo under tesseract-ocr
> > https://github.com/tesseract-ocr/test
> >
> > You need to update that submodule (similar to googletest) for all files
> > to be available.
> >
> > It's possible that the wiki has not been updated for the same, you can
> > add appropriate instructions to it.
> >
>
> From what I see, besides "googletest" and "test", successfully testing
> tesseract also needs "tessdata", "tessdata_fast" and "tessdata_best".
>
> As the last three are 2.3 GB of data (compressed), maybe the
> needed subset should be listed somewhere, so that only
> the files needed for the tests are downloaded.
>
> Testing Beta 4 on cygwin 64bit
>
>
> 
> Testsuite summary for tesseract 4.0.0-beta.3
>
> 
> # TOTAL: 7
> # PASS:  7
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
>
> 
>



Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-06 Thread Shree Devi Kumar
The OCR-D scripts are geared towards tesseract 4.0.x. You are trying to use
them with tesseract 3.05.
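
The version can be checked before the training scripts are started. The
sketch below assumes that the first line of "tesseract -v" has the form
"tesseract <major>.<rest>", as in the version listings quoted in this
thread; tesseract_major is a hypothetical helper name.

```shell
#!/bin/sh
# Read "tesseract -v" output on stdin and print the major version number.
tesseract_major() {
  sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

# Usage (2>&1 because some versions print the banner on stderr):
#   major=$(tesseract -v 2>&1 | tesseract_major)
#   [ "${major:-0}" -ge 4 ] || echo "found tesseract ${major:-?}.x, need 4.0.x" >&2
```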

On Tue 7 Aug, 2018, 10:50 AM May,  wrote:

> Hey Shree
>
> I also tried with the original script from GitHub, but faced the same
> issue, with the process stuck at unicharset_output.
>
>
> 
>
>
> These are the versions:
> tesseract 3.05.02
>  leptonica-1.75.3
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
>
>
> On Thursday, August 2, 2018 at 8:52:38 PM UTC-7, shree wrote:
>>
>> Please use latest scripts from https://github.com/OCR-D/ocrd-train
>>
>> On Fri, Aug 3, 2018 at 4:41 AM May  wrote:
>>
>>>
>>> 
>>>
>>>
>>>
>>> 
>>>
>>>
>>>
>>> Here are attached photos
>>>
>>>
>>> On Thursday, August 2, 2018 at 4:08:11 PM UTC-7, May wrote:

>>>> Hey all,
>>>>
>>>> I am following Shree's script for OCR-d in the google groups for
>>>> ocrd-training (
>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ). I
>>>> managed to pass the combine tessdata stage but got stuck at the
>>>> unicharset stage:
>>>>
>>>>
>>>>
>>>> I have edited the script to direct it to my path:
>>>>
>>>> I do find a unicharset file named "unicharset" but not as
>>>> "my.unicharset". Changing the script by removing "my." also did not solve
>>>> the problem. Do you know what's causing the issue?
>>>>
>>>> Best
>>>> May

>>
>>
>> --
>>
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>



Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
LSTM training can take hours, days, or weeks, depending on the options chosen.

You have given a complete network spec, so you are training from scratch.

Please see the following training wiki page for training related info:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
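
For comparison, fine-tuning starts from the LSTM model inside an existing
traineddata file instead of from a network spec. The commands below follow
the pattern described on that wiki page, but treat them as a sketch: the
output directory, the iteration count, and the finetune/DRY_RUN wrapper are
assumptions for illustration, and the start model is assumed to be a float
model (for example from tessdata_best).

```shell
#!/bin/sh
# Fine-tune from an existing model rather than training from scratch.
# $1 is the path to the starting traineddata (e.g. eng.traineddata).
# With DRY_RUN set, the commands are printed instead of executed.
finetune() {
  run=${DRY_RUN:+echo}
  # Extract the LSTM model from the starting traineddata.
  $run combine_tessdata -e "$1" eng.lstm &&
  # Continue training from it; no network spec is given here.
  $run lstmtraining \
    --model_output output/finetuned \
    --continue_from eng.lstm \
    --traineddata "$1" \
    --train_listfile data/list.train \
    --max_iterations 400
}

# Usage:
#   DRY_RUN=1 finetune /path/to/eng.traineddata
```

A small iteration limit keeps the run short, which matters when only a
handful of images are available.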

On Tue, Aug 7, 2018 at 12:39 PM May  wrote:

> Oh the training started by itself after a long while and still processing.
> Does it normally take that long to train 6 images?
>
>
>
>
> 
>
>
> On Monday, August 6, 2018 at 11:42:40 PM UTC-7, May wrote:
>>
>> Thanks a lot Shree. I tried the tesseract 4.0 and the training is working
>> well until it reaches the lstm-training step and got stuck there. I am
>> totally new in the training so hope you don't mind if I am asking silly
>> questions. Do you know why I got stuck? Also, would you call this training
>> fine-tuning? As I just want to improve the accuracy of existing
>> eng.langdata.
>>
>>
>> 
>>
>>
>>
>> On Monday, August 6, 2018 at 10:26:12 PM UTC-7, shree wrote:
>>>
>>> Ocr-d scripts are geared towards tesseract 4.0.x. you are trying to use
>>> it with tesseract 3.05.
>>>
>>> On Tue 7 Aug, 2018, 10:50 AM May,  wrote:
>>>
>>>> Hey Shree
>>>>
>>>> I also tried with the original script from the github. But faced the
>>>> same issue with the process stuck at unicharset_output.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> These are the versions:
>>>> tesseract 3.05.02
>>>>  leptonica-1.75.3
>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>>>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
>>>>
>>>>
>>>> On Thursday, August 2, 2018 at 8:52:38 PM UTC-7, shree wrote:
>>>>>
>>>>> Please use latest scripts from https://github.com/OCR-D/ocrd-train
>>>>>
>>>>> On Fri, Aug 3, 2018 at 4:41 AM May  wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Here are attached photos
>>>>>>
>>>>>>
>>>>>> On Thursday, August 2, 2018 at 4:08:11 PM UTC-7, May wrote:
>>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> I am following Shree's script for OCR-d in the google groups for
>>>>>>> ocrd-training (
>>>>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ).
>>>>>>> I managed to pass the combine tessdata stage but got stuck at the
>>>>>>> unicharset stage:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have edited the script to direct it to my path:
>>>>>>>
>>>>>>> I do find a unicharset file named "unicharset" but not as
>>>>>>> "my.unicharset". Changing the script by removing "my." also did not
>>>>>>> solve the problem. Do you know what's causing the issue?
>>>>>>>
>>>>>>> Best
>>>>>>> May
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
Question: why are you trying to do training?

There are hundreds of languages already supported by tesseract. Have you
tried them?

If none of them work, then you need to define what is required, e.g. is a
particular typeface required? Is the traineddata missing some required
characters? Is the language not fully supported?

Answering these questions will help you decide what training, if any, is
required.

On Tue, Aug 7, 2018 at 1:59 PM Shree Devi Kumar 
wrote:

> LSTM training can take weeks, days, or hours depending on the options chosen.
>
> You have given a complete network spec, so that is training from scratch.
>
> Please see the following training wiki page for training related info:
>
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> On Tue, Aug 7, 2018 at 12:39 PM May  wrote:
>
>> Oh the training started by itself after a long while and still
>> processing. Does it normally take that long to train 6 images?
>>
>>
>>
>>
>> <https://lh3.googleusercontent.com/-S-zqe4mmBWA/W2lFl6LakEI/AOY/1g2tCBu6-cUDZjSj8-DsyvMhl3ypueJggCLcBGAs/s1600/Capture.PNG>
>>
>>
>> On Monday, August 6, 2018 at 11:42:40 PM UTC-7, May wrote:
>>>
>>> Thanks a lot Shree. I tried tesseract 4.0 and the training works
>>> well until it reaches the lstm-training step, where it got stuck. I
>>> am totally new to training, so I hope you don't mind if I am asking silly
>>> questions. Do you know why I got stuck? Also, would you call this training
>>> fine-tuning? I just want to improve the accuracy of the existing
>>> eng.langdata.
>>>
>>>
>>> <https://lh3.googleusercontent.com/-dWRkYql4AKA/W2k9PoNsndI/AOM/zWVkkPvUCT44moZPpvt6xgYFnQ0StwxUQCLcBGAs/s1600/Capture.PNG>
>>>
>>>
>>>
>>> On Monday, August 6, 2018 at 10:26:12 PM UTC-7, shree wrote:
>>>>
>>>> OCR-D scripts are geared towards tesseract 4.0.x. You are trying to use
>>>> them with tesseract 3.05.
>>>>
>>>> On Tue 7 Aug, 2018, 10:50 AM May,  wrote:
>>>>
>>>>> Hey Shree
>>>>>
>>>>> I also tried with the original script from GitHub, but faced the
>>>>> same issue, with the process stuck at unicharset_output.
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-rFB69WQGLIg/W2krzHUjFfI/AOA/SZ4CEzUIEGMIhQUWXHfHMS9H4Yxk-ADGwCLcBGAs/s1600/Capture.PNG>
>>>>>
>>>>>
>>>>> These are the versions:
>>>>> tesseract 3.05.02
>>>>>  leptonica-1.75.3
>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>>>>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
>>>>>
>>>>>
>>>>> On Thursday, August 2, 2018 at 8:52:38 PM UTC-7, shree wrote:
>>>>>>
>>>>>> Please use latest scripts from https://github.com/OCR-D/ocrd-train
>>>>>>
>>>>>> On Fri, Aug 3, 2018 at 4:41 AM May  wrote:
>>>>>>
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-LnwUni4-lLw/W2OPUqJpn_I/ANs/Xd_-CVCdiMk0cjMmxBpVgfOSU1JeAacAgCLcBGAs/s1600/Capture.PNG>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-j3_B1CmVv9w/W2OPbuUYH3I/ANw/xmBXrNakKuMHm2L9cj-K3sCXCjFxuF80QCLcBGAs/s1600/Capture.PNG>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here are attached photos
>>>>>>>
>>>>>>>

Re: [tesseract-ocr] tesseract not able to detect handwritten text even after improving image quality

2018-08-07 Thread Shree Devi Kumar
See the FAQ:
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-use-tesseract-for-handwriting-recognition

Recently, a lot of people have tried to train 4.0 using handwriting fonts;
however, there have been no reports on the level of success they have had
doing it.



On Tue, Aug 7, 2018 at 3:28 PM  wrote:

> Hi,
>
> I'm trying to extract handwritten data from an image, but even after
> improving the image quality, tesseract is not able to detect the handwritten
> text. Can you please suggest steps to detect handwriting in the given
> image?
>
> Reference link -
> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#rescaling
>
> Image link
> https://drive.google.com/open?id=1wv0JvuedQl99fYLNn8BMUId74PtLt_Ej
>
>
>
> Regards
> Rahul
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduW%2BT%2BWS-POUMqcnXy105Lm-7P5ad6%2BFKHTSJ-8ucPdZoQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
Re finetuning - see
https://github.com/tesseract-ocr/tesseract/issues/1782#issuecomment-411018986

Have you tried to provide each word separately (e.g. using OpenCV) for
recognition?



Re: [tesseract-ocr] tesseract not able to detect handwritten text even after improving image quality

2018-08-08 Thread Shree Devi Kumar
https://groups.google.com/forum/#!searchin/tesseract-ocr/handwriting%7Csort:date

On Wed, Aug 8, 2018 at 6:21 PM  wrote:

> Hi Shree,
>
> I'm still unable to extract images from the given PNG. Can you please suggest
> any other links?
>
> Regards
> Rahul
>
> On Wednesday, August 8, 2018 at 11:29:07 AM UTC+5:30, tri9...@gmail.com
> wrote:
>>
>> Thanks shree.
>>




Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-08 Thread Shree Devi Kumar
I think this could happen if your new traineddata is not trained to as high
an accuracy level as the eng traineddata.

You can set up a debug log to verify this. See
https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
for details.
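
Alongside the debug log, one simple check is to run the same image through each language setting and compare the text that comes out. The following is only a minimal sketch, assuming `tesseract` is on the PATH and that `fo.traineddata` and `eng.traineddata` are both in the tessdata directory; the function names and output file names are hypothetical and not part of tesseract.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image, out_base, langs):
    """Build the tesseract argv for one run, e.g. langs=("fo", "eng") -> -l fo+eng."""
    return ["tesseract", str(image), str(out_base), "-l", "+".join(langs), "txt"]


def compare_language_runs(image, work_dir,
                          lang_sets=(("fo",), ("eng",), ("fo", "eng"))):
    """Run tesseract once per language set and return {lang_string: text}."""
    results = {}
    for langs in lang_sets:
        # Hypothetical naming scheme: out_fo, out_eng, out_fo_eng
        out_base = Path(work_dir) / ("out_" + "_".join(langs))
        subprocess.run(build_ocr_command(image, out_base, langs), check=True)
        # With the txt config, tesseract writes <out_base>.txt
        text_file = Path(str(out_base) + ".txt")
        results["+".join(langs)] = text_file.read_text(encoding="utf-8")
    return results
```

If the text for `fo+eng` is identical to the text for `eng` while `fo` alone differs, that would be consistent with the eng result being preferred throughout, though it does not by itself explain why.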

On Wed, Aug 8, 2018 at 6:04 PM  wrote:

> I'm trying to use a combination of two traineddata dictionaries together,
> because one of them recognises specific numbers better than the
> other.
>
> Here is an example of the code line.
>
>  $codeLine .= 'magick convert "'.$filePath.'" -quality 90 -density 300x300  -units PixelsPerInch "'.$output.'.jpg"'; //
>  $codeLine .= 'tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>
> Despite the fact that I put "fo" in front (this is the one that recognises the
> number 4 better), it still gives me an output text file that is exactly
> identical to the "eng" dictionary output when I run that on its own.
>
> For some reason, it not only prioritises eng but also ignores the fo
> traineddata file completely.
>
> The "fo" file definitely works, as I've tested it on its own.
>
> I have attached an image example of the text I'd like to OCR and the two
> relevant traineddata files.
>




Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-09 Thread Shree Devi Kumar
The output tesseract.log file should be produced in the directory from which
you run the command, usually where your OCR output is created.

On Thu, Aug 9, 2018 at 3:48 PM  wrote:

> Hello Shree, thank you for your prompt reply.
>
> I have now changed the logfile as instructed. Where can I find the output
> tesseract.log file? Will it be produced in the same location as the
> logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs ? I'm
> guessing the tesseract.log file will be produced once I've used logfile in
> the commands.
>
> Kind Regards,
>
> Damon
>
>




Re: [tesseract-ocr] Re: tesseract training flags to rtl languages

2018-08-09 Thread Shree Devi Kumar
There is an Urdu traineddata for tesseract 4. Have you tried it?

See
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017


You can also check script/Arabic which should also support Urdu.

Please provide feedback as to its accuracy for Urdu.

On Thu 9 Aug, 2018, 8:01 PM ,  wrote:

> Hello Daniel,
> I am developing OCR for the Urdu language, and I am facing the same problem:
> the model is working correctly, but the output is printed LTR. Will you please
> share the solution? Thank you in advance.
>



Re: [tesseract-ocr] Re: tesseract training flags to rtl languages

2018-08-09 Thread Shree Devi Kumar
Are you training for tesseract 3 or tesseract 4 (LSTM training)?

On Thu 9 Aug, 2018, 8:13 PM Mohammad Moin,  wrote:

> This is not very accurate. I am trying to develop my own traineddata from
> scratch. I have completed everything, but the output is LTR in testing, and
> I don't know what is wrong in the training. Can you please point it out? Thank you.
>
>
>
> --
> Regards : Mohammad Moin Ud Din
>



Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread Shree Devi Kumar
I do not know about the internal algorithms used by tesseract.

If you are having accuracy issues with certain letters and digits, I would
suggest that you fine-tune for impact using the images or a similar font.

Please see the wiki page on training 4.0 for the command; look for fine tuning
for new font/impact. Use eng.traineddata as the base, 50-100 lines of training
text, and 300-400 iterations at most.
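
As a rough illustration of that fine-tuning recipe, the two commands from the training wiki (extract the LSTM model from the base traineddata, then continue training from it with a capped iteration count) can be assembled as below. This is a sketch under assumptions: the helper name, the `work` directory layout, and the `impact` output prefix are hypothetical, and the training list file is assumed to have been generated already from the 50-100 lines of training text.

```python
from pathlib import Path


def build_finetune_commands(base_traineddata, work_dir, train_listfile,
                            max_iterations=400):
    """Return (extract_cmd, train_cmd) argv lists for a short fine-tuning run."""
    work = Path(work_dir)
    lstm_model = work / "eng.lstm"  # hypothetical location for the extracted model
    # Step 1: pull the LSTM component out of the base eng.traineddata
    extract_cmd = ["combine_tessdata", "-e", str(base_traineddata), str(lstm_model)]
    # Step 2: continue training from that model, stopping after max_iterations
    train_cmd = [
        "lstmtraining",
        "--model_output", str(work / "impact"),
        "--continue_from", str(lstm_model),
        "--traineddata", str(base_traineddata),
        "--train_listfile", str(train_listfile),
        "--max_iterations", str(max_iterations),
    ]
    return extract_cmd, train_cmd
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Keeping `max_iterations` in the 300-400 range follows the advice above to avoid training for too long on a small amount of text.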

On Fri 10 Aug, 2018, 8:39 PM ,  wrote:

> Hi Shree, just a quick update.
>
> I've now looked into the tesseract.log output further and understand
> how it works: it goes through different choices and eventually
> decides on a "best choice". However, the output doesn't explain how it then
> decides which result takes overriding priority. Even after it scours
> through the "fo" dictionary and decides on a best choice for it, it
> immediately moves on to the eng dictionary and seems to use the eng
> output because (I'm guessing) it regards it as more accurate. This
> suggests your theory about our custom "fo" dictionary not being trained
> to a high enough accuracy level is correct. Is there any way I can train
> either eng or fo to improve its accuracy, or to override another
> dictionary on specific characters it's getting wrong? For example, in
> our case, the eng.traineddata dictionary sometimes gets 3's and 5's mixed
> up, and it has a lot of trouble with 4's.
>
> Your help on this would be greatly appreciated!
>
> Kind Regards,
>
> Damon
>

Re: [tesseract-ocr] cannot install new version, please help me

2018-08-10 Thread Shree Devi Kumar
Uninstall all versions of tesseract and libtesseract-dev, then install using
the PPA from

https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
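
After reinstalling, it can help to confirm which version actually ends up on the PATH, since an old source build may still shadow the packaged binary. A minimal sketch, assuming the version banner begins with `tesseract <major>.<minor>` as in the outputs quoted in this list; the function names are hypothetical.

```python
import re
import subprocess


def parse_tesseract_version(version_text):
    """Return (major, minor) from a 'tesseract X.Y...' banner, or None if absent."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_tesseract_version():
    """Run 'tesseract --version' and parse its banner."""
    proc = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams in case the banner goes to stderr
        universal_newlines=True,
    )
    return parse_tesseract_version(proc.stdout)
```

If `installed_tesseract_version()` still reports major version 3 after installing from the PPA, an older copy (for example under /usr/local) is probably being found first.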


On Sat, Aug 11, 2018 at 11:08 AM Kimchi  wrote:

> Environment
>
>- Tesseract Version: 3.04
>- Commit Number: 3.04
>- Platform: ubuntu 16.04
>
> I installed tesseract 3.04 from source, but now I cannot upgrade to 4
> via the command:
> sudo apt install tesseract-ocr
> It still installs 3.04.
>
> Log:
> sudo apt install tesseract-ocr
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> The following NEW packages will be installed:
> tesseract-ocr
> 0 upgraded, 1 newly installed, 0 to remove and 157 not upgraded.
> Need to get 132 kB of archives.
> After this operation, 562 kB of additional disk space will be used.
> Get:1 http://vn.archive.ubuntu.com/ubuntu xenial/universe amd64
> tesseract-ocr amd64 3.04.01-4 [132 kB]
> Fetched 132 kB in 1s (90,3 kB/s)
> Selecting previously unselected package tesseract-ocr.
> dpkg: warning: files list file for package 'tesseract-ocr-osd' missing;
> assuming package has no files currently installed
> dpkg: warning: files list file for package 'tesseract-ocr-eng' missing;
> assuming package has no files currently installed
> dpkg: warning: files list file for package 'libtesseract-dev' missing;
> assuming package has no files currently installed
> dpkg: warning: files list file for package 'libtesseract3' missing;
> assuming package has no files currently installed
> dpkg: warning: files list file for package 'tesseract-ocr-equ' missing;
> assuming package has no files currently installed
> (Reading database ... 293846 files and directories currently installed.)
> Preparing to unpack .../tesseract-ocr_3.04.01-4_amd64.deb ...
> Unpacking tesseract-ocr (3.04.01-4) ...
> Processing triggers for man-db (2.7.5-1) ...
> Setting up tesseract-ocr (3.04.01-4) ...
> Thanks
>




Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-12 Thread Shree Devi Kumar
sudo apt-get remove tesseract-ocr
sudo apt-get remove libtesseract-dev

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

The above will uninstall and then install the latest version binaries from
the PPA.

Why are you then building from source again?

The PPA installs the training scripts and puts them in the path too.
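
A quick way to see whether the training tools are reachable is to look each one up on the PATH. This is a minimal sketch; the list of tool names is an assumption about what a 4.0 training setup typically uses, and the function name is hypothetical.

```python
import shutil

# Assumed set of training tools; adjust to whatever your workflow needs.
TRAINING_TOOLS = (
    "tesstrain.sh",
    "text2image",
    "lstmtraining",
    "combine_tessdata",
    "unicharset_extractor",
)


def find_training_tools(names=TRAINING_TOOLS):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}
```

Printing the result shows which tools resolve and from where, which also reveals whether a source build under /usr/local is taking precedence over the PPA binaries.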

On Mon, Aug 13, 2018 at 3:24 AM Shandigutt  wrote:

> Hi,
>
>
>- I was working with a previous version of Tesseract and I was asked
>to get the latest snapshot when I came across the below error,
>
> https://groups.google.com/forum/#!topic/tesseract-ocr/8fw9P-WoooQ
>
>
>- I followed the wiki and below video tutorials to build Tesseract
>from source,
>
> https://www.youtube.com/watch?v=vOdnt2h1U8U
> https://www.youtube.com/watch?v=WZLJucXZy-g
>
>
>- I didn't build Leptonica as I had already built and installed 1.74.4
>version,
>
>
>
>- I followed exactly below commands. I have attached the output of
>some of them,
>
>
> sudo apt-get remove tesseract-ocr
> sudo apt-get remove libtesseract-dev
>
> sudo add-apt-repository ppa:alex-p/tesseract-ocr
> sudo apt-get update
>
> sudo apt install tesseract-ocr
> sudo apt install libtesseract-dev
>
> sudo apt-get install g++ # or clang++ (presumably)
> sudo apt-get install autoconf automake libtool
> sudo apt-get install pkg-config
> sudo apt-get install libpng-dev
> sudo apt-get install libjpeg8-dev
> sudo apt-get install libtiff5-dev
> sudo apt-get install zlib1g-dev
>
> sudo apt-get install libicu-dev
> sudo apt-get install libpango1.0-dev
> sudo apt-get install libcairo2-dev
>
> git clone https://github.com/tesseract-ocr/tesseract.git  #output: git_clone.txt
> cd tesseract
> autoreconf -vi               #output: autoreconf_vi.txt
> ./autogen.sh                 #output: autogen_sh.txt
> ./configure --enable-debug   #output: configure_enable_debug.txt
> LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make  #output: make.txt
> sudo make install            #output: make_install.txt
> sudo ldconfig                #output: ldconfig.txt
> make training                #output: make_training.txt
> sudo make training-install   #output: make_training_install.txt
>
>
>
>- My OS details are as below,
>
> Linux shandigutt-laptop-ubuntu 4.4.0-130-generic #156-Ubuntu SMP Thu Jun
> 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> After building Tesseract following above steps, when I queried Tesseract
> version I get the below output,
> tesseract -v
> tesseract 4.0.0-beta.4-26-gfd49
>  leptonica-1.74.4
>   libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib
> 1.2.8
>
>  Found SSE
>
>
>- Files tree in tesseract directory after building is attached as
>tesseract_tree.txt
>
>
> I can't find all the training scripts I had previously. Appreciate your
> help on this.
>
> Thanks,
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Training tools don't get built when building tesseract from source

2018-08-14 Thread Shree Devi Kumar
libtool: install: /usr/bin/install -c .libs/combine_lang_model /usr/local/bin/combine_lang_model
libtool: install: /usr/bin/install -c .libs/combine_tessdata /usr/local/bin/combine_tessdata
libtool: install: /usr/bin/install -c .libs/dawg2wordlist /usr/local/bin/dawg2wordlist
libtool: install: /usr/bin/install -c .libs/lstmeval /usr/local/bin/lstmeval
libtool: install: /usr/bin/install -c .libs/lstmtraining /usr/local/bin/lstmtraining
libtool: install: /usr/bin/install -c .libs/merge_unicharsets /usr/local/bin/merge_unicharsets
libtool: install: /usr/bin/install -c .libs/set_unicharset_properties /usr/local/bin/set_unicharset_properties
libtool: install: /usr/bin/install -c .libs/text2image /usr/local/bin/text2image
libtool: install: /usr/bin/install -c .libs/unicharset_extractor /usr/local/bin/unicharset_extractor
libtool: install: /usr/bin/install -c .libs/wordlist2dawg /usr/local/bin/wordlist2dawg
libtool: install: /usr/bin/install -c .libs/ambiguous_words /usr/local/bin/ambiguous_words
libtool: install: /usr/bin/install -c .libs/classifier_tester /usr/local/bin/classifier_tester
libtool: install: /usr/bin/install -c .libs/cntraining /usr/local/bin/cntraining
libtool: install: /usr/bin/install -c .libs/mftraining /usr/local/bin/mftraining
libtool: install: /usr/bin/install -c .libs/shapeclustering /usr/local/bin/shapeclustering


The files are installed in /usr/local/bin
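
If it helps, here is a small Python sketch for checking which of those training tools the shell can actually find. The tool names are copied from the install log; the helper name locate_tools is made up for illustration, and the check only looks at PATH, so it says nothing about whether the binaries themselves work.

```python
import shutil

# Tool names taken from the "libtool: install" log for the training tools.
TRAINING_TOOLS = [
    "combine_lang_model", "combine_tessdata", "dawg2wordlist", "lstmeval",
    "lstmtraining", "merge_unicharsets", "set_unicharset_properties",
    "text2image", "unicharset_extractor", "wordlist2dawg", "ambiguous_words",
    "classifier_tester", "cntraining", "mftraining", "shapeclustering",
]


def locate_tools(names, path=None):
    """Map each tool name to the location found on PATH, or None if missing.

    `path` is an optional PATH-style string; None means the current PATH.
    """
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    for name, location in locate_tools(TRAINING_TOOLS).items():
        print(f"{name:28s} {location or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND is either not installed or installed in a directory that is not on PATH (for a source build, /usr/local/bin would need to be listed there).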



Re: [tesseract-ocr] Training tools don't get built when building tesseract from source

2018-08-14 Thread Shree Devi Kumar
│   │   ├── training
│   │   │   ├── ambiguous_words
│   │   │   ├── ambiguous_words.o
│   │   │   ├── boxchar.lo
│   │   │   ├── boxchar.o
│   │   │   ├── classifier_tester
│   │   │   ├── classifier_tester.o
│   │   │   ├── cntraining
│   │   │   ├── cntraining.o
│   │   │   ├── combine_lang_model
│   │   │   ├── combine_lang_model.o
│   │   │   ├── combine_tessdata
│   │   │   ├── combine_tessdata.o
│   │   │   ├── commandlineflags.lo
│   │   │   ├── commandlineflags.o
│   │   │   ├── commontraining.lo
│   │   │   ├── commontraining.o
│   │   │   ├── dawg2wordlist
│   │   │   ├── dawg2wordlist.o
│   │   │   ├── degradeimage.lo
│   │   │   ├── degradeimage.o
│   │   │   ├── fileio.lo
│   │   │   ├── fileio.o
│   │   │   ├── lang_model_helpers.lo
│   │   │   ├── lang_model_helpers.o
│   │   │   ├── libtesseract_tessopt.la
│   │   │   ├── libtesseract_training.la
│   │   │   ├── ligature_table.lo
│   │   │   ├── ligature_table.o
│   │   │   ├── lstmeval
│   │   │   ├── lstmeval.o
│   │   │   ├── lstmtester.lo
│   │   │   ├── lstmtester.o
│   │   │   ├── lstmtraining
│   │   │   ├── lstmtraining.o
│   │   │   ├── Makefile
│   │   │   ├── mergenf.o
│   │   │   ├── merge_unicharsets
│   │   │   ├── merge_unicharsets.o
│   │   │   ├── mftraining
│   │   │   ├── mftraining.o
│   │   │   ├── normstrngs.lo
│   │   │   ├── normstrngs.o
│   │   │   ├── pango_font_info.lo
│   │   │   ├── pango_font_info.o
│   │   │   ├── set_unicharset_properties
│   │   │   ├── set_unicharset_properties.o
│   │   │   ├── shapeclustering
│   │   │   ├── shapeclustering.o
│   │   │   ├── stringrenderer.lo
│   │   │   ├── stringrenderer.o
│   │   │   ├── tessopt.lo
│   │   │   ├── tessopt.o
│   │   │   ├── text2image
│   │   │   ├── text2image.o
│   │   │   ├── tlog.lo
│   │   │   ├── tlog.o
│   │   │   ├── unicharset_extractor
│   │   │   ├── unicharset_extractor.o
│   │   │   ├── unicharset_training_utils.lo
│   │   │   ├── unicharset_training_utils.o
│   │   │   ├── validate_grapheme.lo
│   │   │   ├── validate_grapheme.o
│   │   │   ├── validate_indic.lo
│   │   │   ├── validate_indic.o
│   │   │   ├── validate_javanese.lo
│   │   │   ├── validate_javanese.o
│   │   │   ├── validate_khmer.lo
│   │   │   ├── validate_khmer.o
│   │   │   ├── validate_myanmar.lo
│   │   │   ├── validate_myanmar.o
│   │   │   ├── validator.lo
│   │   │   ├── validator.o
│   │   │   ├── wordlist2dawg
│   │   │   └── wordlist2dawg.o





-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] How to read texts from a table into arrays with tesseract, given the coordinates of the column and row boundaries?

2018-08-15 Thread Shree Devi Kumar
Check whether the hOCR or TSV output is useful; both include bounding-box
coordinates for the recognized text.
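
As a rough sketch of how the TSV output could be mapped onto a table (an illustration only, not a built-in Tesseract feature): each word row in the TSV carries left, top, width and height, so a word can be assigned to the cell that contains its centre point. The function name tsv_to_table and the col_edges / row_edges inputs are hypothetical; the edges stand for the boundary coordinates you already obtained.

```python
import csv
import io
from bisect import bisect_right


def tsv_to_table(tsv_text, col_edges, row_edges):
    """Place each recognized word into the cell containing its centre point.

    col_edges / row_edges are sorted pixel coordinates of the vertical and
    horizontal table boundaries (hypothetical inputs from an earlier step).
    """
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for rec in reader:
        text = (rec.get("text") or "").strip()
        if rec["level"] != "5" or not text:  # level 5 rows are words
            continue
        cx = int(rec["left"]) + int(rec["width"]) / 2
        cy = int(rec["top"]) + int(rec["height"]) / 2
        c = bisect_right(col_edges, cx) - 1
        r = bisect_right(row_edges, cy) - 1
        if 0 <= r < n_rows and 0 <= c < n_cols:
            # Words arrive in reading order, so append within the cell.
            table[r][c] = (table[r][c] + " " + text).strip()
    return table
```

The TSV text itself can be produced with something like "tesseract table.png out tsv" (file names are placeholders) and read from out.tsv. Words whose centre falls outside the outermost edges are silently dropped in this sketch.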

On Wed, Aug 15, 2018 at 4:24 PM, Bec Zhao  wrote:

> Hi,
>
> I want to extract text from tables into arrays that represent the rows
> and columns of the table.
> I have already used OpenCV to obtain the precise boundaries of the table.
> Now I want to know which syntax can extract the text from the table and
> put it into arrays according to the coordinates of the boundaries (or
> perhaps the joints of the boundaries)?
>
>
> Thanks!
>
>
>



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Make lstm for some files

2018-08-16 Thread Shree Devi Kumar
You need to make an .lstmf file for each of these.

e.g.  tesseract fas.B_Mitra.exp0.tif fas.B_Mitra.exp0 --psm 6 lstm.train

will create fas.B_Mitra.exp0.lstmf
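
To repeat this for every tif/box pair in a directory, a loop along the following lines might do. This is only a sketch: the function name lstmf_commands is invented here, it assumes each .box file sits next to its .tif with the same base name, and it only builds the command lines rather than running them.

```python
from pathlib import Path


def lstmf_commands(directory, psm=6):
    """Build one tesseract command per .tif that has a matching .box file."""
    commands = []
    for tif in sorted(Path(directory).glob("*.tif")):
        if not tif.with_suffix(".box").exists():
            continue  # no box file next to this image, so skip it
        base = tif.with_suffix("")  # e.g. fas.B_Mitra.exp0
        commands.append(["tesseract", str(tif), str(base),
                         "--psm", str(psm), "lstm.train"])
    return commands


if __name__ == "__main__":
    for cmd in lstmf_commands("."):
        print(" ".join(cmd))
```

Each printed line can be run as is, or each list can be passed to subprocess.run(cmd, check=True) once the output looks right.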



On Thu, Aug 16, 2018 at 5:40 PM, Zohreh Khosrobeygi 
wrote:

> I have some tif and box files for each font for example:
> fas.B_Mitra.exp0.box
> fas.B_Mitra.exp0.tif
> fas.B_Mitra.exp1.box
> fas.B_Mitra.exp1.tif
> fas.B_Mitra.exp2.box
> fas.B_Mitra.exp2.tif
> .
> .
> .
> How can I make lstm for each of them?
> Thx.
>



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Infinite Loop of Compute CTC targets failed!

2018-08-17 Thread Shree Devi Kumar
Please build the latest code (4.0.0-beta.4) and run the same test.

On Fri, Aug 17, 2018 at 4:44 PM,  wrote:

> ### Environment
>
> * **Tesseract Version**: 4.0.0-beta.1-306-g45b11cd
> * **Commit Number**: 4.0.0-beta.1-306-g45b11cd
> * **Platform**: Ubuntu x86_64 GNU/Linux
> ### Current Behavior:
>
> Infinite loop of Compute CTC targets failed
>
> I have a box file and tif images and i run the below script for training.
> ```
>
> ALL_BOXES = data/all-boxes
> ALL_LSTMF = data/all-lstmf
>
> # Create unicharset
> unicharset: data/unicharset
>
> # Create lists of lstmf filenames for training and eval
> lists: $(ALL_LSTMF) data/list.train data/list.eval
>
> data/list.train: $(ALL_LSTMF)
> total=`cat $(ALL_LSTMF) | wc -l` \
>no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>head -n "$$no" $(ALL_LSTMF) > "$@"
>
> data/list.eval: $(ALL_LSTMF)
> total=`cat $(ALL_LSTMF) | wc -l` \
>no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>tail -n "+$$no" $(ALL_LSTMF) > "$@"
>
> # Start training
> training: data/$(MODEL_NAME).traineddata
>
> data/unicharset: $(ALL_BOXES)
>
> combine_tessdata -u $(TESSDATA)/eng.traineddata  $(TESSDATA)/$(MODEL_NAME).
> unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
> merge_unicharsets $(TESSDATA)/$(MODEL_NAME).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
>
> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
> find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>
> #$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%.gt.txt
> #python3 generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*.gt.txt" > "$@"
>
> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
> find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>
> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
> tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm $(PSM) lstm.train
>
> # Build the proto model
> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>
> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
> combine_lang_model \
>   --input_unicharset data/unicharset \
>   --script_dir $(LANGDATA) \
>   --output_dir data/ \
>   --lang $(MODEL_NAME)
>
> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
> mkdir -p data/checkpoints
> lstmtraining \
>   --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>   --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
>   --model_output data/checkpoints/$(MODEL_NAME) \
>   --learning_rate 20e-4 \
>   --train_listfile data/list.train \
>   --eval_listfile data/list.eval \
>   --max_iterations 1
> ```
>
> Here are the logs, including the error.
>
> ```
> find data/train -name '*.box' -exec cat {} \; > "data/all-boxes"
> #python3 generate_line_box.py -i "data/train/.tif" -t "data/train/.gt.txt" > "data/all-boxes"
> combine_tessdata -u /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/eng.traineddata  /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.
> Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
> 1:unicharset:size=7477, offset=192
> 2:unicharambigs:size=1047, offset=7669
> 3:inttemp:size=976552, offset=8716
> 4:pffmtable:size=844, offset=985268
> 5:normproto:size=13408, offset=986112
> 6:punc-dawg:size=4322, offset=999520
> 7:word-dawg:size=1082890, offset=1003842
> 8:number-dawg:size=6426, offset=2086732
> 9:freq-dawg:size=1410, offset=2093158
> 13:shapetable:size=63346, offset=2094568
> 14:bigram-dawg:size=16109842, offset=2157914
> 17:lstm:size=1487588, offset=18267756
> 18:lstm-punc-dawg:size=4322, offset=19755344
> 19:lstm-word-dawg:size=3694794, offset=19759666
> 20:lstm-number-dawg:size=4738, offset=23454460
> 21:lstm-unicharset:size=6360, offset=23459198
> 22:lstm-recoder:size=1012, offset=23465558
> 23:version:size=80, offset=23466570
> Extracting tessdata components from /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/eng.traineddata
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.unicharset
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.unicharambigs
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.inttemp
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.pffmtable
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.normproto
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.punc-dawg
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.word-dawg
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.number-dawg
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.freq-dawg
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.shapetable
> Wrote /mnt/Training_Tesseract/ocrd-train/usr/share/tessdata/Invoice.bigram-dawg
> Wrote /mnt/Training_Tesseract/ocrd-train/usr

Re: [tesseract-ocr] Make lstm for some files

2018-08-19 Thread Shree Devi Kumar
> tesseract 4.0.0-beta.1

Please upgrade to the latest code.

On Sun, Aug 19, 2018 at 11:18 PM, Khosrobeigy.zohreh  wrote:

> Hi, when I run tesstrain.sh I get this error:
> + err_exit '/tmp/tmp.N31LQSCg1a/fas/fas.Times_New_Roman.exp0.lstmf does not exist or is not readable'
> + echo -e 'ERROR: /tmp/tmp.N31LQSCg1a/fas/fas.Times_New_Roman.exp0.lstmf' does not exist or is not readable
> + tee -a /tmp/tmp.N31LQSCg1a/fas/tesstrain.log
> ERROR: /tmp/tmp.N31LQSCg1a/fas/fas.Times_New_Roman.exp0.lstmf does not exist or is not readable
> + exit 1
>
> Tesseract -v:
> tesseract 4.0.0-beta.1
>  leptonica-1.74.4
>   libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib
> 1.2.8
>
>  Found AVX2
>  Found AVX
>  Found SSE
>
>
>
>
>
>
> --
> Zohreh Khosrobeygi
> University of Tehran, 2016
> Tel: +989196042887
> khosrobeygi.zo...@ut.ac.ir 
>



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-20 Thread Shree Devi Kumar
If you want to change parameters, please look at the replace layers option.
With fine tuning you cannot change them.

On Tue 21 Aug, 2018, 7:27 AM ,  wrote:

> Is it possible to change the parameters when Fine Tuning?
>
> The documentation says "Fine tuning is the process of training an existing
> model on new data without changing any part of the network", but does that
> mean that parameters like momentum, learning rate, etc.  cannot be changed?
>



Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
lstmtraining \
  --model_output ./layer_from_deva/layer \
  --continue_from ./layer_from_deva/Devanagari.lstm \
  --append_index 5 --net_spec '[Lfx192 O1c1]' \
  --traineddata ./sanplustrain/san/san.traineddata \
  --train_listfile ./sanplustrain/san.training_files.txt \
  --eval_listfile ./sanpluseval/san.training_files.txt \
  --debug_interval 0 --max_image_MB 6000 \
  --max_iterations 500

Loaded file ./layer_from_deva/Devanagari.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ./layer_from_deva/Devanagari.lstm
Appending a new network to an old one!!Warning: given outputs 1 not equal to unicharset of 149.
Num outputs,weights in Series:
  Lfx192:192, 197376
  Fc149:149, 28757
Total weights = 226133
Built network:[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx64Lrx64Lfx192Fc149] from request [Lfx192 O1c1]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
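
As a sanity check on the weight counts in that log, the numbers can be reproduced with the usual parameter-count formulas. This is an assumption about how the counts arise (four LSTM gates, each with input, recurrent and bias weights; a fully connected layer with one bias per output), not something taken from the Tesseract source, but it does match the logged values when the new Lfx192 is fed by the 64 outputs of Lrx64.

```python
def lstm_weights(n_inputs, n_outputs):
    """Four gates, each with input, recurrent and bias weights."""
    return 4 * n_outputs * (n_inputs + n_outputs + 1)


def fc_weights(n_inputs, n_outputs):
    """Fully connected layer: one weight per input plus a bias, per output."""
    return n_outputs * (n_inputs + 1)


lfx192 = lstm_weights(64, 192)  # new layer, fed by the 64 outputs of Lrx64
fc149 = fc_weights(192, 149)    # one output per unicharset entry (149)
total = lfx192 + fc149
print(lfx192, fc149, total)     # prints: 197376 28757 226133
```

So only the appended Lfx192 and the new output layer are counted under "Num outputs,weights in Series" here, which fits the "Appending a new network to an old one" message.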


I have not tried changing the parameters even with replace layers. Do
provide feedback on your experience.


On Tue, Aug 21, 2018 at 12:00 PM Jacob Biros 
wrote:

> Thank you for the prompt response!


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
When you specify the complete network spec, as in

--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'

lstmtraining probably treats the run as training from scratch and ignores
--continue_from.

I haven't looked at how the training command is parsed; I just followed
Ray's examples for the different kinds of training.

example: from scratch

lstmtraining \
  --debug_interval -1 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000

example: finetune for impact

lstmtraining \
  --model_output ~/tesstutorial/impact_from_full/impact \
  --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
  --traineddata ../tessdata_best/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400

example: finetune plusminus

lstmtraining \
  --model_output ./plus_from_deva/sanplustrain \
  --continue_from ./plus_from_deva/Devanagari.lstm \
  --old_traineddata ../tessdata_best/script/Devanagari.traineddata \
  --traineddata ./sanplustrain/san/san.traineddata \
  --train_listfile ./sanplustrain/san.training_files.txt \
  --debug_interval 0 \
  --max_image_MB 6000 \
  --max_iterations $num_iterations

example: replace top layer

lstmtraining \
  --model_output ./layer_from_deva/layer \
  --continue_from ./layer_from_deva/Devanagari.lstm \
  --append_index 5 --net_spec '[Lfx192 O1c1]' \
  --traineddata ./sanplustrain/san/san.traineddata \
  --train_listfile ./sanplustrain/san.training_files.txt \
  --eval_listfile ./sanpluseval/san.training_files.txt \
  --debug_interval 0 \
  --max_image_MB 6000 \
  --max_iterations $num_iterations
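
For scripting these variants, a small helper like the one below could assemble the argument list. It is only a sketch: the function name lstmtraining_args is invented, it uses just the flags that appear in the examples above, and the rule that --append_index needs both --continue_from and --net_spec is my reading of the replace-top-layer example, not a documented check.

```python
def lstmtraining_args(model_output, traineddata, train_listfile,
                      continue_from=None, net_spec=None, append_index=None,
                      **extra):
    """Assemble an lstmtraining argument list from the flags used above.

    Extra keyword arguments (e.g. max_iterations=500) are appended as
    --name value pairs.
    """
    args = ["lstmtraining",
            "--model_output", model_output,
            "--traineddata", traineddata,
            "--train_listfile", train_listfile]
    if continue_from is not None:
        args += ["--continue_from", continue_from]
    if append_index is not None:
        # Assumption: replacing layers only makes sense on top of an
        # existing model and with a spec for the new layers.
        if continue_from is None or net_spec is None:
            raise ValueError("append_index needs continue_from and net_spec")
        args += ["--append_index", str(append_index)]
    if net_spec is not None:
        args += ["--net_spec", net_spec]
    for flag, value in extra.items():
        args += [f"--{flag}", str(value)]
    return args


if __name__ == "__main__":
    # The "replace top layer" example, with 500 iterations as a placeholder.
    print(" ".join(lstmtraining_args(
        "./layer_from_deva/layer",
        "./sanplustrain/san/san.traineddata",
        "./sanplustrain/san.training_files.txt",
        continue_from="./layer_from_deva/Devanagari.lstm",
        net_spec="[Lfx192 O1c1]",
        append_index=5,
        max_iterations=500,
    )))
```

Leaving out net_spec and append_index gives the plain fine-tuning form, and leaving out continue_from while passing a full net_spec gives the from-scratch form. Note that the printed line is not shell-quoted, so the net_spec would need quotes if pasted into a terminal.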



Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
On Tue, Aug 21, 2018 at 2:47 PM Jacob Biros 
wrote:

> Thank you for your response.  The command that I posted previously was for
> fine tuning, so I was hoping to figure out what was causing the changes in
> accuracy when changes to the parameters should have no effect at all.
>
> I'm not sure if we will move forward with replacing the layers, but I will
> post about it if we do.
>
> Thanks again!
>
> On Tue, Aug 21, 2018 at 6:08 PM, Shree Devi Kumar 
> wrote:
>
>> >lstmtraining --model_output ./layer_from_deva/layer --continue_from
>> ./layer_from_deva/Devanagari.lstm --append_index 5 --net_spec '[Lfx192
>> O1c1]' --traineddata ./sanplustrain/san/san.traineddata --train_listfile
>> ./sanplustrain/san.training_files.txt --eval_listfile
>> ./sanpluseval/san.training_files.txt --debug_interval 0 --max_image_MB 6000
>> --max_iterations 500
>> Loaded file ./layer_from_deva/Devanagari.lstm, unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Continuing from ./layer_from_deva/Devanagari.lstm
>> Appending a new network to an old one!!Warning: given outputs 1 not equal
>> to unicharset of 149.
>> Num outputs,weights in Series:
>>   Lfx192:192, 197376
>>   Fc149:149, 28757
>> Total weights = 226133
>> Built network:[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx64Lrx64Lfx192Fc149] from
>> request [Lfx192 O1c1]
>> Training parameters:
>>   Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
>>
>>
>> I have not tried changing the parameters, even with replace layers. Do
>> provide feedback on your experience.
>>
>>
>> On Tue, Aug 21, 2018 at 12:00 PM Jacob Biros 
>> wrote:
>>
>>> Thank you for the prompt response!
>>>
>>> On Tue, Aug 21, 2018 at 1:14 PM, Shree Devi Kumar 
>>> wrote:
>>>
>>>> If you want to change parameters, please look at the replace layers
>>>> option. With fine tuning you cannot change them.
>>>>
>>>> On Tue 21 Aug, 2018, 7:27 AM ,  wrote:
>>>>
>>>>> Is it possible to change the parameters when Fine Tuning?
>>>>>
>>>>> The documentation says "Fine tuning is the process of training an
>>>>> existing model on new data without changing any part of the network", but
>>>>> does that mean that parameters like momentum, learning rate, etc.  cannot
>>>>> be changed?
>>>>>

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
On Tue, Aug 21, 2018 at 1:16 PM  wrote:

> Sorry, one more question.  We set up 4 different machines all running the
> command below except for minor differences in the momentum and the learning
> rate.  Changing the momentum and learning rate in this situation, because
> it is fine tuning, shouldn't affect anything right?  In our case though
> each machine produced different results.  Do you have any idea what exactly
> is causing this?  I can provide more information as necessary.  Thanks.
>
> training/lstmtraining --traineddata
> ~/tesstutorial/jpntrain/jpn/jpn.traineddata \
>   --continue_from ~/fine_tuning/models/jpn.lstm \
>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>   --model_output
> ~/bouch_fine_tuning/0819_fine20fonts_mom05_lr1e-4_010/base --learning_rate
> 1e-4 \
>   --momentum 0.5 \
>   --train_listfile ~/bouch_train/jpntrain/jpn.training_files.txt \
>   --eval_listfile ~/bouch_train/jpneval/jpn.training_files.txt \
>   --max_iterations 10
> &>~/bouch_fine_tuning/0819_fine20fonts_mom05_lr1e-4_010/basetrain.log \
>   --old_traineddata /usr/local/share/tessdata/__official_jpn.traineddata \
>   --debug_interval -1
>
The above is NOT a finetuning command, since you are providing the
complete network spec.

With finetuning for impact (new font), the recommended number of iterations
is 400.

With finetuning for plusminus (adding a new character), the recommended
number of iterations is 3000-3600.

However, these iteration counts, like Ray's tutorial itself, are for
English.

I have found that they do not directly apply to other languages which
require recoding of the unicharset.
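
As a rough illustration of the iteration counts quoted above, the value passed to --max_iterations could be picked by mode. This is only a sketch: the function name suggested_iterations is hypothetical, the plusminus value is simply the upper end of the quoted range, and as noted the numbers come from the English tutorial and may not transfer to other languages.

```shell
#!/bin/sh
# Hypothetical helper: map a fine-tuning mode to the iteration count
# quoted in this thread (English tutorial values).
suggested_iterations() {
  case "$1" in
    impact)    echo 400 ;;   # fine-tuning for a new font
    plusminus) echo 3600 ;;  # adding a new character; upper end of 3000-3600
    *)         echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Usage (not executed here):
#   lstmtraining ... --max_iterations "$(suggested_iterations impact)"
```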

You will get quicker results if you replace the top layer (compared to your
earlier version, which might have started from scratch).

You can try the different commands with --debug_interval -1, which will show
the debug output on the console itself, giving you an idea of how the training
is going, e.g.

File /tmp/tmp.o98cvEGUNe/akk/akk.CuneiformOB.exp-1.lstmf page 1 (Perfect):
Mean rms=0.167%, delta=0.772%, train=2.703%(4.359%), skip ratio=0.2%
Iteration 600506: ALIGNED TRUTH : 𒀀𒈾 𒉺𒉌𒅀
Iteration 600506: BEST OCR TEXT : 𒀀𒈾 𒉺𒉌𒅀

With finetuning, iteration 1 should start with a very low error rate.

When training from scratch, the error rate may start as high as 400%.

When replacing a layer, it may start around 150% and come down to
100% after about 600 iterations.
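
To watch how the error rate moves over a run, the train= figure can be pulled out of the console output. A small sketch, assuming log lines shaped like the "Mean rms=..." sample above; the function name train_error_rates is hypothetical, and other lstmtraining versions may format this line differently:

```shell
#!/bin/sh
# Hypothetical helper: read training log lines on stdin and print the
# percentage that follows "train=" on each line that has one.
# Lines without a train= field are skipped.
train_error_rates() {
  sed -n 's/.*train=\([0-9.]*\)%.*/\1/p'
}

# Usage (not executed here):
#   lstmtraining ... --debug_interval -1 2>&1 | train_error_rates
```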



Re: [tesseract-ocr] IF I could make .unicharset by box/tif pairs instead of fonts files by tesstrain.sh?

2018-08-27 Thread Shree Devi Kumar
When using tesstrain.sh, you can add --save_box_tiff to the command line.

The original tesstrain.sh did not move the box/tiff files along with the lstmf
files (they remained in the /tmp directory).

I first modified it to move the box/tiff files along with the lstmf files in
all cases.

This option now gives the user the choice of whether to save the box/tiff
pairs. The default is NOT to save them, since they are not needed for
training and are useful just for reference/review.
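
As a demo, the flag is simply appended to an ordinary tesstrain.sh invocation. The sketch below only prints such a command line so it can be reviewed before running; the wrapper name tesstrain_cmd is hypothetical, and the script location, language code and directories are placeholders to adjust for your own setup.

```shell
#!/bin/sh
# Hypothetical wrapper: print a tesstrain.sh command line that keeps the
# box/tiff pairs next to the lstmf files.
# $1 = language code, $2 = output directory (both placeholders)
tesstrain_cmd() {
  printf '%s ' src/training/tesstrain.sh \
    --lang "$1" --linedata_only \
    --langdata_dir ../langdata --tessdata_dir ../tessdata \
    --output_dir "$2" --save_box_tiff
  printf '\n'
}

# Usage (not executed here): print the command, then run it once it looks right.
#   tesstrain_cmd eng ../training/engtrain
```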

On Tue, Aug 28, 2018 at 6:56 AM, 王思远  wrote:

> I see there is a new flag in tesseract/src/training/tesstrain.sh
> in the change on 2018/8/20:
> add variable --save_box_tiff to Save box/tiff pairs along with lstmf files
>
> So how can I use this new flag? Is there a demo that I can refer to?


Re: [tesseract-ocr] Experiment with Thai language

2018-08-31 Thread Shree Devi Kumar
> Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in
> language ''
> I don't know what causes this kind of warning and how to solve it so I just
> continue the training.

These warnings are related to normalization and validation of the training
text. Please see
https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validate_grapheme.cpp
for the rules applied for Thai.



Re: [tesseract-ocr] Which repo should I use? langdata_lstm or langdata?

2018-08-31 Thread Shree Devi Kumar
langdata_lstm has the source training data used for the 4.0.0alpha traineddata
files. However, neither the exact usage of the files nor the scripts to use
them have been provided.

The training text in langdata_lstm is much larger. If you only want to
finetune, you should consider limiting the pages when running tesstrain.sh.

The unicharset is usually built from the training text, so make sure that
the training text has all the required characters in it.
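
One way to check the last point before training is to compare the training text against a list of characters you need. This is only a sketch: the function name missing_chars and the one-character-per-line list file are assumptions, not part of the Tesseract tooling.

```shell
#!/bin/sh
# Hypothetical helper: print every required character that does not occur
# anywhere in the training text.
# $1 = training text file, $2 = file with one required character per line
missing_chars() {
  while IFS= read -r ch; do
    [ -n "$ch" ] || continue                       # skip blank lines
    grep -qF -- "$ch" "$1" || printf '%s\n' "$ch"  # report characters not found
  done < "$2"
}

# Usage (not executed here; required.txt is a placeholder file name):
#   missing_chars ../langdata/sin/sin.training_text required.txt
```

An empty result means every listed character appears at least once in the training text.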

On Fri, Aug 31, 2018 at 5:42 PM,  wrote:

> I've seen there is a new repository in the tesseract-ocr directory called
> *langdata_lstm*. Is it better to use train lstm or should I use the
> simple old *langdata* repository? If *langdata_lstm* is more likely to
> use which *.unicharset* file is recommended for *combine_lang_model*
> script? *langdata_lstm/Latin/Latin.unicharset* or
> *langdata_lstm/Latin.unicharset*?
>
> I did not find any documentation about this.


Re: [tesseract-ocr] Error in creating LSTM training data using tesstrain.sh

2018-09-01 Thread Shree Devi Kumar
> read_params_file: Can't open lstm.train

lstm.train is a config file, and it was not found.

It ships in tesseract/tessdata/configs.

Make sure it is in your tessdata directory or on your path so that it can be
found.
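
A quick way to check this before running tesstrain.sh is sketched below. It assumes the config is expected under a configs subdirectory of the tessdata directory, mirroring the tesseract/tessdata/configs layout mentioned above; the function name has_lstm_train_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if lstm.train exists under the given
# tessdata directory (the directory passed to tesstrain.sh as --tessdata_dir).
has_lstm_train_config() {
  [ -f "$1/configs/lstm.train" ]
}

# Usage (not executed here):
#   has_lstm_train_config ../tessdata || echo "lstm.train is missing" >&2
```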

On Sun, Sep 2, 2018 at 3:40 AM, Shandigutt  wrote:

> Hi,
>
> I was trying to create LSTM training data using tesstrain.sh. I got the
> below error. Can somebody explain me what has gone wrong,
>
> *Command I used:*
> ./src/training/tesstrain.sh --fonts_dir ../Support/font --lang sin
> --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ../tessdata --output_dir ../training/sintrain --fontlist
> "BhashitaComplex" --training_text ../langdata/sin/sin.training_text
>
> *Extract of the output:*
> === Phase E: Generating lstmf files ===
> Using TESSDATA_PREFIX=../tessdata
> [2018 සැප්තැම්බර් 1 වැනි සෙනසුරාදා 21:41:25 +0300]
> /usr/local/bin/tesseract /tmp/sin-2018-09-01.E4T/sin.BhashitaComplex.exp0.tif
> /tmp/sin-2018-09-01.E4T/sin.BhashitaComplex.exp0 --psm 6 lstm.train
> ../langdata/sin/sin.config
> read_params_file: Can't open lstm.train
> Tesseract Open Source OCR Engine v4.0.0-beta.4-74-gd8237 with Leptonica
> Page 1
> Page 2
> Page 3
> ERROR: /tmp/sin-2018-09-01.E4T/sin.BhashitaComplex.exp0.lstmf does not
> exist or is not readable
>
> *For the complete output please see the attached err.txt*
>
> *After executing the command I checked the tmp directory it created. It
> was shown as below,*
>
> tharaka@tharaka-laptop-ubuntu:~$ cd /tmp/sin-2018-09-01.E4T/
> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ ll
> total 776
> drwx--  2 tharaka tharaka   4096 සැප්   1 21:41 ./
> drwxrwxrwt 50 rootroot  4096 සැප්   2 00:10 ../
> -rw-r--r--  1 tharaka tharaka 249413 සැප්   1 21:41
> sin.BhashitaComplex.exp0.box
> -rw-r--r--  1 tharaka tharaka 436290 සැප්   1 21:41
> sin.BhashitaComplex.exp0.tif
> -rw-r--r--  1 tharaka tharaka   9099 සැප්   1 23:27
> sin.BhashitaComplex.exp0.txt
> -rw-r--r--  1 tharaka tharaka   6543 සැප්   1 21:41 sin.unicharset
> -rw-r--r--  1 tharaka tharaka   3053 සැප්   1 21:41 sin.xheights
> -rw-r--r--  1 tharaka tharaka  71704 සැප්   1 23:27 tesstrain.log
> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$
>
> *My tesseract  version:*
> tesseract 4.0.0-beta.4-74-gd8237
>  leptonica-1.77.0
>   libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib
> 1.2.11
>  Found SSE
>
> *My OS details,*
> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 18.04.1 LTS
> Release: 18.04
> Codename: bionic
>
> Appreciate your support on this.
> Thanks