Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

chandra churh chatterjee Thu, 19 Jul 2018 03:37:06 -0700

Environment : Ubuntu 16.04 LTS
------------------------------------------------------------
Check : Running tesseract -v in terminal gives:
________________________________________________

tesseract 4.0.0-beta.1-376-gb1f79
 leptonica-1.74.1
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE
------------------------------------------------------------
DOWNLOAD HANDWRITTEN FONTS FROM fonts.google.com AND TRAIN USING THE
GENERAL PROCEDURE.

THE TEXT CORPUS WAS CREATED BY TWEAKING THE CODE OF create_corpus.py AND
STORING THE RESULT IN corpus.txt, WHICH WAS THEN RENAMED TO
[lang].training_text AND USED TO REPLACE THE EXISTING FILE IN THE
langdata/[lang] DIRECTORY.
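
The modified create_corpus.py was not posted, so the following is only a
minimal sketch of what such a script might look like, assuming the corpus is
nothing more than lines of space-separated random digit groups. The function
name, its parameters and the group sizes are hypothetical choices for
illustration, not taken from the original script.

```python
import random


def make_digit_corpus(num_lines=1000, min_groups=3, max_groups=8,
                      max_group_len=6, seed=0):
    """Return a list of text lines made only of the digits 0-9 and spaces."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(num_lines):
        groups = []
        for _ in range(rng.randint(min_groups, max_groups)):
            length = rng.randint(1, max_group_len)
            groups.append("".join(rng.choice("0123456789")
                                  for _ in range(length)))
        lines.append(" ".join(groups))
    return lines


if __name__ == "__main__":
    # corpus.txt is the file name used in the post; it still has to be
    # renamed to [lang].training_text and moved into langdata/[lang].
    with open("corpus.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(make_digit_corpus()) + "\n")
```

Varying the group lengths is a guess at what helps the model see digits in
different contexts; adjust the parameters to match the kind of digit strings
you expect to recognise.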


[Step 1] Download the required fonts and install them on the system. On a
Linux machine, copy the fonts to the ~/.fonts directory and run
<sudo fc-cache -rv> from there.

[Step 2] List the fonts that can render the training text (these are the
fonts to train tesseract on) by running the following command in bash.
Adjust --fonts_dir if your fonts live elsewhere, e.g. in ~/.fonts from
Step 1:

text2image --find_fonts --fonts_dir /usr/share/fonts \
  --text ./langdata/[lang]/[lang].training_text --min_coverage .9 \
  --outputbase ./langdata/[lang]/[lang] |& grep raw \
  | sed -e 's/ :.*/@ \\/g' | sed -e "s/^/  '/" \
  | sed -e "s/@/'/g" >path/to/langdata/[lang]/fontslist.txt
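
To make it clearer what the grep and the three sed expressions in the command
above do to each line, here is a small Python sketch that mimics them. The
sample input line is hypothetical: it only assumes that text2image prints the
font name followed by " :" and some text containing the word "raw", which is
what the grep and the first sed pattern rely on.

```python
import re


def to_fontslist_entry(line):
    """Apply the same three substitutions as the sed commands to one line."""
    # s/ :.*/@ \\/g : cut everything from " :" onwards, leave "@ \" instead
    line = re.sub(r" :.*", r"@ \\", line)
    # s/^/  '/ : prefix the line with two spaces and an opening quote
    line = "  '" + line
    # s/@/'/g : turn the "@" placeholder into the closing quote
    return line.replace("@", "'")


def filter_and_format(lines):
    """Keep only lines containing 'raw' (the grep step), then format them."""
    return [to_fontslist_entry(ln) for ln in lines if "raw" in ln]


if __name__ == "__main__":
    # Hypothetical output line; the font name is made up for illustration.
    sample = "Hypothetical Hand : raw score = 100.00%"
    print(to_fontslist_entry(sample))
```

Each resulting line is a quoted font name followed by a trailing backslash,
which is the shape needed to paste the list into a shell array in
language-specific.sh.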

[Step 3] Open langdata/[lang]/fontslist.txt, copy its contents and paste
them into "language-specific.sh" under the Latin fonts list. Then write an
entry for each new font following the convention described in

https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-font_properties-file

and add these entries to langdata/font_properties.
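
As a rough illustration of the font_properties convention linked above, each
line is a font name (without spaces) followed by five 0/1 flags in the order
italic, bold, fixed, serif, fraktur. The helper below is a hypothetical
sketch that formats one such line; replacing spaces with underscores is an
assumption based on how names appear in langdata/font_properties, and the
font name in the example is made up.

```python
def font_properties_line(font_name, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one font_properties entry: <name> <italic> <bold> <fixed> <serif> <fraktur>."""
    name = font_name.replace(" ", "_")  # assumption: no spaces allowed in the name
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([name] + [str(int(flag)) for flag in flags])


if __name__ == "__main__":
    # Hypothetical handwriting font, regular weight, no special properties.
    print(font_properties_line("Hypothetical Hand"))
    # The same hypothetical family in bold.
    print(font_properties_line("Hypothetical Hand Bold", bold=True))
```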


[Step 4] Generate starter traineddata by running the following command:

training/tesstrain.sh --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ~/langdata \
  --output_dir ~/tesstutorial/newoutput

[Make sure to give the full path to tesstrain.sh.]


[Step 5] Run lstmtraining on the starter traineddata with the following
command:

training/lstmtraining --debug_interval 0 \
  --traineddata ~/tesstutorial/newoutput/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/newoutput/output/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/newoutput/eng.training_files.txt \
  --max_iterations 10000 &>~/tesstutorial/newoutput/output/basetrain.log

See the official Tesseract 4 wiki for details about all the parameters
that can be specified. This step will take a long time to complete.
--debug_interval should be kept at either 0 or -1 if ScrollView.jar has not
been built. Also make sure the input directory is readable and the output
directory is writable.
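
Since the training run is long, it can be worth checking the directory
permissions mentioned above before starting it. The following is a minimal
sketch using only the Python standard library; the function name is
hypothetical, and the temporary directory in the demo stands in for your own
input and output paths.

```python
import os
import tempfile


def check_training_dirs(input_dir, output_dir):
    """Return a list of problems found; an empty list means both look usable."""
    problems = []
    if not os.path.isdir(input_dir):
        problems.append("input directory does not exist: " + input_dir)
    elif not os.access(input_dir, os.R_OK | os.X_OK):
        problems.append("input directory is not readable: " + input_dir)
    if not os.path.isdir(output_dir):
        problems.append("output directory does not exist: " + output_dir)
    elif not os.access(output_dir, os.W_OK | os.X_OK):
        problems.append("output directory is not writable: " + output_dir)
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory; pass your own paths in practice.
    with tempfile.TemporaryDirectory() as tmp:
        print(check_training_dirs(tmp, tmp))
```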

[Step 6] Create the final traineddata that is used by the software by
running the following command:

training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/newoutput/output/base_checkpoint \
  --traineddata ~/tesstutorial/newoutput/eng/eng.traineddata \
  --model_output ~/tesstutorial/newoutput/output/eng.traineddata

[Again, make sure the complete path to lstmtraining is given to ensure the
proper version is used.]

[Step 7] Rename the eng.traineddata file to digits.traineddata and copy it
to the tessdata directory from which tesseract reads its languages.
To integrate with the Reader (on Windows), copy it to the tessdata directory.

Run from the ~/tesstutorial/digoutput directory:

sudo cp digits.traineddata /usr/share/tesseract-ocr/tessdata/digits.traineddata

ACCURACY ACHIEVED : ~ 90%-95%
HIGHEST ACCURACY : 100%

On Thu, Jul 19, 2018 at 4:02 PM Ramakant Kushwaha <
ramakant.sing...@gmail.com> wrote:

> Thanks @Chandra. I am a beginner at this; please help me with the complete
> documentation.
>
>
> On Thu, Jul 19, 2018 at 3:38 PM, chandra churh chatterjee <
> chandrachurh.chatterje...@gmail.com> wrote:
>
>> I have already used Tesseract 4.0 for training on handwritten
>> digits.
>> The steps are as follows:
>> 1. The best approach is to use some handwritten fonts from Google or
>> anywhere else.
>> 2. Create a text corpus containing only the digits 0-9, generated by a
>> random function, and run the "tesstrain.sh" script on it to generate
>> the starter traineddata.
>> 3. Use the starter traineddata to generate the final traineddata after
>> LSTM training.
>>
>>
>> If you want a detailed description, I can supply you with complete
>> documentation of the steps.
>>
>> Chandra Churh Chatterjee
>>
>>
>> On Tue, Jul 17, 2018, 8:43 PM Ramakant Kushwaha <
>> ramakant.sing...@gmail.com> wrote:
>>
>>> *Hi,*
>>>
>>> *Recently I have been trying to retrain Tesseract 4.0 to recognise
>>> handwritten digits. I am following the official page but finding it very
>>> difficult. It would be great if someone could elaborate on the steps
>>> below.*
>>>
>>>
>>>    - Prepare training text.
>>>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>>    (I am using jTessBoxEditor for creating box files.)
>>>    - Render text to image + box file. (Or create hand-made box files
>>>    for existing image data.)
>>>    - Make unicharset file. (Can be partially specified, i.e. created
>>>    manually.) (I do not know how to do this.)
>>>    - Make a starter traineddata from the unicharset and optional
>>>    dictionary data.
>>>    <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>    - Run tesseract to process image + box file to make training data
>>>    set.
>>>    - Run training on training data set.
>>>    - Combine data files.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>

