Environment : Ubuntu 16.04 LTS
------------------------------------------------------------
Check : Running tesseract -v in the terminal gives:
____________________________________________________________
tesseract 4.0.0-beta.1-376-gb1f79
 leptonica-1.74.1
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found SSE
------------------------------------------------------------

DOWNLOAD HANDWRITTEN FONTS FROM fonts.google.com AND TRAIN USING THE GENERAL PROCEDURE. THE TEXT CORPUS WAS CREATED BY TWEAKING THE CODE OF create_corpus.py AND STORING THE RESULT IN corpus.txt, WHICH WAS THEN RENAMED TO [lang].training_text AND USED TO REPLACE THE EXISTING FILE IN THE langdata/[lang] DIRECTORY.

[Step 1] Download the required fonts and install them on the system. On a Linux machine, copy the fonts to the ~/.fonts directory and run <sudo fc-cache -rv> from there.

[Step 2] List the fonts that cover the training text (at least 90% of its characters, per --min_coverage .9) by running the following command:

    text2image --find_fonts \
      --fonts_dir /usr/share/fonts \
      --text ./langdata/[lang]/[lang].training_text \
      --min_coverage .9 \
      --outputbase ./langdata/[lang]/[lang] \
      |& grep raw \
      | sed -e 's/ :.*/@ \\/g' \
      | sed -e "s/^/ '/" \
      | sed -e "s/@/'/g" >path/to/langdata/[lang]/fontslist.txt

[--fonts_dir should point to the directory that actually holds the fonts; if they were copied to ~/.fonts in Step 1, use that path instead of /usr/share/fonts.]

[Step 3] Open langdata/[lang]/fontslist.txt and copy its contents. Paste them into "language-specific.sh" under the Latin fonts. Then write an entry for each new font following the convention described at https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-font_properties-file and add these entries to langdata/font_properties.

[Step 4] Generate the starter traineddata by running the following command:
    training/tesstrain.sh --lang eng --linedata_only \
      --noextract_font_properties \
      --langdata_dir ~/langdata \
      --output_dir ~/tesstutorial/newoutput

[Make sure to give the full path to tesstrain.sh.]

[Step 5] Run lstmtraining on the starter traineddata with the following command:

    training/lstmtraining --debug_interval 0 \
      --traineddata ~/tesstutorial/newoutput/eng/eng.traineddata \
      --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
      --model_output ~/tesstutorial/newoutput/output/base \
      --learning_rate 20e-4 \
      --train_listfile ~/tesstutorial/newoutput/eng.training_files.txt \
      --max_iterations 10000 &>~/tesstutorial/newoutput/output/basetrain.log

See the official Tesseract 4 wiki for details on all the parameters that can be specified. This step takes a long time to complete. Keep --debug_interval at 0 or -1 if ScrollView.jar has not been built. Also make sure the input directories are readable and the output directories are writable.

[Step 6] Create the final traineddata that is used by the software by running the following command:

    training/lstmtraining --stop_training \
      --continue_from ~/tesstutorial/newoutput/output/base_checkpoint \
      --traineddata ~/tesstutorial/newoutput/eng/eng.traineddata \
      --model_output ~/tesstutorial/newoutput/output/eng.traineddata

[Again, give the complete path to lstmtraining so that the proper version is used.]

[Step 7] Rename the eng.traineddata file to digits.traineddata and copy it to the tessdata directory from which tesseract reads its languages. To integrate with the Reader (on Windows), copy it to the tessdata directory. Run from the ~/tesstutorial/newoutput/output directory (the --model_output location used in Step 6):

    sudo cp digits.traineddata /usr/share/tesseract-ocr/tessdata/digits.traineddata

ACCURACY ACHIEVED : ~ 90%-95%
HIGHEST ACCURACY : 100%

On Thu, Jul 19, 2018 at 4:02 PM Ramakant Kushwaha <ramakant.sing...@gmail.com> wrote:

> Thanks @Chandra, I am a beginner at this. Please help me with the complete
> documentation.
>
> On Thu, Jul 19, 2018 at 3:38 PM, chandra churh chatterjee <chandrachurh.chatterje...@gmail.com> wrote:
>
>> I have already used Tesseract 4.0 for training on handwritten digits.
>> The steps are as follows:
>> 1. The best way is to use some handwritten fonts from Google or anywhere else.
>> 2. Use the "tesstrain.sh" script to generate the starter traineddata from a text corpus containing only the digits 0-9, generated with a random function; create such a text corpus and generate the starter traineddata.
>> 3. Use the starter traineddata to generate the final traineddata after LSTM training.
>>
>> If you want a detailed description, I can supply you with complete documentation of the steps.
>>
>> Chandra Churh Chatterjee
>>
>> On Tue, Jul 17, 2018, 8:43 PM Ramakant Kushwaha <ramakant.sing...@gmail.com> wrote:
>>
>>> *Hi,*
>>>
>>> *Recently I have been trying to retrain Tesseract 4.0 to recognise handwritten digits. I am following the official page but finding it very difficult. It would be great if someone could elaborate on the steps below.*
>>>
>>> - Prepare training text. <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951> (I am using jTessBoxEditor for creating box files.)
>>> - Render text to image + box file. (Or create hand-made box files for existing image data.)
>>> - Make unicharset file. (Can be partially specified, i.e. created manually.) (I do not know how to do this.)
>>> - Make a starter traineddata from the unicharset and optional dictionary data. <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>> - Run tesseract to process image + box file to make training data set.
>>> - Run training on training data set.
>>> - Combine data files.
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
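
For reference, here is a minimal sketch of how a digits-only training text like the corpus.txt described in the steps could be produced. It is not the original create_corpus.py, which is not shown in this thread. The function names (make_digit_corpus, write_corpus), the line and token counts, and the seed are all assumptions chosen for illustration; adjust them to your own data.

```python
import random

DIGITS = "0123456789"


def make_digit_corpus(num_lines=1000, tokens_per_line=10, max_token_len=6, seed=0):
    """Return a list of text lines made of space-separated random digit strings.

    All parameters are illustrative assumptions, not values from the thread.
    """
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(num_lines):
        tokens = []
        for _ in range(tokens_per_line):
            # Each token is 1..max_token_len digits drawn uniformly from 0-9.
            length = rng.randint(1, max_token_len)
            tokens.append("".join(rng.choice(DIGITS) for _ in range(length)))
        lines.append(" ".join(tokens))
    return lines


def write_corpus(path, lines):
    """Write the corpus lines to a UTF-8 text file, one line per row."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling write_corpus("corpus.txt", make_digit_corpus()) would write the file, which can then be renamed to [lang].training_text as described in the steps. Varying the token lengths is only one possible choice; the idea is that the rendered lines contain every digit in many different contexts.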