[tesseract-ocr] I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

이경준 Tue, 27 Feb 2018 23:21:41 -0800

Hi, I'm studying this passage, but I cannot understand what the flag
"--noextract_font_properties" means. I looked in the file
/tesseract/training/tesstrain.sh, but I cannot find
"--noextract_font_properties" there.
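
To rule out that the flag lives in another script next to tesstrain.sh, one could search the whole training directory instead of a single file. This is only a minimal sketch: the helper name find_flag and the example path are hypothetical, and it only locates occurrences of the string; it does not explain what the flag does.

```shell
# Hypothetical helper: list every line under a directory that mentions a flag name.
# Both arguments are placeholders; adjust them to your own checkout.
find_flag() {
  flag="$1"
  dir="$2"
  # -r: recurse into subdirectories
  # -n: print line numbers
  # -F: treat the pattern as a fixed string, not a regex
  # -e: lets the pattern start with dashes without being read as an option
  grep -rnF -e "$flag" "$dir"
}

# Example call (the path is an assumption about where the source is checked out):
# find_flag "noextract_font_properties" tesseract/training
```

Each match is printed as file:line:text, so it is easy to see which script, if any, handles the flag.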

Here is the usage text from that file:

# USAGE:
#
# tesstrain.sh
# --fontlist FONTS # A list of fontnames to train on.
# --fonts_dir FONTS_PATH # Path to font files.
# --lang LANG_CODE # ISO 639 code.
# --langdata_dir DATADIR # Path to tesseract/training/langdata directory.
# --output_dir OUTPUTDIR # Location of output traineddata file.
# --overwrite # Safe to overwrite files in output_dir.
# --linedata_only # Only generate training data for lstmtraining.
# --run_shape_clustering # Run shape clustering (use for Indic langs).
# --exposures EXPOSURES # A list of exposure levels to use (e.g. "-1 0 1").
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
# --training_text TEXTFILE # Text to render and use for training.
# --wordlist WORDFILE # Word list for the language ordered by
# # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
# --tessdata_dir TESSDATADIR # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.

Using tesstrain

The setup for running tesstrain.sh
<https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh>
is the same as for base Tesseract. Use the --linedata_only option for LSTM
training. Note that it is beneficial to have more training text and make
more pages though, as neural nets don't generalize as well and need to
train on something similar to what they will be running on. If the target
domain is severely limited, then all the dire warnings about needing a lot
of training data may not apply, but the network specification may need to
be changed.

Training data is created using tesstrain.sh
<https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh>
as follows. Note that your fonts location may vary.

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

Thank you very much. I would appreciate a reply from anyone.

