Re: [tesseract-ocr] Diacriticals Training

Shree Devi Kumar Mon, 14 Dec 2020 01:47:23 -0800

Appreciate your offer to help and provide feedback as well as training data.

Let me try to answer your queries:

1. > I have been using san. But was unaware that you can also use
Devanagari. What is the difference?

san has been trained for Sanskrit. But it is missing certain Devanagari
characters. See https://github.com/tesseract-ocr/tessdata/issues/64

script/Devanagari has been trained for san, hin, mar, nep and eng. So it
includes all the characters missing from san, though its language model is
not strictly for san.

2. >>These have the float models, to improve speed they can be compressed
using `combine_tessdata -c`

Tesseract has two kinds of traineddata files, those with
best/float/double models and those with fast/integer models.

tessdata_best repo has the best/float/double models. These have better
accuracy but are much slower. These can be used as START_MODEL for further
finetune training.

tessdata_fast repo has fast/integer models. These are 'best value for
money' models and are the ones included in the official distributions. They
have slightly less accuracy but are much faster.

The traineddata files I uploaded were only the `best/float` models
after finetune training. These can be compressed to `fast/integer` models
using the command

`combine_tessdata -c my.traineddata`

I will upload the fast versions also to the repo so that both types are
available without the need for the extra step.
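
Until then, here is a minimal sketch of the conversion step, assuming
`combine_tessdata` is on the PATH. As far as I know, `combine_tessdata -c`
rewrites the file it is given, so the helper below works on a copy and leaves
the best/float model untouched for further finetuning. The function name
`make_fast_copy` and the directory names are hypothetical, chosen only for
illustration.

```shell
#!/usr/bin/env bash
# make_fast_copy BEST_FILE OUT_DIR
# Hypothetical helper: copies a best/float traineddata file into OUT_DIR and
# converts the copy to a fast/integer model, keeping the original intact.
make_fast_copy() {
  local best_file="$1" out_dir="$2"
  local fast_file
  fast_file="$out_dir/$(basename "$best_file")"
  mkdir -p "$out_dir" || return 1
  cp -- "$best_file" "$fast_file" || return 1
  # -c compacts the LSTM model inside the given file to integer.
  # Tool messages go to stderr so that stdout carries only the result path.
  combine_tessdata -c "$fast_file" >&2 || return 1
  echo "$fast_file"
}

# Usage (paths are examples only):
# make_fast_copy tessdata_best/my.traineddata tessdata_fast
```

Keeping the float file matters because only the best/float models can serve
as START_MODEL for further finetuning.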

3. >> I’m not sure exactly what to do with these links or the files they
access?

See https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md and
https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling.md#language-data

The traineddata files are the files in the tessdata folder, e.g.
eng.traineddata, san.traineddata, script/Devanagari.traineddata

https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST/tree/main/data/tessdata_best
has links to traineddata files after different runs of finetuning.

Sample script for Linux, assuming the finetuned traineddata files are in
$HOME/tess5training-iast/tessdata:

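```
#!/usr/bin/env bash
# Run OCR on every image one directory level down, using the finetuned model(s).
shopt -s nullglob   # skip extensions that match no files
TESSDATA_DIR="$HOME/tess5training-iast/tessdata"
PSM=3
for my_file in */*.{jpg,tif,tiff,png,jp2,gif}; do
  # MODEL rather than LANG, so the locale variable LANG is left alone
  for MODEL in Sanskrit-1017; do
    echo -e "\n ***** $my_file LANG $MODEL PSM $PSM ****"
    # Output goes next to the image, with the image extension stripped
    OMP_THREAD_LIMIT=1 tesseract "$my_file" "${my_file%.*}" --oem 1 \
      --psm "$PSM" -l "$MODEL" --dpi 300 --tessdata-dir "$TESSDATA_DIR" \
      -c page_separator='' \
      -c tessedit_char_blacklist='¢£¥€₹™$¬©®¶‡†&@'
  done
done
```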
4. > tell me how to make “actual line images” and “groundtruth transcription”?

To train with the tesstrain repo, we use single-line images and their
groundtruth transcriptions as UTF-8 text.

Each image and its transcription must share the same basename, with the
groundtruth file using the extension .gt.txt

Example
https://github.com/Shreeshrii/tesstrain-sanPlusMinus/blob/master/data/sanPlusMinus-ground-truth/Adishila/san.Adishila.0000001.exp0_0.png

https://github.com/Shreeshrii/tesstrain-sanPlusMinus/blob/master/data/sanPlusMinus-ground-truth/Adishila/san.Adishila.0000001.exp0_0.gt.txt
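
Before training, it can help to check that every line image has its
transcription and vice versa. The following is only a sketch: the function
name `list_unpaired` is hypothetical, and it assumes the line images are plain
.png or .tif files sitting in one folder next to their .gt.txt files.

```shell
#!/usr/bin/env bash
# list_unpaired GT_DIR
# Hypothetical helper: prints every line image in GT_DIR that has no
# matching .gt.txt transcription, and every .gt.txt that has no image.
list_unpaired() {
  local dir="$1" f base
  for f in "$dir"/*.png "$dir"/*.tif; do
    [ -e "$f" ] || continue            # pattern matched nothing
    base="${f%.*}"                     # strip the image extension
    [ -f "$base.gt.txt" ] || echo "missing transcription: $f"
  done
  for f in "$dir"/*.gt.txt; do
    [ -e "$f" ] || continue
    base="${f%.gt.txt}"
    [ -f "$base.png" ] || [ -f "$base.tif" ] || echo "missing image: $f"
  done
}

# Usage (folder name is an example only):
# list_unpaired data/sanPlusMinus-ground-truth/Adishila
```

No output means every file in the folder is paired.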

I have generated a lot of synthetic data using fonts and training text. It
will be useful to have line images from scanned pages with their
transcription. These can be used first to evaluate the different models and
also for further finetuning.
