Appreciate your offer to help and provide feedback as well as training data.
Let me try to answer your queries:

1. > I have been using san. But was unaware that you can also use Devanagari. What is the difference?

san has been trained for Sanskrit, but it is missing certain Devanagari characters. See https://github.com/tesseract-ocr/tessdata/issues/64

script/Devanagari has been trained for san, hin, mar, nep and eng. So the missing characters are all trained in it, though its language model is not strictly for san.

2. >> These have the float models, to improve speed they can be compressed using `combine_tessdata -c`

Tesseract has two kinds of traineddata files: those with best/float/double models and those with fast/integer models.

The tessdata_best repo has the best/float/double models. These have better accuracy but are much slower. They can be used as START_MODEL for further finetune training.

The tessdata_fast repo has the fast/integer models. These are the 'best value for money' models and are the ones included in the official distributions. They have slightly less accuracy but are much faster.

The traineddata files I had uploaded were only the `best/float` models after finetune training. These can be compressed to `fast/integer` models using the command

`combine_tessdata -c my.traineddata`

I will upload the fast versions to the repo as well, so that both types are available without the need for the extra step.

3. >> I’m not sure exactly what to do with these links or the files they access?

See https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md and https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling.md#language-data

The traineddata files are the files in the tessdata folder, e.g. eng.traineddata, san.traineddata, script/Devanagari.traineddata.

https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST/tree/main/data/tessdata_best has links to the traineddata files after different runs of finetuning.
Sample script on Linux, if the finetuned traineddata files are in $HOME/tess5training-iast/tessdata:

```
shopt -s nullglob   # skip patterns that match no files
PSM=3               # page segmentation mode, also shown in the progress line
for my_file in */*.{jpg,tif,tiff,png,jp2,gif}; do
    # lowercase name, so the LANG locale variable is not overwritten
    for lang in Sanskrit-1017; do
        echo -e "\n ***** $my_file LANG $lang PSM $PSM ****"
        OMP_THREAD_LIMIT=1 tesseract "$my_file" "${my_file%.*}" \
            --oem 1 --psm "$PSM" -l "$lang" --dpi 300 \
            --tessdata-dir "$HOME/tess5training-iast/tessdata" \
            -c page_separator='' \
            -c tessedit_char_blacklist='¢£¥€₹™$¬©®¶‡†&@'
    done
done
```

4. >> tell me how to make “actual line images” and “groundtruth transcription”?

For training with the tesstrain repo, we use single-line images and their ground-truth transcription as UTF-8 text. The image and its transcription need to have the same basename, with the ground-truth file using the extension .gt.txt

Example:

https://github.com/Shreeshrii/tesstrain-sanPlusMinus/blob/master/data/sanPlusMinus-ground-truth/Adishila/san.Adishila.0000001.exp0_0.png
https://github.com/Shreeshrii/tesstrain-sanPlusMinus/blob/master/data/sanPlusMinus-ground-truth/Adishila/san.Adishila.0000001.exp0_0.gt.txt

I have generated a lot of synthetic data using fonts and training text. It will be useful to have line images from scanned pages with their transcription. These can be used first to evaluate the different models and also for further finetuning.
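Before sending or using line images for training, it may help to confirm that every image has its matching .gt.txt file. The following is only a rough sketch and is not part of tesstrain: the function name `check_ground_truth` is made up for illustration, and it assumes the line images are .png or .tif files sitting directly inside one folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesstrain): report line images that have
# no matching, non-empty .gt.txt transcription next to them.
# Assumption: images are .png or .tif files directly inside the given folder.
check_ground_truth() {
    local dir=$1 missing=0 img
    for img in "$dir"/*.png "$dir"/*.tif; do
        [ -e "$img" ] || continue            # the pattern matched nothing
        # same basename as the image, with the extension replaced by .gt.txt
        if [ ! -s "${img%.*}.gt.txt" ]; then
            echo "missing or empty transcription: ${img%.*}.gt.txt"
            missing=$((missing + 1))
        fi
    done
    echo "$missing image(s) without ground truth"
    [ "$missing" -eq 0 ]                     # non-zero status if any are missing
}
```

It could be called as `check_ground_truth data/sanPlusMinus-ground-truth/Adishila` from a checkout of the example repo above. The exit status is non-zero when any transcription is missing or empty.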