Which version do you use?

On Tuesday, August 17, 2021 at 1:18:22 AM UTC+8, <samee...@gmail.com> wrote:
> Hello, I am trying to train from scratch/fine-tune tesseract for the "Jameel
> Noori Nastaleeq" font for Urdu. The steps I did for training from scratch:
>
> 1. Create unicharset from all groundtruth files:
> ```
> unicharset_extractor --output_unicharset file.unicharset --norm_mode 3 file
> ```
> 2. Create starter traineddata using the above unicharset:
> ```
> combine_lang_model --input_unicharset file.unicharset --script_dir "langdata/" --output_dir "output/" --lang JNUrd
> ```
> 3. Create wordstrbox for each image:
> ```
> tesseract file1.png file1 --psm 6 wordstrbox
> ```
> 4. Manually correct the wordstrbox files using the ground truth.
> 5. Create an lstmf file from each png and its corresponding box file:
> ```
> tesseract file.png file --psm 6 lstm.train
> ```
> 6. Create a list of lstmf files to use for training:
> ```
> ls *.lstmf -1 > mylang.trainingfiles_text
> ```
>
> With the unicharset and the .lstmf files, I am getting this error on the
> training step:
> ```
> Encoding of string failed! Failure bytes: ffffffd9 ffffff8a ffffffd9
> ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88 ffffffd8 ffffffb2 ffffffdb
> ffffff8c ffffffd8 ffffffb1 20 ffffffd8 ffffffae ffffffd8 ffffffa7 ffffffd8
> ffffffb1 ffffffd8 ffffffac ffffffdb ffffff81 20 ffffffd8 ffffffb4 ffffffd8
> ffffffa7 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd8 ffffffad ffffffd9
> ffffff85 ffffffd9 ffffff88 ffffffd8 ffffffaf 20 ffffffd9 ffffff82 ffffffd8
> ffffffb1 ffffffdb ffffff8c ffffffd8 ffffffb4 ffffffdb ffffff8c 20 ffffffd9
> ffffff86 ffffffdb ffffff92 20 ffffffd8 ffffffa8 ffffffd8 ffffffaa ffffffd8
> ffffffa7 ffffffdb ffffff8c ffffffd8 ffffffa7 20 ffffffda ffffffa9 ffffffdb
> ffffff81 20 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd9
> ffffff82 ffffffd8 ffffffa7 ffffffd8 ffffffaa
>
> Can't encode transcription: 'بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ شاہ محمود قریشی نے بتایا کہ ملاقات' in language ''
> ```
>
> I have tried normalizing the text using the normalize.py file.
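
It may also help to decode the "Failure bytes" from the error to see at which character the encoding stops. Below is a minimal sketch, assuming the dump is UTF-8 with each byte printed sign-extended (`ffffffd9` standing for `0xd9`). The helper names `decode_failure_bytes` and `describe` are my own and not part of tesseract.

```python
import re
import unicodedata


def decode_failure_bytes(dump):
    """Turn a dump such as 'ffffffd9 ffffff8a 20' back into text.

    Assumption: every token is one byte printed as a sign-extended hex
    value, so only the low 8 bits matter. Non-hex tokens (for example
    '>' quote markers pasted from an email) are ignored.
    """
    tokens = re.findall(r"\b[0-9a-fA-F]{2,8}\b", dump)
    raw = bytes(int(tok, 16) & 0xFF for tok in tokens)
    return raw.decode("utf-8", errors="replace")


def describe(text):
    """List each character as 'U+XXXX NAME' for easy inspection."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # First few bytes of the dump from the error message above.
    sample = "ffffffd9 ffffff8a ffffffd9 > ffffff94 ffffffdb ffffff92 20"
    for line in describe(decode_failure_bytes(sample)):
        print(line)
```

For the dump in your message, the failing part starts with U+064A followed by U+0654 (yeh plus a combining hamza above), which is the decomposed form of U+0626. That could mean the ground truth and the unicharset do not agree on normalization for that sequence, but I have not verified this against your files, so treat it as a guess to check.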