[tesseract-ocr] Re: Can't encode transcription

红狮子 Thu, 21 Oct 2021 05:40:37 -0700

Which version of Tesseract are you using?

On Tuesday, August 17, 2021 at 1:18:22 AM UTC+8, <samee...@gmail.com> wrote:


> Hello, I am trying to train from scratch/fine-tune Tesseract for the "Jameel 
> Noori Nastaleeq" font for Urdu. The steps I followed for training from scratch:
> 1. Create unicharset from all groundtruth files:
> ```
> unicharset_extractor --output_unicharset file.unicharset --norm_mode 3 file
> ```
> 2. Create starter traineddata using above unicharset
> ```
> combine_lang_model --input_unicharset file.unicharset --script_dir 
> "langdata/" --output_dir "output/" --lang JNUrd
> ```
> 3. Create wordstrbox for each image
> ```
> tesseract file1.png file1 --psm 6 wordstrbox
> ```
> 4. Manually correct wordstrbox files using the ground truth
> 5. Create lstmf file from each png and its corresponding box file
> ```
> tesseract file.png file --psm 6 lstm.train
> ```
> 6. Create list of lstmf files to use for training
> ```
> ls -1 *.lstmf > mylang.trainingfiles_text
> ```
> At the training step, using the unicharset and the .lstmf files, I am getting 
> this error:
> ```
> Encoding of string failed! Failure bytes: ffffffd9 ffffff8a ffffffd9 
> ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88 ffffffd8 ffffffb2 ffffffdb 
> ffffff8c ffffffd8 ffffffb1 20 ffffffd8 ffffffae ffffffd8 ffffffa7 ffffffd8 
> ffffffb1 ffffffd8 ffffffac ffffffdb ffffff81 20 ffffffd8 ffffffb4 ffffffd8 
> ffffffa7 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd8 ffffffad ffffffd9 
> ffffff85 ffffffd9 ffffff88 ffffffd8 ffffffaf 20 ffffffd9 ffffff82 ffffffd8 
> ffffffb1 ffffffdb ffffff8c ffffffd8 ffffffb4 ffffffdb ffffff8c 20 ffffffd9 
> ffffff86 ffffffdb ffffff92 20 ffffffd8 ffffffa8 ffffffd8 ffffffaa ffffffd8 
> ffffffa7 ffffffdb ffffff8c ffffffd8 ffffffa7 20 ffffffda ffffffa9 ffffffdb 
> ffffff81 20 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd9 
> ffffff82 ffffffd8 ffffffa7 ffffffd8 ffffffaa
>
> Can't encode transcription: 'بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ 
> شاہ محمود قریشی نے بتایا کہ ملاقات' in language ''
> ```
>
> I have tried normalizing the text using the normalize.py file.
>
>
>
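
The "Failure bytes" in the quoted error are the UTF-8 bytes of the part of the
transcription that could not be encoded, printed sign-extended (ffffffd9 is
byte d9). Decoding them shows the failing span starts at U+064A followed by
U+0654, which might mean one of those code points is missing from the
unicharset, though that is only a guess from the log. Below is a minimal Python
sketch for checking this. The helper names are hypothetical, the toy unicharset
is made up for illustration, and the parsing assumes a unicharset is a count
line followed by one entry per line whose first space-separated field is the
character.

```python
import unicodedata


def decode_failure_bytes(dump):
    """Turn a 'Failure bytes' hex dump back into text.

    Bytes >= 0x80 appear sign-extended (e.g. 'ffffffd9'), so only the low
    8 bits of each token are kept before decoding as UTF-8.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")


def unicharset_entries(text):
    """Collect the first field of every entry line (assumed file layout)."""
    lines = text.splitlines()
    return {line.split(" ")[0] for line in lines[1:] if line.strip()}


def uncovered_code_points(text, entries):
    """List code points of `text` that occur in no unicharset entry."""
    covered = set("".join(entries))
    missing = []
    for ch in text:
        if ch.isspace() or ch in covered or ch in missing:
            continue
        missing.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in missing]


# First tokens of the dump from the error message in this thread.
sample = "ffffffd9 ffffff8a ffffffd9 ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88"
decoded = decode_failure_bytes(sample)

# Toy unicharset for illustration only; a real one has more fields per line.
toy_unicharset = "3\nNULL 0 Common 0\n\u0648 1\n\u06d2 1\n"
for code_point, name in uncovered_code_points(decoded, unicharset_entries(toy_unicharset)):
    print(code_point, name)
```

With a real file, pass the contents of file.unicharset (read as UTF-8) to
unicharset_entries and the full dump to decode_failure_bytes. Any code point
reported as uncovered is a candidate for a normalization mismatch between the
ground truth and the text the unicharset was built from.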

