[tesseract-ocr] Can't encode transcription

Samee Arif Mon, 16 Aug 2021 10:16:59 -0700


Hello, I am trying to train from scratch or fine-tune Tesseract for the
"Jameel Noori Nastaleeq" font for Urdu. The steps I followed for training
from scratch:
1. Create unicharset from all groundtruth files:
```
unicharset_extractor --output_unicharset file.unicharset --norm_mode 3 file
```
2. Create starter traineddata using the above unicharset:
```
combine_lang_model --input_unicharset file.unicharset --script_dir "langdata/" \
  --output_dir "output/" --lang JNUrd
```
3. Create wordstrbox for each image
```
tesseract file1.png file1 --psm 6 wordstrbox
```
4. Manually correct wordstrbox files using the ground truth
5. Create lstmf file from each png and its corresponding box file
```
tesseract file.png file --psm 6 lstm.train
```
6. Create list of lstmf files to use for training
```
ls *.lstmf -1 > mylang.trainingfiles_text
```
When I run the training step with this unicharset and the .lstmf files, I get
this error:
```
Encoding of string failed! Failure bytes: ffffffd9 ffffff8a ffffffd9
ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88 ffffffd8 ffffffb2 ffffffdb
ffffff8c ffffffd8 ffffffb1 20 ffffffd8 ffffffae ffffffd8 ffffffa7 ffffffd8
ffffffb1 ffffffd8 ffffffac ffffffdb ffffff81 20 ffffffd8 ffffffb4 ffffffd8
ffffffa7 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd8 ffffffad ffffffd9
ffffff85 ffffffd9 ffffff88 ffffffd8 ffffffaf 20 ffffffd9 ffffff82 ffffffd8
ffffffb1 ffffffdb ffffff8c ffffffd8 ffffffb4 ffffffdb ffffff8c 20 ffffffd9
ffffff86 ffffffdb ffffff92 20 ffffffd8 ffffffa8 ffffffd8 ffffffaa ffffffd8
ffffffa7 ffffffdb ffffff8c ffffffd8 ffffffa7 20 ffffffda ffffffa9 ffffffdb
ffffff81 20 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd9
ffffff82 ffffffd8 ffffffa7 ffffffd8 ffffffaa

Can't encode transcription: 'بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ
شاہ محمود قریشی نے بتایا کہ ملاقات' in language ''
```
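
The byte dump in the error is easier to read once it is decoded back into characters. Below is a minimal sketch in plain Python (not part of Tesseract's tooling) that does this. It assumes the `ffffff` prefix is only sign extension of bytes at or above 0x80, which is inferred from the values in the log rather than from Tesseract documentation. The helper names `decode_failure_bytes` and `describe` are hypothetical.

```python
import unicodedata


def decode_failure_bytes(log_text):
    """Turn the hex tokens from the 'Failure bytes' line into a Unicode string."""
    # Each token looks like a byte printed as a signed value (0xd9 -> ffffffd9),
    # so keep only the low 8 bits before decoding as UTF-8.
    raw = bytes(int(token, 16) & 0xFF for token in log_text.split())
    return raw.decode("utf-8")


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# First seven tokens of the failure bytes reported above.
sample = "ffffffd9 ffffff8a ffffffd9 ffffff94 ffffffdb ffffff92 20"
decoded = decode_failure_bytes(sample)
for code_point, name in describe(decoded):
    print(code_point, name)
# U+064A ARABIC LETTER YEH
# U+0654 ARABIC HAMZA ABOVE
# U+06D2 ARABIC LETTER YEH BARREE
# U+0020 SPACE
```

The failing run begins with U+064A followed by U+0654, which is the canonical decomposition of U+0626 (ARABIC LETTER YEH WITH HAMZA ABOVE). It may be worth checking whether the unicharset contains these code points in the same form as the transcription.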

I have tried normalizing the text using the normalize.py file.
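
One way to check whether normalization changes anything is to list the code points of a ground-truth line that have no entry in the unicharset, once per normalization form. The following is a minimal sketch in plain Python, not part of Tesseract's tooling. It assumes the unicharset file has an entry count on its first line and then one entry per line with the unichar as the first space-separated field, and that with `--norm_mode 3` each entry is a single code point. The toy unicharset text, its `<properties>` placeholder, and the helper names are hypothetical.

```python
import unicodedata


def load_unichars(unicharset_text):
    """Collect the first field of every entry line, skipping the count line."""
    lines = unicharset_text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_code_points(text, unichars):
    """Return the non-space characters of text that have no unicharset entry."""
    missing = []
    for ch in text:
        if ch.isspace():
            continue
        if ch not in unichars and ch not in missing:
            missing.append(ch)
    return missing


def report(text, unichars):
    """Print the missing code points for the NFC and NFD forms of text."""
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, text)
        names = [
            f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
            for ch in missing_code_points(normalized, unichars)
        ]
        print(form, "missing:", names if names else "none")


# Toy unicharset (assumption): only the precomposed U+0626 and U+06D2 are present.
toy_unicharset = "3\nNULL <properties>\n\u0626 <properties>\n\u06d2 <properties>\n"
unichars = load_unichars(toy_unicharset)

# The decomposed sequence U+064A U+0654 followed by U+06D2.
report("\u064a\u0654\u06d2", unichars)
```

With real data, the unicharset text would come from reading `file.unicharset` as UTF-8 and the input line from a ground-truth file. A line that is clean in one normalization form but not in the other suggests the training text and the unicharset were normalized differently.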
