Hi,


My use case is Arabic documents. The pretrained ara.traineddata is good but 
not perfect, so I would like to fine-tune ara.traineddata; if the results 
are still not satisfying, I will train my own custom model.


Please advise on the following:

   1. For my Arabic text use case, one particular character is always 
   predicted incorrectly. Do I need to add the document's font (Traditional 
   Arabic) and train with it? If so, please provide the procedure, or a 
   link, for adding a font when fine-tuning ara.traineddata.
   2. When fine-tuning or training from scratch, how many .gt.txt files do 
   I need, and how many characters should each file contain? Also, roughly 
   how many iterations, if you know?
   3. For numbers, the prediction on Arabic numerals is completely wrong. 
   Do I need to start from scratch or can I fine-tune? Either way, how do I 
   prepare the dataset?
   4. How do I decide max_iterations? Is there a ratio between dataset size 
   and number of iterations?


*Below are my trials:*


*For Arabic Numbers:*


-> I tried to custom-train only Arabic numbers.
-> I wrote a script that writes 100,000 numbers into multiple .gt.txt 
files, with a few hundred characters in each .gt.txt file (a rough sketch 
of this script is shown after the parameters below).
-> Then another script converts each text file to an image with text2image, 
so that the result looks more like a scanned image.
-> The parameters were used in the order below.

text2image --text test.gt.txt --outputbase /home/user/output \
    --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' \
    --degrade_image false --rotate_image --exposure 2 --resolution 300
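
Roughly, the generation script does something like the following 
(simplified sketch; the output directory, loop count, and font settings 
here are placeholders rather than my exact values):

#!/bin/bash
# Sketch: write random Arabic-Indic numeral lines into .gt.txt files and
# render each one with text2image (directory and count are placeholders).
OUT=/home/user/numbers-gt
mkdir -p "$OUT"
digits=(٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
for i in $(seq 1 1000); do
    n=$((RANDOM * RANDOM))          # random integer, Western digits
    s=""
    for (( k = 0; k < ${#n}; k++ )); do
        s+=${digits[${n:k:1}]}      # map each digit to its Arabic-Indic glyph
    done
    echo "$s" > "$OUT/num_$i.gt.txt"
    text2image --text "$OUT/num_$i.gt.txt" --outputbase "$OUT/num_$i" \
        --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' \
        --degrade_image false --rotate_image --exposure 2 --resolution 300
done

Each .gt.txt then holds one numeral string, and text2image writes the 
matching .tif and .box files next to it.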

   1. How much data do I need to prepare for Arabic numbers? For now I only 
   need two specific fonts, which I already have.
   2. Will the dataset contain duplicates if I follow this procedure? If 
   so, is there a way to avoid them?
   3. Is it better to create more .gt.txt files with fewer characters in 
   each (e.g. 50,000 .gt.txt files with 10 numbers in each file) or fewer 
   .gt.txt files with more characters (e.g. 1,000 .gt.txt files with 500 
   numbers in each file)?

If possible, please guide me through the procedure for dataset preparation.

For testing, I tried 50,000 English numbers with one number per .gt.txt 
file (e.g. the text "2500" written to 2500.gt.txt) and trained for 20,000 
iterations, but it failed.
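
My understanding of the expected layout (please correct me if this is 
wrong) is that tesstrain pairs each .gt.txt with a line image of the same 
base name inside the ground-truth directory, for example:

data/foo-ground-truth/
    2500.gt.txt    (one text line, e.g. "2500")
    2500.tif       (the rendered image of that same line)
    2501.gt.txt
    2501.tif
    ...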


*For Arabic Text:*


-> Prepared around 23k .gt.txt files, each containing one sentence.

-> Generated .box files and small .tif files for all the .gt.txt files 
using one font (Traditional Arabic).

-> Used the tesstrain Git repository and trained for 20,000 iterations (the 
rough invocation is shown after this list).

-> After training, generated foo.traineddata with a 0.03 error rate.

-> Ran prediction on the real data: it works perfectly for the particular 
character on which the pretrained ara.traineddata fails, but in terms of 
overall accuracy the pretrained ara.traineddata performs better, except for 
that one character.
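
The training invocation was along these lines (paths simplified; the 
.gt.txt/.tif pairs were placed under data/foo-ground-truth/ first):

git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
# copy the .gt.txt / .tif pairs into data/foo-ground-truth/ beforehand
make training MODEL_NAME=foo MAX_ITERATIONS=20000
# result: data/foo.traineddata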



*Summary:*



   - How can I fix one character in the pretrained model (ara.traineddata)? 
   If that is not possible, how do I custom-train from scratch? Or is there 
   a way to annotate real images and prepare a dataset from them? Please 
   suggest the best practice. (A sketch of the fine-tuning command I have 
   in mind is shown after this list.)
   - How do I prepare an Arabic-number dataset and train on it? If custom 
   training on numbers alone is not possible, can Arabic numbers be added 
   to the pretrained model (ara.traineddata)?
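
For the first point, this is the kind of fine-tuning run I have in mind 
with tesstrain, in case it is the right direction (model name, iteration 
count, and tessdata path are placeholders; as far as I understand, 
START_MODEL needs the "best" float ara.traineddata, since the fast/integer 
models cannot be fine-tuned):

# ground truth containing the problem character (and Arabic numerals) would
# go in data/ara_finetuned-ground-truth/
make training MODEL_NAME=ara_finetuned START_MODEL=ara \
    TESSDATA=/path/to/tessdata_best MAX_ITERATIONS=5000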

 

GitHub repository used for custom training of Arabic text and numbers: 
https://github.com/tesseract-ocr/tesstrain
