Hi,


My use case is Arabic documents. The pretrained ara.traineddata is good but 
not perfect, so I would like to fine-tune ara.traineddata; if the results 
are still not satisfying, I will train my own custom model.


Please advise on the following:

   1. For my Arabic text use case, one particular character is always 
   predicted incorrectly. Do I need to add the document's font (Traditional 
   Arabic) and train with it? If so, please provide the procedure, or a 
   link, for adding a font when fine-tuning ara.traineddata.
   2. When fine-tuning or training from scratch, how many .gt.txt files do 
   I need, and how many characters should each file contain? Also, roughly 
   how many iterations, if you know?
   3. For numbers, the prediction on Arabic numerals is completely wrong. 
   Do I need to start from scratch or can I fine-tune? Either way, how do I 
   prepare the dataset?
   4. How do I decide max_iterations? Is there a ratio between dataset size 
   and number of iterations?


*Below are my trials:*


*For Arabic Numbers:*


-> I tried to custom-train only Arabic numbers.
-> I wrote a script that writes 100,000 numbers into multiple .gt.txt 
files, with a few hundred characters in each .gt.txt file (a rough sketch 
of this script is shown after the parameters below).
-> Then another script converts each text file to an image with text2image, 
so that the result looks more like a scanned image.
-> The parameters were used in the order below.

text2image --text test.gt.txt --outputbase /home/user/output \
    --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' \
    --degrade_image false --rotate_image --exposure 2 --resolution 300
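
Roughly, the generation script does something like the following 
(simplified sketch; the output directory, loop count, and font settings 
here are placeholders rather than my exact values):

#!/bin/bash
# Sketch: write random Arabic-Indic numeral lines into .gt.txt files and
# render each one with text2image (directory and count are placeholders).
OUT=/home/user/numbers-gt
mkdir -p "$OUT"
digits=(٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
for i in $(seq 1 1000); do
    n=$((RANDOM * RANDOM))          # random integer, Western digits
    s=""
    for (( k = 0; k < ${#n}; k++ )); do
        s+=${digits[${n:k:1}]}      # map each digit to its Arabic-Indic glyph
    done
    echo "$s" > "$OUT/num_$i.gt.txt"
    text2image --text "$OUT/num_$i.gt.txt" --outputbase "$OUT/num_$i" \
        --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' \
        --degrade_image false --rotate_image --exposure 2 --resolution 300
done

Each .gt.txt then holds one numeral string, and text2image writes the 
matching .tif and .box files next to it.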

   1. How much data do I need to prepare for Arabic numbers? For now I only 
   need two specific fonts, which I already have.
   2. Will the dataset contain duplicates if I follow this procedure? If 
   so, is there a way to avoid them?
   3. Is it better to create more .gt.txt files with fewer characters in 
   each (e.g. 50,000 .gt.txt files with 10 numbers in each file) or fewer 
   .gt.txt files with more characters (e.g. 1,000 .gt.txt files with 500 
   numbers in each file)?

If possible, please guide me through the procedure for dataset preparation.

For testing, I tried 50,000 English numbers with one number per .gt.txt 
file (e.g. the text "2500" written to 2500.gt.txt) and trained for 20,000 
iterations, but it failed.
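
My understanding of the expected layout (please correct me if this is 
wrong) is that tesstrain pairs each .gt.txt with a line image of the same 
base name inside the ground-truth directory, for example:

data/foo-ground-truth/
    2500.gt.txt    (one text line, e.g. "2500")
    2500.tif       (the rendered image of that same line)
    2501.gt.txt
    2501.tif
    ...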


*For Arabic Text:*


-> Prepared around 23k .gt.txt files, each containing one sentence.

-> Generated .box files and small .tif files for all the .gt.txt files 
using one font (Traditional Arabic).

-> Used the tesstrain Git repository and trained for 20,000 iterations (the 
rough invocation is shown after this list).

-> After training, generated foo.traineddata with a 0.03 error rate.

-> Ran prediction on the real data: it works perfectly for the particular 
character on which the pretrained ara.traineddata fails, but in terms of 
overall accuracy the pretrained ara.traineddata performs better, except for 
that one character.
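
The training invocation was along these lines (paths simplified; the 
.gt.txt/.tif pairs were placed under data/foo-ground-truth/ first):

git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
# copy the .gt.txt / .tif pairs into data/foo-ground-truth/ beforehand
make training MODEL_NAME=foo MAX_ITERATIONS=20000
# result: data/foo.traineddata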



*Summary:*



   - How can I fix one character in the pretrained model (ara.traineddata)? 
   If that is not possible, how do I custom-train from scratch? Or is there 
   a way to annotate real images and prepare a dataset from them? Please 
   suggest the best practice. (A sketch of the fine-tuning command I have 
   in mind is shown after this list.)
   - How do I prepare an Arabic-number dataset and train on it? If custom 
   training on numbers alone is not possible, can Arabic numbers be added 
   to the pretrained model (ara.traineddata)?
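
For the first point, this is the kind of fine-tuning run I have in mind 
with tesstrain, in case it is the right direction (model name, iteration 
count, and tessdata path are placeholders; as far as I understand, 
START_MODEL needs the "best" float ara.traineddata, since the fast/integer 
models cannot be fine-tuned):

# ground truth containing the problem character (and Arabic numerals) would
# go in data/ara_finetuned-ground-truth/
make training MODEL_NAME=ara_finetuned START_MODEL=ara \
    TESSDATA=/path/to/tessdata_best MAX_ITERATIONS=5000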

 

GitHub repository used for custom training of Arabic text and numbers: 
https://github.com/tesseract-ocr/tesstrain
