Hi,
My use case is OCR on Arabic documents. The pretrained ara.traineddata is good but not perfect, so I would like to fine-tune it and, if the results are still not satisfactory, train my own custom model. Please advise on the following:

1. In my Arabic text, the problem is a single character that is always predicted wrongly. Do I need to add the document's font (Traditional Arabic) and train? If so, please provide the procedure or a link describing how to add one font to the pretrained ara.traineddata.
2. Whether fine-tuning or training from scratch, how many gt.txt files do I need, and how many characters should each file contain? An approximate number of iterations would also help, if you know it.
3. For numbers, the predictions on Arabic numbers are completely wrong. Do I need to start from scratch, or can I fine-tune? In either case, how should I prepare the dataset?
4. How do I decide max_iterations? Is there a ratio between dataset size and iterations?

*Below are my **trials**:*

*For Arabic Numbers:*

-> I tried custom training on Arabic numbers only.
-> I wrote a script that writes 100,000 numbers into multiple gt.txt files, with hundreds of characters in each gt.txt file.
-> I then used a script to convert the text to images (text2image), which should look more like scanned images.
-> The parameters were used in the following order:

text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image false --rotate_image --exposure 2 --resolution 300

1. How large a dataset do I need to prepare for Arabic numbers? For now I only need 2 specific fonts, which I already have.
2. Will the dataset contain duplicates if I follow this procedure? If yes, is there a way to avoid it?
3. Is it better to create more gt.txt files with fewer characters in each (e.g. 50,000 gt files with 10 numbers in each file), or fewer gt.txt files with more characters (e.g. 1,000 gt files with 500 numbers in each file)?

If possible, please guide me through the procedure for dataset preparation.
For testing, I tried 50,000 English numbers, with each number in its own gt.txt file (e.g. "2500" written to 2500.gt.txt) and 20,000 iterations, but it fails.

*For Arabic Text:*

-> Prepared around 23k gt.txt files, each containing one sentence.
-> Generated .box files and small .tif files for all gt.txt files using 1 font (Traditional Arabic).
-> Used the tesstrain repository and trained for 20,000 iterations.
-> After training, generated foo.traineddata with a 0.03 error rate.
-> Ran prediction on the real data. It works perfectly for the particular character on which the pretrained model (ara.traineddata) fails, but in overall accuracy the pretrained model performs better, except for that one character.

*Summary:*

- How can I fix one character in the pretrained model (ara.traineddata)? If that is not possible, how do I custom train from scratch, or is there a way to annotate real images and prepare a dataset from them? Please suggest the best practice.
- How do I prepare an Arabic number dataset and train on it? If custom training on numbers is not possible, can Arabic numbers be added to the pretrained model (ara.traineddata)?

GitHub link used for custom training of Arabic text and numbers: https://github.com/tesseract-ocr/tesstrain
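For reference, the number-generation step described under *For Arabic Numbers* could be sketched roughly as below. This is only a minimal illustration, not the exact script used. The function names, the `num_` file prefix, the seed and the value range are all hypothetical, and it assumes that "Arabic numbers" means Arabic-Indic digits (U+0660 to U+0669). It writes one text line per gt.txt file and uses a set to skip repeated lines, which is one possible way to avoid duplicates in the generated text.

```python
import random
from pathlib import Path

# Assumption: "Arabic numbers" means Arabic-Indic digits (U+0660..U+0669).
ARABIC_INDIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")


def make_unique_lines(n_lines, numbers_per_line, max_value=999_999, seed=0):
    """Return n_lines distinct text lines, each holding several random numbers."""
    # Guard against asking for more distinct lines than can exist,
    # which would otherwise make the loop below run forever.
    if n_lines > (max_value + 1) ** numbers_per_line:
        raise ValueError("not enough distinct lines for these settings")

    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    seen = set()
    lines = []
    while len(lines) < n_lines:
        numbers = [str(rng.randint(0, max_value)) for _ in range(numbers_per_line)]
        line = " ".join(numbers).translate(ARABIC_INDIC)
        if line not in seen:  # skip exact duplicates
            seen.add(line)
            lines.append(line)
    return lines


def write_gt_files(lines, out_dir, prefix="num"):
    """Write each line to its own <prefix>_<index>.gt.txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, line in enumerate(lines):
        path = out / f"{prefix}_{i:06d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_gt_files(make_unique_lines(1000, 10), "ground-truth")` would produce 1,000 files with 10 numbers each. Each resulting file could then be passed to the text2image command shown earlier through its `--text` option. Note that the set only removes exact repeats of a whole line, so individual numbers can still appear in more than one file.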