Here it is: https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md
On Sunday, October 22, 2023 at 12:45:40 PM UTC+3 Des Bw wrote:

This is the command I used to train from a layer:

    make training MODEL_NAME=amh START_MODEL=amh APPEND_INDEX=5 NET_SPEC='[Lfx256 O1c105]' TESSDATA=../tesseract/tessdata EPOCHS=3 TARGET_ERROR_RATE=0.0001 training >> data/amh.log &

- I took it from Shreeshrii's training repo tesstrain-JSTORArabic: https://github.com/Shreeshrii/tesstrain-JSTORArabic
- The net_spec of ben might not be the same as amh. Shreeshrii has posted a link to the net specs of the various languages in this forum.

On Sunday, October 22, 2023 at 12:09:25 PM UTC+3 mdalihu...@gmail.com wrote:

You can test by changing '--char_spacing=1.0'. I think it could also be a problem for the accuracy of the result.

On Sunday, 22 October, 2023 at 3:07:16 pm UTC+6 Ali hussain wrote:

I haven't tried cutting the top layer of the network. Can you share what you did when you cut the top layer, or a GitHub project link?

On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6 desal...@gmail.com wrote:

That is massive data. Have you tried training by cutting the top layer of the network? I think that is the most promising approach. I was getting really good results with it, but the results do not carry over to scanned documents; I get the best results on the synthetic data. I am now experimenting with the text2image settings to see whether it is possible to emulate scanned documents. I also suspect that the '--char_spacing=1.0' setting in our setup is causing trouble: scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.

On Sunday, October 22, 2023 at 4:09:46 AM UTC+3 mdalihu...@gmail.com wrote:

600,000 lines of text, and more than 600,000 iterations.
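The text2image experiment described above (varying spacing and exposure to better match scanned pages) could be scripted roughly as follows. Only the --char_spacing and --exposure flags are taken from the scripts shown later in this thread; the parameter grid, font, and paths are illustrative assumptions, not a tested recipe.

```python
# Sketch: build text2image command lines for a small parameter sweep,
# so synthetic renderings can be compared against real scanned documents.
# The grid values and paths below are illustrative assumptions.
def build_text2image_cmds(font, text_file, out_dir):
    cmds = []
    for spacing in (0.0, 0.5, 1.0):      # scanned pages have spacing close to 0
        for exposure in (-1, 0, 1):      # vary ink darkness
            base = f"{out_dir}/sweep_cs{spacing}_exp{exposure}"
            cmds.append([
                "text2image",
                f"--font={font}",
                f"--text={text_file}",
                f"--outputbase={base}",
                f"--char_spacing={spacing}",
                f"--exposure={exposure}",
            ])
    return cmds
```

Running each list through subprocess.run() would produce one .tif/.box pair per setting, which can then be inspected side by side with a real scan to pick the closest-looking combination.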
But sometimes I got better results with fewer iterations when fine-tuning, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 desal...@gmail.com wrote:

How many lines of text and how many iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:

Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com wrote:

Not a good result. That's why I have stopped training for now. The default traineddata is better overall than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com wrote:

Hi Ali,
How is your training going? Are you getting good results with the training-from-scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:

I also faced that issue on Windows. Apparently the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script.
I ended up installing Ubuntu because I was getting too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:

Have you faced this error: "Can't encode transcription"? If so, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:

Are you training from Tesseract's default text data or from your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:

After Tesseract recognizes the text from the images, you can apply a regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, nor with ScanTailor.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:

At what stage are you doing the regex replacement?
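The "r" → "utf8" change mentioned above presumably refers to passing an explicit encoding when opening the training text: on Windows, open() defaults to the locale code page (e.g. cp1252), which cannot handle Bengali text. A minimal sketch, with the function name and path being illustrative only:

```python
# Sketch: read a training-text file with an explicit UTF-8 encoding.
# Without encoding="utf-8", Windows falls back to the locale code page
# and raises Unicode errors on non-Latin scripts such as Bengali.
def read_training_lines(path):
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

The same one-argument change applies anywhere the training scripts in this thread call open(..., 'r') on text that may contain non-ASCII characters.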
My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards and that kind of image processing, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

Here is what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    "জ্া": "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    "য": "য",
    " া": "া",
    "আা": "আ",
    "ম্ি": "মি",
    "স্ু": "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    "র্্": "র",
    "চিন্ত ": "চিন্তা ",
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacement.
Suppose the original training of the English data didn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, you can apply logic with regular expressions in your application. After OCR, those specific characters or words are replaced by the correct characters or words you define in your application. That can handle some of the major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:

The characters are getting missed even after fine-tuning. I never made any progress; I tried many different ways, and some specific characters are always missing from the OCR result.
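The post-OCR correction approach discussed above can be sketched as a fixed substitution table applied to the recognized text. The two entries below are taken from the Bengali mapping posted earlier in the thread; for fixed strings like these, plain str.replace is enough and no regular expressions are needed:

```python
# Sketch: apply a table of known OCR confusions to recognized text.
# The entries are examples from the Bengali mapping in this thread;
# a real table would carry the full language-specific list.
REPLACEMENTS = {
    "চিন্ত ": "চিন্তা ",  # truncated word seen in OCR output
    "আা": "আ",            # spurious extra vowel sign
}

def post_correct(text):
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    return text
```

As the reply above notes, this only works for consistent errors; when Tesseract sometimes drops a character and sometimes substitutes it, no single table entry covers both cases.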
On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:

EasyOCR, I think, is best for ID cards and that kind of image processing, but for document images like books, Tesseract is better than EasyOCR. I haven't actually used EasyOCR; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning on a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow; the video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries.
Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is our next option to try. Sure, sharing our experiences will be helpful; I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I ran into spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract, the one used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is harder than fine-tuning, so you can explore different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.
On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

Yes, he mentions only one font, not all of them. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME.
This MODEL_NAME is what we select in the training script for looping over each of the tif, gt.txt, and .box files created under MODEL_NAME.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts; the whole improvement you made is brilliant and very useful, and it is all working for me. The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does this mean that my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

    import subprocess

    # List of font names
    font_names = ['ben']
    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This is the training command; I have put it in a file named 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders.
If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file is in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.

3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file as I did. I have attached two pictures of my folder structure: the first is the main structure and the second is the expanded tesstrain folder. [image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train:
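Since text2image quietly renders nothing useful when a font name does not exactly match what fontconfig reports, a small check like the following can save a debugging round trip before a name is added to FontList.py. This is a sketch assuming a Linux system with fontconfig's fc-list tool on the PATH; it returns None rather than guessing when fc-list is unavailable.

```python
import shutil
import subprocess

def font_installed(font_name):
    """Return True if fontconfig reports a family matching font_name.

    Sketch only: requires the fc-list tool (fontconfig) on PATH.
    Returns None when fc-list is not available.
    """
    if shutil.which("fc-list") is None:
        return None
    # 'fc-list : family' prints installed family names, one line per face,
    # with aliases separated by commas.
    out = subprocess.run(["fc-list", ":", "family"],
                         capture_output=True, text=True).stdout
    families = {fam.strip() for line in out.splitlines()
                for fam in line.split(",")}
    return font_name in families
```

This mirrors the sudo fc-cache -fv step above: after refreshing the cache, each entry intended for FontList.py can be verified against the installed families.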
    import subprocess

    # List of font names
    font_names = ['ben']
    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes, or anything else happens while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where VS Code closed.
The script for the tif, gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0

        if end_line is None:
            end_line = len(lines) - 1

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()

                training_text_file_name = pathlib.Path(training_text_file).stem
                line_serial = f"{line_index:d}"

                line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:
    class FontList:
        def __init__(self):
            self.fonts = [
                # note: a missing comma between adjacent string literals
                # would silently concatenate two font names into one
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

Then import it in the script above.

The breakpoint command:

    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed.
And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

Hi mdalihu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear; your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what role does the length of the training_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts in your fine-tuning list of fonts as well, and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official trained_text, tessdata_best, and everything else. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as they did previously, while my new fonts work well. I don't understand why this is happening.
I am sharing the code so you can see what is going on.

*codes for creating tif, gt.txt, .box files:*

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    # Resume the serial number from the previous run, if any.
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each line
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list,
                         output_directory, args.start, args.end)

*and for training code:*
import subprocess

# List of model names to train
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata "
               f"make training MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata "
               f"MAX_ITERATIONS=10000 LANG_TYPE=Indic")
    subprocess.run(command, shell=True)

Any suggestions on how to identify the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
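[Editor's note] The generation script in this thread imports a FontList class that was never posted. A minimal sketch of what it presumably looks like is below; the font names are illustrative Bengali fonts, not the author's actual list. Following shree's advice, the default fonts used for the official ben traineddata would be listed alongside the new fonts being fine-tuned.

```python
# Hypothetical stand-in for the unposted FontList module.
class FontList:
    def __init__(self):
        # Include the fonts behind the official ben model plus the
        # new fonts being added; these names are placeholders.
        self.fonts = [
            'Lohit Bengali',
            'Mukti Narrow',
            'Bangla Medium',  # replace with the fonts you are adding
        ]

font_list = FontList()
```

The generation script only reads `font_list.fonts`, so any object exposing a list of font-name strings would work in its place.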
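[Editor's note] Since text2image can fail silently on an unrenderable line, a quick consistency check on the ground-truth directory before running `make training` can help isolate problems like the one described above. This is a sketch, not from the thread, assuming the `ben_{serial}` naming produced by the generation script:

```python
import pathlib

def unpaired_files(ground_truth_dir):
    """Return basenames that have a .gt.txt without a .tif, or vice versa."""
    d = pathlib.Path(ground_truth_dir)
    gt = {p.name[:-len('.gt.txt')] for p in d.glob('*.gt.txt')}
    tif = {p.stem for p in d.glob('*.tif')}
    # The symmetric difference holds every basename missing its partner.
    return sorted(gt ^ tif)
```

An empty result means every ground-truth line has a matching image; any names returned point at lines text2image skipped or stray files.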