This is the command I used to train from a layer:

make training MODEL_NAME=amh START_MODEL=amh APPEND_INDEX=5 NET_SPEC='[Lfx256 O1c105]' TESSDATA=../tesseract/tessdata EPOCHS=3 TARGET_ERROR_RATE=0.0001 training >> data/amh.log &

I took it from Shreeshrii's training repo *tesstrain-JSTORArabic*: https://github.com/Shreeshrii/tesstrain-JSTORArabic
- The net_spec of ben might not be the same as amh's. Shreeshrii has posted a link to the net specs of the various languages in this forum.

On Sunday, October 22, 2023 at 12:09:25 PM UTC+3, mdalihu...@gmail.com wrote:

You can test by changing '--char_spacing=1.0'. I think it is hurting the accuracy of the results as well.

On Sunday, 22 October, 2023 at 3:07:16 pm UTC+6, Ali hussain wrote:

I haven't tried cutting the top layer of the network. Can you share what you did when you cut the top layer, or a link to a GitHub project?

On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6, desal...@gmail.com wrote:

That is massive data. Have you tried training by cutting the top layer of the network? I think that is the most promising approach. I was getting really good results with it, but those results did not carry over to scanned documents; I get the best results on synthetic data. I am now experimenting with the text2image settings to see whether it is possible to emulate scanned documents.

I also suspect that the '--char_spacing=1.0' setting in our setup is causing trouble: scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.

On Sunday, October 22, 2023 at 4:09:46 AM UTC+3, mdalihu...@gmail.com wrote:

600,000 lines of text, and more than 600,000 iterations. But sometimes I got better results with fewer iterations when fine-tuning, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6, desal...@gmail.com wrote:

How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3, Des Bw wrote:

Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.
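For reference, the "cut the top layer" approach discussed in this thread is what the tesstrain Makefile exposes through START_MODEL plus APPEND_INDEX and NET_SPEC: the existing network is truncated at the given layer index and the new spec is appended, so an output layer sized for a new character set can replace the old one. Below is a minimal sketch of assembling that command from Python, in the same subprocess style used elsewhere in this thread; the index 5, the [Lfx256 O1c105] spec, and the paths are assumptions copied from the amh example above, not values that fit every language (O1c105 is a 105-class output layer and must match your unicharset size).

```python
# Sketch: build the layer-replacement training command for tesstrain.
# All values below are assumptions taken from the amh example in this thread.
model = "amh"
cmd = (
    f"make training MODEL_NAME={model} START_MODEL={model} "
    "APPEND_INDEX=5 "              # cut the existing network at layer index 5
    "NET_SPEC='[Lfx256 O1c105]' "  # append a fresh LSTM + 105-class output layer
    "TESSDATA=../tesseract/tessdata "
    "EPOCHS=3 TARGET_ERROR_RATE=0.0001"
)
print(cmd)
# To actually run it, as the scripts in this thread do:
# subprocess.run(cmd, shell=True)
```
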
On Saturday, October 21, 2023 at 3:22:44 AM UTC+3, mdalihu...@gmail.com wrote:

Not good results; that's why I have stopped training for now. The default traineddata is better overall than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6, desal...@gmail.com wrote:

Hi Ali,
How is your training going? Are you getting good results with the training from scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3, tesseract-ocr wrote:

Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6, desal...@gmail.com wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3, mdalihu...@gmail.com wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6, elvi...@gmail.com wrote:

I also faced that issue on Windows. Apparently it is related to Unicode; you can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM, Ali hussain <mdalihu...@gmail.com> wrote:

Have you faced this error: "Can't encode transcription"? If so, how did you solve it?
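The encoding fix suggested above ("changing 'r' to 'utf8'") amounts to passing an explicit encoding to Python's open() instead of relying on the platform's default codec, which on many Windows setups is cp1252 and cannot decode Bengali text. A small illustrative sketch; the file name is made up:

```python
import os
import tempfile

# Write a sample ground-truth line containing non-ASCII (Bengali) text.
gt_path = os.path.join(tempfile.mkdtemp(), "ben_0.gt.txt")
with open(gt_path, "w", encoding="utf-8") as f:
    f.write("সম্পূর্ণ\n")

# open(gt_path, 'r') would use the locale's default codec and can raise
# UnicodeDecodeError on Windows; requesting UTF-8 explicitly is portable.
with open(gt_path, "r", encoding="utf-8") as f:
    first_line = f.readline().strip()
print(first_line)
```
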
On Thursday, 14 September, 2023 at 10:51:52 am UTC+6, elvi...@gmail.com wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM, Ali hussain <mdalihu...@gmail.com> wrote:

Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6, desal...@gmail.com wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, mdalihu...@gmail.com wrote:

After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6, desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?
On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are never recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6, desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it encounters /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case.
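A replacement table like the one above can be applied as a post-OCR pass. Since the entries map fixed substrings rather than regex patterns, plain str.replace is enough. The two sample entries below are borrowed from the table; the input string is invented for illustration. Note that replacements are applied in dict order, so entries whose wrong form overlaps another entry's output need care:

```python
def apply_corrections(text, corrections):
    """Apply each fixed-substring correction to the OCR output, in order."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text

# Two entries borrowed from the table above; the input text is invented.
corrections = {
    "আা": "আ",   # spurious dependent vowel sign after the independent vowel
    "র্্": "র",   # doubled virama after ra
}
fixed = apply_corrections("আামার", corrections)
print(fixed)
```
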
Those characters are sometimes completely removed; other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, you can apply regular-expression logic in your application. After OCR, those specific characters or words are replaced by the correct characters or words you defined via regular expressions. This can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6, desal...@gmail.com wrote:

The characters are getting missed even after fine-tuning. I never made any progress; I tried many different ways, and some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

EasyOCR, I think, is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't actually used EasyOCR myself; you can try it.
I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning on a few new characters, as you said ("*but, I failed in every possible way to introduce a few new characters into the database*")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow; the video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If all this fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try.
Sure, >>>>>>>>>>>>>>>>>>>>>>>> sharing our experiences will be helpful. I will let >>>>>>>>>>>>>>>>>>>>>>>> you know if I made good >>>>>>>>>>>>>>>>>>>>>>>> progresses in any of these options. >>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM >>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali? It was >>>>>>>>>>>>>>>>>>>>>>>>> nearly good but I faced space problems between two >>>>>>>>>>>>>>>>>>>>>>>>> words, some words are >>>>>>>>>>>>>>>>>>>>>>>>> spaces but most of them have no space. I think is >>>>>>>>>>>>>>>>>>>>>>>>> problem is in the dataset >>>>>>>>>>>>>>>>>>>>>>>>> but I use the default training dataset from Tesseract >>>>>>>>>>>>>>>>>>>>>>>>> which is used in Ben >>>>>>>>>>>>>>>>>>>>>>>>> That way I am confused so I have to explore more. by >>>>>>>>>>>>>>>>>>>>>>>>> the way, you can try >>>>>>>>>>>>>>>>>>>>>>>>> as Lorenzo Blz said. Actually training from >>>>>>>>>>>>>>>>>>>>>>>>> scratch is harder than fine-tuning. so you can use >>>>>>>>>>>>>>>>>>>>>>>>> different datasets to >>>>>>>>>>>>>>>>>>>>>>>>> explore. if you succeed. please let me know how you >>>>>>>>>>>>>>>>>>>>>>>>> have done this whole >>>>>>>>>>>>>>>>>>>>>>>>> process. I'm also new in this field. >>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm >>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali? >>>>>>>>>>>>>>>>>>>>>>>>>> I have been trying to train from scratch. I made >>>>>>>>>>>>>>>>>>>>>>>>>> about 64,000 lines of text (which produced about >>>>>>>>>>>>>>>>>>>>>>>>>> 255,000 files, in the end) >>>>>>>>>>>>>>>>>>>>>>>>>> and run the training for 150,000 iterations; getting >>>>>>>>>>>>>>>>>>>>>>>>>> 0.51 training error >>>>>>>>>>>>>>>>>>>>>>>>>> rate. I was hopping to get reasonable accuracy. 
Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts, only one font. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the *tif, gt.txt, and .box files* that were created under MODEL_NAME (i.e. the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. The MODEL_NAME we select in the training script is used to loop over each tif, gt.txt, and .box file created under that MODEL_NAME.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video, but your script is much better because it supports multiple fonts.
The whole improvement you made is brilliant and very useful, and it is all working for me. The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does this mean that my model doesn't train on the fonts (even if the fonts were included in the splitting process, in the other script)?
On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = (
        f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} "
        "START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    )
    subprocess.run(command, shell=True)

1. This is the training command; I have named the script 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file sits on the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv. After that, you have to add the exact font names to the FontList.py file as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the tesstrain folder expanded.

[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train.
import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = (
        f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} "
        "START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    )
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts. You can create the *tif, gt.txt, and .box files* with multiple fonts, and it also supports resuming: if VS Code closes or anything else happens while creating the tif, gt.txt, and .box files, you can use the checkpoint to pick up from where it stopped.
The command for creating the *tif, gt.txt, and .box files*:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0
    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem
            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory,
                         args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this (note the comma after every font name; a missing comma would silently concatenate two names into one):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the script above.

*For the breakpoint command:*
sudo python3 split_training_text.py --start 0 --end 11

Change the range (--start 0 --end 11) according to your checkpoint. Resuming training from a checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.
On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what role do the training_text lines play? I have seen that Bengali lines contain long words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM, Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using all the official training_text, tessdata_best, and the other pieces as well.
Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code base to show what is going on.

*Code for creating the tif, gt.txt, and .box files:*

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    # Resume support: read the last processed line number, if any.
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial
= f"{line_count >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> :d}" >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # GT (Ground Truth) text >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> filename >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_gt_text = os.path.join( >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_directory, f'{ >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> training_text_file_name}_{line_serial} >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> .gt.txt') >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open(line_gt_text, 'w') >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as output_file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_file.writelines >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ([line]) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # Image filename >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file_base_name = f'ben_{ >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_serial}' # Unique filename for >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> each font >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> subprocess.run([ >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'text2image', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> f'--font={font}', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> f'--text={line_gt_text}' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> , >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> f'--outputbase={ >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_directory}/{file_base_name}', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--max_pages=1', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --strip_unrenderable_words', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--leading=36', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--xsize=3600', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--ysize=350', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--char_spacing=1.0', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--exposure=0', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --unicharset_file=langdata/ben.unicharset >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ]) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_count += 1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_serial += 1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # Reset font_serial for the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> next font iteration >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_serial = 1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> write_line_count(line_count) # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Update the line_count in the file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__": >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parser = argparse.ArgumentParser() >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parser.add_argument('--start', type= >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int, help='Starting line count >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (inclusive)') >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parser.add_argument('--end', type= >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> int, help='Ending line count (inclusive) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ') >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> args = parser.parse_args() >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> training_text_file = ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> langdata/ben.training_text' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_directory = ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesstrain/data/ben-ground-truth' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # Create an instance of the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FontList class >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_list = FontList() >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> create_training_data(training_text_file, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_list, output_directory, args.start, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> args.end) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *and for training code:* 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben'] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command = >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> f"TESSDATA_PREFIX=../tesseract/tessdata >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> make training MODEL_NAME={font} >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> START_MODEL=ben >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TESSDATA=../tesseract/tessdata >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MAX_ITERATIONS=10000 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LANG_TYPE=Indic" >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> subprocess.run(command, shell=True) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> any suggestion to identify to extract >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the problem. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks, everyone >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are subscribed to the Google Groups >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "tesseract-ocr" group. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> receiving emails from it, send an email to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com. 
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.