Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than with the default model.
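For deciding between a fine-tuned model and the default, it helps to measure the character error rate (CER) on a small held-out set instead of eyeballing the output. A minimal sketch, using plain edit distance and no external libraries (the sample strings are illustrative only):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ground_truth: str, ocr_output: str) -> float:
    # Character error rate: edits needed per ground-truth character.
    return edit_distance(ground_truth, ocr_output) / max(len(ground_truth), 1)

print(cer("tesseract", "tesseroct"))  # one substitution over nine characters
```

Running both models over the same ground-truth lines and comparing average CER gives a concrete number to decide which traineddata to keep.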
On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com wrote:
> Not a good result. That's why I stopped training for now. The default traineddata is overall better than my from-scratch model.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com wrote:
> Hi Ali,
> How is your training going? Do you get good results with the training-from-scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:
> Yes, two months ago when I started to learn OCR I saw that. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com wrote:
> Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:
> I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:
> I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing the plain "r" file mode to use UTF-8 encoding in the script.
> I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:
> Did you face this error: "Can't encode transcription"? If so, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:
> I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:
> Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:
> I now get to 200,000 iterations, and the error rate is stuck at 0.46.
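The encoding fix mentioned a few messages up (for the "Can't encode transcription" error) usually comes down to opening the training text with an explicit UTF-8 encoding rather than the platform default. A small sketch, using a throwaway file rather than a real training text:

```python
import os
import tempfile

def read_training_lines(path):
    # An explicit encoding avoids Unicode errors on Windows, where a plain
    # open(path, 'r') falls back to the locale's default encoding.
    with open(path, "r", encoding="utf-8") as input_file:
        return [line.strip() for line in input_file]

# Demonstration with a temporary file containing non-ASCII text.
sample = os.path.join(tempfile.mkdtemp(), "sample.training_text")
with open(sample, "w", encoding="utf-8") as f:
    f.write("বাংলা লাইন\n")
print(read_training_lines(sample))
```

The same `encoding="utf-8"` argument applies to every `open(...)` call in the splitting scripts below.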
> The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:
> After Tesseract recognizes text from the images, you can apply a regex to replace the wrong words with the correct ones.
> I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:
> At what stage are you doing the regex replacement?
> My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf
>
> > EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.
>
> How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:
> I know what you mean, but in some cases it helps me. I have found that specific characters and words are never recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.
> See what I have done:
>
>     " ী": "ী",
>     " ্": " ",
>     " ে": " ",
>     "জ্া": "জা",
>     " ": " ",
>     " ": " ",
>     " ": " ",
>     "্প": " ",
>     " য": "র্য",
>     "য": "য",
>     " া": "া",
>     "আা": "আ",
>     "ম্ি": "মি",
>     "স্ু": "সু",
>     "হূ ": "হূ",
>     " ণ": "ণ",
>     "র্্": "র",
>     "চিন্ত ": "চিন্তা ",
>     "ন্া": "না",
>     "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:
> The problem with regex is that Tesseract is not consistent in its replacements.
> Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it meets /u/ during actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes completely removed, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:
> If some specific characters or words are always missing from the OCR result, then you can apply logic with regular expressions in your application.
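For what it's worth, the replacement-table idea above can be applied as a simple substitution pass over the OCR output. A sketch using Latin placeholder pairs instead of the Bengali ones listed (plain `str.replace` rather than full regex, which is enough for fixed strings):

```python
# A minimal post-OCR correction pass: apply a table of known bad -> good
# substitutions to recognized text. The pairs here are illustrative
# placeholders standing in for the Bengali conjunct fixes from the thread.
REPLACEMENTS = {
    "rn": "m",   # classic OCR confusion
    "vv": "w",
    "1l": "ll",
}

def post_correct(text: str) -> str:
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

print(post_correct("carn vvas we1l"))  # → "cam was well"
```

As noted below, this only works when the model's mistakes are consistent; when the same character is sometimes dropped and sometimes substituted, a fixed table cannot catch every case.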
> After OCR, those specific characters or words will be replaced by the correct characters or words you defined in your application with regular expressions. That can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:
> The characters are getting missed even after fine-tuning. I never made any progress. I tried many different ways, but some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:
> EasyOCR, I think, is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.
>
> I have added dictionary words, but the result is the same.
>
> What kind of problem did you face when fine-tuning for a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:
> Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata.
> But I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.
>
> Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
>
> Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.
>
> If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:
> "How is your training going for Bengali?" It was nearly good, but I faced spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I used the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can experiment with different datasets. If you succeed,
> please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:
> How is your training going for Bengali?
> I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:
> Yes, he doesn't mention all the fonts, only one font; that's why he didn't use MODEL_NAME in a separate script file, I think.
>
> Actually, here we train on all the .tif, .gt.txt, and .box files that are created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the .tif, .gt.txt, and .box files, every filename starts with MODEL_NAME.
> This is the MODEL_NAME we select in the training script for looping over each of the .tif, .gt.txt, and .box files created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:
> Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.
>
> The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.
>
> The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
>
>     TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>
> Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?
On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:
>
>     import subprocess
>
>     # List of font names
>     font_names = ['ben']
>
>     for font in font_names:
>         command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>         subprocess.run(command, shell=True)
>
> 1. This is the training command; I have named the file 'tesseract_training.py' and put it inside the tesstrain folder.
> 2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
> 3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.
>
> After that, you have to add the exact font names in the FontList.py file like me.
> I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.
>
> [image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:
> Thank you so much for putting out these brilliant scripts. They make the process much more efficient.
>
> I have one more question on the other script that you use to train:
>
>     import subprocess
>
>     # List of font names
>     font_names = ['ben']
>
>     for font in font_names:
>         command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>         subprocess.run(command, shell=True)
>
> Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:
> You can use the new script below; it's better than the previous two scripts.
> You can create the .tif, .gt.txt, and .box files with multiple fonts, and it also supports checkpoints: if VS Code closes (or anything else happens) while the files are being created, you can use the checkpoint to resume from where you stopped.
>
> Command for the .tif, .gt.txt, and .box files:
>
>     import os
>     import random
>     import pathlib
>     import subprocess
>     import argparse
>     from FontList import FontList
>
>     def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>         lines = []
>         with open(training_text_file, 'r') as input_file:
>             lines = input_file.readlines()
>
>         if not os.path.exists(output_directory):
>             os.mkdir(output_directory)
>
>         if start_line is None:
>             start_line = 0
>
>         if end_line is None:
>             end_line = len(lines) - 1
>
>         for font_name in font_list.fonts:
>             for line_index in range(start_line, end_line + 1):
>                 line = lines[line_index].strip()
>
>                 training_text_file_name = pathlib.Path(training_text_file).stem
>                 line_serial = f"{line_index:d}"
>
>                 line_gt_text = os.path.join(
>                     output_directory,
>                     f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt'
>                 )
>
>                 with open(line_gt_text, 'w') as output_file:
>                     output_file.writelines([line])
>
>                 file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>                 subprocess.run([
>                     'text2image',
>                     f'--font={font_name}',
>                     f'--text={line_gt_text}',
>                     f'--outputbase={output_directory}/{file_base_name}',
>                     '--max_pages=1',
>                     '--strip_unrenderable_words',
>                     '--leading=36',
>                     '--xsize=3600',
>                     '--ysize=330',
>                     '--char_spacing=1.0',
>                     '--exposure=0',
>                     '--unicharset_file=langdata/eng.unicharset',
>                 ])
>
>     if __name__ == "__main__":
>         parser = argparse.ArgumentParser()
>         parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>         parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>         args = parser.parse_args()
>
>         training_text_file = 'langdata/eng.training_text'
>         output_directory = 'tesstrain/data/eng-ground-truth'
>
>         font_list = FontList()
>
>         create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>
> Then create a file called "FontList.py" in the root directory and paste this in:
>
>     class FontList:
>         def __init__(self):
>             self.fonts = [
>                 "Gerlick",
>                 "Sagar Medium",
>                 "Ekushey Lohit Normal",
>                 "Charukola Round Head Regular, weight=433",
>                 "Charukola Round Head Bold, weight=443",
>                 "Ador Orjoma Unicode",
>             ]
>
> Then import it in the script above.
>
> For the checkpoint command:
>
>     sudo python3 split_training_text.py --start 0 --end 11
>
> Adjust the checkpoint range (--start 0 --end 11) to suit you.
> And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:
> Hi mhalidu,
> The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>
> I have been using your earlier script. It is magical. How is this one different from the earlier one?
>
> Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:
> OK, I will try as you said.
> One more thing: what should the training_text lines look like? I have seen that Bengali texts are long lines of words, so I want to know how many words or characters per line would be the better choice for training.
> And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
> Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:
> I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code below so you can see what is going on.
> Code for creating the .tif, .gt.txt, and .box files:
>
>     import os
>     import random
>     import pathlib
>     import subprocess
>     import argparse
>     from FontList import FontList
>
>     def read_line_count():
>         if os.path.exists('line_count.txt'):
>             with open('line_count.txt', 'r') as file:
>                 return int(file.read())
>         return 0
>
>     def write_line_count(line_count):
>         with open('line_count.txt', 'w') as file:
>             file.write(str(line_count))
>
>     def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>         lines = []
>         with open(training_text_file, 'r') as input_file:
>             for line in input_file.readlines():
>                 lines.append(line.strip())
>
>         if not os.path.exists(output_directory):
>             os.mkdir(output_directory)
>
>         random.shuffle(lines)
>
>         if start_line is None:
>             line_count = read_line_count()  # Set the starting line_count from the file
>         else:
>             line_count = start_line
>
>         if end_line is None:
>             end_line_count = len(lines) - 1  # Set the ending line_count
>         else:
>             end_line_count = min(end_line, len(lines) - 1)
>
>         for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>             font_serial = 1
>             for line in lines:
>                 training_text_file_name = pathlib.Path(training_text_file).stem
>
>                 # Generate a unique serial number for each line
>                 line_serial = f"{line_count:d}"
>
>                 # GT (Ground Truth) text filename
>                 line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>                 with open(line_gt_text, 'w') as output_file:
>                     output_file.writelines([line])
>
>                 # Image filename
>                 file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>                 subprocess.run([
>                     'text2image',
>                     f'--font={font}',
>                     f'--text={line_gt_text}',
>                     f'--outputbase={output_directory}/{file_base_name}',
>                     '--max_pages=1',
>                     '--strip_unrenderable_words',
>                     '--leading=36',
>                     '--xsize=3600',
>                     '--ysize=350',
>                     '--char_spacing=1.0',
>                     '--exposure=0',
>                     '--unicharset_file=langdata/ben.unicharset',
>                 ])
>
>                 line_count += 1
>                 font_serial += 1
>
>             # Reset font_serial for the next font iteration
>             font_serial = 1
>
>         write_line_count(line_count)  # Update the line_count in the file
>
>     if __name__ == "__main__":
>         parser = argparse.ArgumentParser()
>         parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>         parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>         args = parser.parse_args()
>
>         training_text_file = 'langdata/ben.training_text'
>         output_directory = 'tesstrain/data/ben-ground-truth'
>
>         # Create an instance of the FontList class
>         font_list = FontList()
>
>         create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
> And the training code:
>
>     import subprocess
>
>     # List of font names
>     font_names = ['ben']
>
>     for font in font_names:
>         command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>         subprocess.run(command, shell=True)
>
> Any suggestions for identifying the problem?
> Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a78fce2d-c803-4c33-98c0-90ef5feea736n%40googlegroups.com.