How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:
Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com wrote:
Not good results; that's why I have stopped training now. The default traineddata is better overall than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com wrote:
Hi Ali,
How is your training going? Are you getting good results with training from scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:
Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com wrote:
Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:
I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:
I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:
Have you faced this error: "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:
I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:
Are you training from Tesseract's default text data or your own collected text data?
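[For reference: the encoding fix mentioned above amounts to passing an explicit encoding when the training script opens its text files. A minimal sketch, assuming a helper of my own naming rather than anything from the scripts in this thread:]

```python
def read_training_lines(path):
    # Force UTF-8 regardless of the platform default. On Windows, open()
    # falls back to the locale codepage (e.g. cp1252), which raises
    # UnicodeDecodeError on Bengali text; encoding="utf-8" avoids that.
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```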
On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:
I am now at 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:
After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:
At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:
I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

See what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    "জ্া": "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    "য": "য",
    " া": "া",
    "আা": "আ",
    "ম্ি": "মি",
    "স্ু": "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    "র্্": "র",
    "চিন্ত ": "চিন্তা ",
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:
The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ or /w/; in other cases, it removes it completely. That is what is happening in my case: those characters are sometimes completely removed, and other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:
If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expressions method in your application.
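[A replacement map like the one shown above can be applied with plain string replacement rather than full regex. A minimal sketch, using two entries from the list as illustrative examples:]

```python
# Illustrative subset of the wrong -> correct map above; extend it with
# the full list of pairs.
REPLACEMENTS = {
    "চিন্ত ": "চিন্তা ",
    "আা": "আ",
}

def post_correct(text):
    # Apply each fix in order. Plain str.replace is enough when the
    # patterns are literal character sequences rather than regex classes.
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    return text
```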
After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:
The characters are getting missed even after fine-tuning. I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:
EasyOCR, I think, is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning on a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:
Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:
> How is your training going for Bengali?
It was nearly good, but I faced spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract, the one used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can use different datasets to explore.
If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:
How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:
Yes, he mentions only one font, not all of them. That's why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean eng, ben, oro — the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:
Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts have been included in the splitting process, in the other script)?
On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This is the command for training the data; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file goes in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, and then run sudo apt update followed by sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:
Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question, on the other script that you use to train:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?
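[One pitfall with the font-installation steps above is that the names in FontList.py must exactly match what fontconfig reports after fc-cache. A small sketch for checking that by parsing the output of `fc-list : family` (the helper function is my own, not from the thread's scripts):]

```python
def parse_font_families(fc_list_output):
    # `fc-list : family` prints one line per font file; a single line may
    # carry several comma-separated family aliases.
    families = set()
    for line in fc_list_output.splitlines():
        for name in line.split(","):
            if name.strip():
                families.add(name.strip())
    return families
```

Feeding it `subprocess.run(["fc-list", ":", "family"], capture_output=True, text=True).stdout` and checking that every FontList entry is present would catch misspelled font names before hours of image generation.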
On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:
You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports checkpoints: if VS Code closes (or anything else happens) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.

Script for the tif, gt.txt, and .box files:

    import os
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0
        if end_line is None:
            end_line = len(lines) - 1

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()

                training_text_file_name = pathlib.Path(training_text_file).stem
                line_serial = f"{line_index:d}"

                line_gt_text = os.path.join(
                    output_directory,
                    f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:

    class FontList:
        def __init__(self):
            self.fonts = [
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

Then import it in the script above.

For the checkpoint command:
    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:
Hi mdalihu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:
OK, I will try as you said.

One more thing: what should the training-text lines be like? I have seen that Bengali text has long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
Include the default fonts in your fine-tuning list of fonts as well, and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:
I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official trained_text, tessdata_best, and everything else. Everything is good, except that the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can understand what is going on.
Code for creating the tif, gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def read_line_count():
        if os.path.exists('line_count.txt'):
            with open('line_count.txt', 'r') as file:
                return int(file.read())
        return 0

    def write_line_count(line_count):
        with open('line_count.txt', 'w') as file:
            file.write(str(line_count))

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            for line in input_file.readlines():
                lines.append(line.strip())

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        random.shuffle(lines)

        if start_line is None:
            line_count = read_line_count()  # Set the starting line_count from the file
        else:
            line_count = start_line

        if end_line is None:
            end_line_count = len(lines) - 1  # Set the ending line_count
        else:
            end_line_count = min(end_line, len(lines) - 1)

        for font in font_list.fonts:  # Iterate through all the fonts in the font_list
            font_serial = 1
            for line in lines:
                training_text_file_name = pathlib.Path(training_text_file).stem

                # Generate a unique serial number for each line
                line_serial = f"{line_count:d}"

                # GT (ground truth) text filename
                line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Image filename
                file_base_name = f'ben_{line_serial}'  # Unique filename for each font
                subprocess.run([
                    'text2image',
                    f'--font={font}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=350',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/ben.unicharset',
                ])

                line_count += 1
                font_serial += 1

            # Reset font_serial for the next font iteration
            font_serial = 1

        write_line_count(line_count)  # Update the line_count in the file

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/ben.training_text'
        output_directory = 'tesstrain/data/ben-ground-truth'

        # Create an instance of the FontList class
        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
        subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
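Since the question above is how to pin down where the training data goes wrong, one quick first pass is to check whether every codepoint in the generated .gt.txt lines is covered by some entry in the unicharset passed to text2image. A minimal sketch — the helper name and the per-codepoint approach are my assumptions, not from the thread, and Bengali grapheme clusters can span several codepoints, so hits are only candidates for a genuinely missing character:

```python
# Rough coverage check: which codepoints appear in the ground-truth
# texts but in no unicharset entry?  A per-codepoint check is only a
# coarse filter for Indic scripts, where one unichar may be a
# multi-codepoint cluster.
def uncovered_codepoints(gt_texts, unicharset_entries):
    # Every codepoint occurring inside any unicharset entry counts as known.
    known = {ch for entry in unicharset_entries for ch in entry}
    return sorted({ch for text in gt_texts for ch in text
                   if not ch.isspace() and ch not in known})

# Tiny demo with stand-in data:
print(uncovered_codepoints(['abc x'], ['a', 'b', 'x']))  # prints ['c']
```

In practice gt_texts would come from tesstrain/data/ben-ground-truth/*.gt.txt and unicharset_entries from the first token of each line of langdata/ben.unicharset (skipping the leading count line), with both files opened with encoding='utf-8' — the same encoding pitfall mentioned earlier in the thread.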