600,000 lines of text, and the iterations were higher than 600,000. But sometimes I get better results in fine-tuning with less data, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 [email protected] wrote:
> How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:

Yah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 [email protected] wrote:

Not good results; that's why I have stopped training for now. The default traineddata is overall better than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 [email protected] wrote:

Hi Ali,
How is your training going?
Do you get good results with the training-from-scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

Yes, two months ago when I started to learn OCR I saw that. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 [email protected] wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 [email protected] wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 [email protected] wrote:

I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script (i.e. opening the file with an explicit UTF-8 encoding). I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Have you faced this error: "Can't encode transcription"? If so, how did you solve it?
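The Unicode fix mentioned above, opening the training text with an explicit UTF-8 encoding instead of the platform default, can be sketched as below. This is a minimal sketch; `read_training_lines` and the path handling are illustrative, not taken from the thread's scripts:

```python
def read_training_lines(path):
    # On Windows, open(path, 'r') decodes with the ANSI code page, which
    # breaks on non-Latin scripts such as Bengali and can surface later as
    # "Can't encode transcription". Forcing UTF-8 makes the read portable.
    with open(path, 'r', encoding='utf8') as f:
        return [line.strip() for line in f]
```

The same `encoding='utf8'` argument can be added to the `open()` calls in the splitting scripts later in this thread.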
On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from the Tesseract default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 [email protected] wrote:

I am now at 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 [email protected] wrote:

After Tesseract recognizes text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 [email protected] wrote:

At what stage are you doing the regex replacement?
My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 [email protected] wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are always not recognized by Tesseract.
That's why I use these regex replacements to fix those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
"জ্া": "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
"য": "য",
" া": "া",
"আা": "আ",
"ম্ি": "মি",
"স্ু": "সু",
"হূ ": "হূ",
" ণ": "ণ",
"র্্": "র",
"চিন্ত ": "চিন্তা ",
"ন্া": "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/; in other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed, and at other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected] wrote:

If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expressions method in your application. After OCR, these specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. This can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected] wrote:

The characters are getting missed, even after fine-tuning.
I never made any progress. I tried many different ways. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 [email protected] wrote:

EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected] wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow.
The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata. But I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 [email protected] wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I use the default training dataset from Tesseract that is used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 [email protected] wrote:

How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] wrote:

Yes, he doesn't mention all fonts but only one font.
That's why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files created under MODEL_NAME (I mean the eng, ben, or oro flag, i.e. the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script to loop over each of the tif, gt.txt, and .box files created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful; it is all working for me. The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] wrote:

import subprocess

# List of model names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000")
    subprocess.run(command, shell=True)

1. This is the training command script, which I have named 'tesseract_training.py', inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better.
I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file like I did.
I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train:

import subprocess

# List of model names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000")
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] wrote:

You can use the new script below. It's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports resuming: if VS Code closes or anything happens while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where you left off.
Command for creating the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0
    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem
            line_serial = f"{line_index:d}"
            file_base_name = (f'{training_text_file_name}_{line_serial}'
                              f'_{font_name.replace(" ", "_")}')

            line_gt_text = os.path.join(output_directory,
                                        f'{file_base_name}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int,
                        help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int,
                        help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory,
                         args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this (note the comma after every font name):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the code above.

To resume from a checkpoint, run:

sudo python3 split_training_text.py --start 0 --end 11
Adjust the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] wrote:

Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts; the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] wrote:

OK, I will try as you said. One more thing: what role do the trained_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official trained_text, tessdata_best, and the other required files. Everything is good, but the problem is that the default fonts, which were trained before, no longer convert text as well as they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code to help you understand what's going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *codes for creating tif, gt.txt, .box >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> files:* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def read_line_count(): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if os.path.exists('line_count.txt'): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open('line_count.txt', 'r') as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> return int(file.read()) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> return 0 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open('line_count.txt', 'w') as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file.write(str(line_count)) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_list, output_directory, start_line=None, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> end_line=None): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lines = [] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open(training_text_file, 'r') as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input_file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for line in input_file.readlines(): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lines.append(line.strip()) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if not os.path.exists(output_directory): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> os.mkdir(output_directory) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> random.shuffle(lines) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if start_line is None: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_count = 
        read_line_count()  # Resume line_count from the checkpoint file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Default to the last line
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all fonts in font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (ground truth) text filename
            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename: unique for each line
            file_base_name = f'ben_{line_serial}'

            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count checkpoint file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list,
                         output_directory, args.start, args.end)

and for the training code:

import subprocess

# List of model names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata "
               f"MAX_ITERATIONS=10000 LANG_TYPE=Indic")
    subprocess.run(command, shell=True)

Any suggestions on how to identify the problem?
Thanks, everyone.
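[Editor's note] One likely source of the "Can't encode transcription" error discussed in this thread: the script above writes the ground-truth file with open(line_gt_text, 'w'), which uses the platform default codec. On Windows that is usually cp1252, which cannot encode Bengali text. A minimal sketch of the same write with an explicit UTF-8 encoding (the helper name and sample line here are illustrative, not from the original script):

```python
import os
import tempfile

def write_gt_line(output_directory, base_name, line):
    """Write one ground-truth line as UTF-8, regardless of the platform default."""
    gt_path = os.path.join(output_directory, f"{base_name}.gt.txt")
    # encoding='utf-8' avoids UnicodeEncodeError on Windows (default cp1252)
    with open(gt_path, "w", encoding="utf-8") as f:
        f.write(line)
    return gt_path

# Illustrative usage with a Bengali sample line
out_dir = tempfile.mkdtemp()
path = write_gt_line(out_dir, "ben_1", "আমার সোনার বাংলা")
with open(path, encoding="utf-8") as f:
    print(f.read())
```

Reading the training text the same way (encoding='utf-8' on every open() call) is the counterpart fix mentioned earlier in the thread.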
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.

