Just saw this paper: https://osf.io/b8h7q
On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:

> I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:

> I also faced that issue on Windows. Apparently the issue is related to Unicode; you can try your luck by changing "r" so that the script opens files with UTF-8 encoding. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:

> Did you face this error, "Can't encode transcription"? If so, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:

> I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:

> Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:

> I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:

> After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:

> At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf
>
> > EasyOCR I think is best for ID cards or similar images, but for document images like books Tesseract is better than EasyOCR.
>
> How about PaddleOCR? Are you familiar with it?
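The "change 'r' to utf8" suggestion above boils down to passing an explicit encoding to Python's open(): on Windows the default is the locale code page (often cp1252), which cannot handle Bengali text and triggers errors like "Can't encode transcription". A minimal sketch, with an illustrative file path:

```python
import os
import tempfile

# A sample Bengali ground-truth line; the path below is illustrative.
line = "সম্পূর্ণ"
path = os.path.join(tempfile.mkdtemp(), "ben.gt.txt")

# Write and read with an explicit encoding instead of relying on the
# platform default (the platform default is what breaks on Windows).
with open(path, "w", encoding="utf-8") as f:
    f.write(line)

with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(restored == line)  # True
```

The same `encoding="utf-8"` argument applies to every open() call in the training scripts quoted later in this thread.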
On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:

> I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regex rules to replace those characters and words when they come out wrong.
>
> Here is what I have done:
>
> " ী": "ী",
> " ্": " ",
> " ে": " ",
> "জ্া": "জা",
> " ": " ",
> " ": " ",
> " ": " ",
> "্প": " ",
> " য": "র্য",
> "য": "য",
> " া": "া",
> "আা": "আ",
> "ম্ি": "মি",
> "স্ু": "সু",
> "হূ ": "হূ",
> " ণ": "ণ",
> "র্্": "র",
> "চিন্ত ": "চিন্তা ",
> "ন্া": "না",
> "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:

> The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes completely removed, and other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:

> If some specific characters or words are always missing from the OCR result,
> then you can apply logic with regular expressions in your application: after OCR, those specific characters or words are replaced by the correct ones that you defined. That can solve some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:

> The characters are getting missed even after fine-tuning. I never made any progress, though I tried many different approaches. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:

> EasyOCR, I think, is best for ID cards and similar images, but for document images like books Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.
>
> I have added dictionary words, but the result is the same.
>
> What kind of problem did you face when fine-tuning for a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

> Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.
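The regex post-processing discussed above can be sketched as a single substitution pass over the OCR output. The correction table here reuses two entries from the mapping posted earlier in the thread and is purely illustrative:

```python
import re

# Illustrative correction table: each key is a sequence Tesseract
# consistently gets wrong, each value the intended replacement.
REPLACEMENTS = {
    "আা": "আ",
    "সম ূর্ন": "সম্পূর্ণ",
}

# One compiled alternation, longest keys first, so multi-character
# sequences are matched before any shorter overlapping key.
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True)))

def post_correct(text: str) -> str:
    # Replace every occurrence of a known-bad sequence in one pass.
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(0)], text)
```

As noted in the thread, this only helps for errors Tesseract makes consistently; when a character is sometimes dropped and sometimes substituted, no fixed table can recover it.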
> Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
>
> Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.
>
> If all this fails, EasyOCR (and probably other similar open-source packages) is our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

> > How is your training going for Bengali?
>
> It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have none. I think the problem is in the dataset, but I used the default Tesseract training dataset for Bengali (ben), so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can experiment with different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

> How is your training going for Bengali?
> I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

> Yes, he mentions only one font, not all fonts; that is why he didn't use MODEL_NAME in a separate script file, I think.
>
> Actually, here we train on all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

> Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts.
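A note on the gap reported above between the training error rate (0.46–0.51) and the "absolutely terrible" real results: the figure printed during training is not a measure of accuracy on real scanned pages. A quick way to put a number on the latter is to compute the character error rate (edit distance over reference length) between a hand-checked transcript and the OCR output of a held-out image. A minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edits needed, normalised by reference length.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

Tracking CER on a few held-out pages after each training run gives a comparable number, independently of what the training log reports.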
> The whole improvement you made is brilliant and very useful. It is all working for me.
>
> The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.
>
> The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
>
> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>
> Does this mean that my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

> import subprocess
>
> # List of font names
> font_names = ['ben']
>
> for font in font_names:
>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>     subprocess.run(command, shell=True)
>
> 1. This is the training command, in the file I have named 'tesseract_training.py' inside the tesstrain folder.
> 2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file sits in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
> 3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.
>
> After that, you have to add the exact font names to the FontList.py file, as I did.
>
> I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.
>
> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

> Thank you so much for putting out these brilliant scripts. They make the process much more efficient.
>
> I have one more question about the other script that you use to train.
> import subprocess
>
> # List of font names
> font_names = ['ben']
>
> for font in font_names:
>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>     subprocess.run(command, shell=True)
>
> Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the font names in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

> You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.
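The breakpoint/checkpoint idea described above comes down to persisting the index of the last processed line so an interrupted run can pick up where it stopped. A stripped-down sketch (the checkpoint file name and the stand-in processing step are illustrative):

```python
import os

COUNT_FILE = "line_count.txt"  # illustrative checkpoint file

def read_line_count(path=COUNT_FILE):
    # Return the last saved line index, or 0 on a fresh run.
    if os.path.exists(path):
        with open(path, "r") as f:
            return int(f.read())
    return 0

def write_line_count(n, path=COUNT_FILE):
    # Persist progress so an interrupted run can resume.
    with open(path, "w") as f:
        f.write(str(n))

lines = [f"line {i}" for i in range(100)]  # stand-in for the training text
done = read_line_count()                   # skip work already completed
for i in range(done, len(lines)):
    # ... generate the tif/gt.txt/.box files for lines[i] here ...
    write_line_count(i + 1)                # checkpoint after each line
```

Writing the counter after every line keeps the checkpoint at most one line stale, at the cost of one small file write per iteration.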
> The script for creating the tif, gt.txt, and .box files:
>
> import os
> import random
> import pathlib
> import subprocess
> import argparse
> from FontList import FontList
>
> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>     with open(training_text_file, 'r') as input_file:
>         lines = input_file.readlines()
>
>     if not os.path.exists(output_directory):
>         os.mkdir(output_directory)
>
>     if start_line is None:
>         start_line = 0
>
>     if end_line is None:
>         end_line = len(lines) - 1
>
>     for font_name in font_list.fonts:
>         for line_index in range(start_line, end_line + 1):
>             line = lines[line_index].strip()
>
>             training_text_file_name = pathlib.Path(training_text_file).stem
>             line_serial = f"{line_index:d}"
>
>             line_gt_text = os.path.join(
>                 output_directory,
>                 f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>             with open(line_gt_text, 'w') as output_file:
>                 output_file.writelines([line])
>
>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>             subprocess.run([
>                 'text2image',
>                 f'--font={font_name}',
>                 f'--text={line_gt_text}',
>                 f'--outputbase={output_directory}/{file_base_name}',
>                 '--max_pages=1',
>                 '--strip_unrenderable_words',
>                 '--leading=36',
>                 '--xsize=3600',
>                 '--ysize=330',
>                 '--char_spacing=1.0',
>                 '--exposure=0',
>                 '--unicharset_file=langdata/eng.unicharset',
>             ])
>
> if __name__ == "__main__":
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>     args = parser.parse_args()
>
>     training_text_file = 'langdata/eng.training_text'
>     output_directory = 'tesstrain/data/eng-ground-truth'
>
>     font_list = FontList()
>
>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>
> Then create a file called "FontList.py" in the root directory and paste the following into it.
> class FontList:
>     def __init__(self):
>         self.fonts = [
>             "Gerlick",
>             "Sagar Medium",
>             "Ekushey Lohit Normal",
>             "Charukola Round Head Regular, weight=433",
>             "Charukola Round Head Bold, weight=443",
>             "Ador Orjoma Unicode",
>         ]
>
> Then it is imported in the script above.
>
> The breakpoint command:
>
> sudo python3 split_training_text.py --start 0 --end 11
>
> Change the checkpoint range (--start 0 --end 11) as needed. The training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

> Hi mdalihu,
>
> The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>
> I have been using your earlier script. It is magical. How is this one different from the earlier one?
>
> Thank you for posting these scripts, by the way.
> It has saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

> OK, I will try as you said.
>
> One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

> Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

> I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. Everything is good, but the problem is that the default fonts, which were trained before, no longer convert text as well as previously, while my new fonts work well.
> I don't understand why it's happening. I'm sharing the code below to help show what is going on.
>
> Code for creating the tif, gt.txt, and .box files:
>
> import os
> import random
> import pathlib
> import subprocess
> import argparse
> from FontList import FontList
>
> def read_line_count():
>     if os.path.exists('line_count.txt'):
>         with open('line_count.txt', 'r') as file:
>             return int(file.read())
>     return 0
>
> def write_line_count(line_count):
>     with open('line_count.txt', 'w') as file:
>         file.write(str(line_count))
>
> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>     lines = []
>     with open(training_text_file, 'r') as input_file:
>         for line in input_file.readlines():
>             lines.append(line.strip())
>
>     if not os.path.exists(output_directory):
>         os.mkdir(output_directory)
>
>     random.shuffle(lines)
>
>     if start_line is None:
>         line_count = read_line_count()  # Resume from the line_count saved in the file
>     else:
>         line_count = start_line
>
>     if end_line is None:
>         end_line_count = len(lines) - 1  # Default ending line_count
>     else:
>         end_line_count = min(end_line, len(lines) - 1)
>
>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>         font_serial = 1
>         for line in lines:
>             training_text_file_name = pathlib.Path(training_text_file).stem
>
>             # Generate a unique serial number for each line
>             line_serial = f"{line_count:d}"
>
>             # GT (Ground Truth) text filename
>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>             with open(line_gt_text, 'w') as output_file:
>                 output_file.writelines([line])
>
>             # Image filename (unique for each font)
>             file_base_name = f'ben_{line_serial}'
>             subprocess.run([
>                 'text2image',
>                 f'--font={font}',
>                 f'--text={line_gt_text}',
>                 f'--outputbase={output_directory}/{file_base_name}',
>                 '--max_pages=1',
>                 '--strip_unrenderable_words',
>                 '--leading=36',
>                 '--xsize=3600',
>                 '--ysize=350',
>                 '--char_spacing=1.0',
>                 '--exposure=0',
>                 '--unicharset_file=langdata/ben.unicharset',
>             ])
>
>             line_count += 1
>             font_serial += 1
>
>         # Reset font_serial for the next font iteration
>         font_serial = 1
>
>     write_line_count(line_count)  # Update the line_count in the file
>
> if __name__ == "__main__":
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>     args = parser.parse_args()
>
>     training_text_file = 'langdata/ben.training_text'
>     output_directory = 'tesstrain/data/ben-ground-truth'
>
>     # Create an instance of the FontList class
>     font_list = FontList()
>
>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>
> And the training code:
>
> import subprocess
>
> # List of font names
> font_names = ['ben']
>
> for font in font_names:
>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>     subprocess.run(command, shell=True)
>
> Any suggestions to help pin down the problem? Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com