Yes, he only covers a single font, not all the fonts. That is why he didn't use *MODEL_NAME* the way my script does.
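A minimal sketch of the point about *MODEL_NAME*, assuming the usual tesstrain layout in which `make training MODEL_NAME=<name>` reads every image/.gt.txt pair under `data/<name>-ground-truth`, whichever font rendered it. The helper names `ground_truth_dir` and `build_training_command` are hypothetical (they are not part of tesstrain), and the command is only built here, not run:

```python
from pathlib import Path


def ground_truth_dir(model_name: str, data_dir: str = "data") -> Path:
    """Folder that tesstrain is assumed to scan for the given MODEL_NAME."""
    return Path(data_dir) / f"{model_name}-ground-truth"


def build_training_command(model_name: str, start_model: str,
                           tessdata: str = "../tesseract/tessdata",
                           max_iterations: int = 10000) -> str:
    """Build (but do not run) the make command quoted in this thread."""
    return (
        f"TESSDATA_PREFIX={tessdata} make training "
        f"MODEL_NAME={model_name} START_MODEL={start_model} "
        f"TESSDATA={tessdata} MAX_ITERATIONS={max_iterations}"
    )


if __name__ == "__main__":
    # One model name means one ground-truth folder, however many fonts
    # were used to render the files inside it.
    print(ground_truth_dir("ben"))
    print(build_training_command("ben", "ben"))
```

Under that assumption, the loop in the training script iterates over model names, not fonts; the fonts only matter when the ground-truth files are generated.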
Actually, here we train on all the *tif, gt.txt, and .box files* that were
created under *MODEL_NAME* (I mean the language code, such as eng, ben, or
oro), because when we first create the *tif, gt.txt, and .box files*, every
file name starts with *MODEL_NAME*. This is the *MODEL_NAME* we select in the
training script, which then loops over each of the tif, gt.txt, and .box
files created under that *MODEL_NAME*.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

> Yes, I am familiar with the video and have set up the folder structure as
> you did. Indeed, I have tried a number of fine-tuning runs with a single
> font, following Garcia's video. But your script is much better because it
> supports multiple fonts. The whole improvement you made is brilliant and
> very useful. It is all working for me.
>
> The only part that I didn't understand is the trick you used in your
> tesseract_train.py script. You see, I have been doing exactly as you did,
> except for this script.
>
> The script seems to have the trick of sending/teaching each of the fonts
> (iteratively) into the model. The script I have been using (which I got
> from Garcia) doesn't mention fonts at all.
>
> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>
> Does it mean that my model doesn't train on the fonts (even if the fonts
> have been included in the splitting process, in the other script)?
>
> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com
> wrote:
>
>> import subprocess
>>
>> # List of font names
>> font_names = ['ben']
>>
>> for font in font_names:
>>     command = (
>>         f"TESSDATA_PREFIX=../tesseract/tessdata make training "
>>         f"MODEL_NAME={font} START_MODEL=ben "
>>         f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>     )
>>     subprocess.run(command, shell=True)
>>
>> 1. This is the training command, which I saved as
>> 'tesseract_training.py' inside the tesstrain folder.
>>
>> 2. The root directory means your main training folder, which contains
>> the langdata, tesseract, and tesstrain folders. If you watch this
>> tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand
>> the folder structure better. The only differences are that I created
>> tesseract_training.py in the tesstrain folder for training, and that
>> FontList.py sits in the main folder alongside langdata, tesseract,
>> tesstrain, and split_training_text.py.
>>
>> 3. First of all, you have to put all the fonts in your Linux fonts
>> folder, /usr/share/fonts/, then run: sudo apt update, then
>> sudo fc-cache -fv
>>
>> After that, you have to add the exact font names to the FontList.py file,
>> as I did.
>>
>> I have attached two pictures of my folder structure. The first is the
>> main structure and the second is the collapsed tesstrain folder.
>>
>> [image: Screenshot 2023-09-11 134947.png]
>> [image: Screenshot 2023-09-11 135014.png]
>>
>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com
>> wrote:
>>
>>> Thank you so much for putting out these brilliant scripts. They make the
>>> process much more efficient.
>>>
>>> I have one more question on the other script that you use to train.
>>>
>>> import subprocess
>>>
>>> # List of font names
>>> font_names = ['ben']
>>>
>>> for font in font_names:
>>>     command = (
>>>         f"TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>         f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>     )
>>>     subprocess.run(command, shell=True)
>>>
>>> Do you have the names of the fonts listed in a file in the same/root
>>> directory? How do you set up the names of the fonts in the file, if you
>>> don't mind sharing it?
>>>
>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com
>>> wrote:
>>>
>>>> You can use the new script below. It's better than the previous two
>>>> scripts.
>>>> You can create *tif, gt.txt, and .box files* with multiple fonts, and
>>>> you can also use a breakpoint: if VS Code closes or anything else
>>>> interrupts the creation of the *tif, gt.txt, and .box files*, you can
>>>> use the checkpoint to resume from where it stopped.
>>>>
>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>
>>>> import os
>>>> import pathlib
>>>> import subprocess
>>>> import argparse
>>>> from FontList import FontList
>>>>
>>>> def create_training_data(training_text_file, font_list,
>>>>                          output_directory, start_line=None, end_line=None):
>>>>     # Read every line of the training text up front.
>>>>     with open(training_text_file, 'r') as input_file:
>>>>         lines = input_file.readlines()
>>>>
>>>>     os.makedirs(output_directory, exist_ok=True)
>>>>
>>>>     if start_line is None:
>>>>         start_line = 0
>>>>
>>>>     # Clamp the end so an oversized --end cannot index past the text.
>>>>     last_line = len(lines) - 1
>>>>     end_line = last_line if end_line is None else min(end_line, last_line)
>>>>
>>>>     training_text_file_name = pathlib.Path(training_text_file).stem
>>>>
>>>>     # Render the selected line range once per font.
>>>>     for font_name in font_list.fonts:
>>>>         safe_font_name = font_name.replace(" ", "_")
>>>>         for line_index in range(start_line, end_line + 1):
>>>>             line = lines[line_index].strip()
>>>>
>>>>             line_serial = f"{line_index:d}"
>>>>
>>>>             # Line number and font name keep every file name unique.
>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{safe_font_name}'
>>>>
>>>>             line_gt_text = os.path.join(output_directory, f'{file_base_name}.gt.txt')
>>>>             with open(line_gt_text, 'w') as output_file:
>>>>                 output_file.writelines([line])
>>>>
>>>>             subprocess.run([
>>>>                 'text2image',
>>>>                 f'--font={font_name}',
>>>>                 f'--text={line_gt_text}',
>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>                 '--max_pages=1',
>>>>                 '--strip_unrenderable_words',
>>>>                 '--leading=36',
>>>>                 '--xsize=3600',
>>>>                 '--ysize=330',
>>>>                 '--char_spacing=1.0',
>>>>                 '--exposure=0',
>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>             ])
>>>>
>>>> if __name__ == "__main__":
>>>>     parser = argparse.ArgumentParser()
>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>     args = parser.parse_args()
>>>>
>>>>     training_text_file = 'langdata/eng.training_text'
>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>
>>>>     font_list = FontList()
>>>>
>>>>     create_training_data(training_text_file, font_list,
>>>>                          output_directory, args.start, args.end)
>>>>
>>>> Then create a file called "FontList.py" in the root directory and paste
>>>> this into it:
>>>>
>>>> class FontList:
>>>>     def __init__(self):
>>>>         self.fonts = [
>>>>             "Gerlick",
>>>>             "Sagar Medium",
>>>>             "Ekushey Lohit Normal",
>>>>             "Charukola Round Head Regular, weight=433",
>>>>             "Charukola Round Head Bold, weight=443",
>>>>             "Ador Orjoma Unicode",
>>>>         ]
>>>>
>>>> It is then imported by the script above.
>>>>
>>>> *Breakpoint command:*
>>>>
>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>
>>>> Change the checkpoint values (--start 0 --end 11) to suit your run.
>>>>
>>>> *And the training checkpoint works as you already know.*
>>>>
>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com
>>>> wrote:
>>>>
>>>>> Hi mhalidu,
>>>>> the script you posted here seems much more extensive than the one you
>>>>> posted before:
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>
>>>>> I have been using your earlier script. It is magical. How is this one
>>>>> different from the earlier one?
>>>>>
>>>>> Thank you for posting these scripts, by the way. They have saved me
>>>>> countless hours by running multiple fonts in one sweep. I was not able
>>>>> to find any instructions on how to train for multiple fonts. The
>>>>> official manual is also unclear. Your script helped me to get started.
>>>>>
>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com
>>>>> wrote:
>>>>>
>>>>>> OK, I will try as you said.
>>>>>> One more thing: what role do the trained_text lines play? I have
>>>>>> seen that Bengali text has long lines of words, so I want to know
>>>>>> how many words or characters per line would be the better choice for
>>>>>> training. And should '--xsize=3600','--ysize=350' be set according to
>>>>>> the number of words per line?
>>>>>>
>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>
>>>>>>> Include the default fonts also in your fine-tuning list of fonts and
>>>>>>> see if that helps.
>>>>>>>
>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have trained some new fonts by the fine-tuning method for the
>>>>>>>> Bengali language in Tesseract 5, using the official trained_text,
>>>>>>>> tessdata_best, and the other standard inputs. Everything is good,
>>>>>>>> but the problem is that the default font, which was trained
>>>>>>>> before, no longer converts text as well as it did previously,
>>>>>>>> while my new fonts work well. I don't understand why this is
>>>>>>>> happening. I am sharing the code so you can see what is going on.
>>>>>>>>
>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>
>>>>>>>> import os
>>>>>>>> import random
>>>>>>>> import pathlib
>>>>>>>> import subprocess
>>>>>>>> import argparse
>>>>>>>> from FontList import FontList
>>>>>>>>
>>>>>>>> def read_line_count():
>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>             return int(file.read())
>>>>>>>>     return 0
>>>>>>>>
>>>>>>>> def write_line_count(line_count):
>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>         file.write(str(line_count))
>>>>>>>>
>>>>>>>> def create_training_data(training_text_file, font_list,
>>>>>>>>                          output_directory, start_line=None, end_line=None):
>>>>>>>>     lines = []
>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>         for line in input_file.readlines():
>>>>>>>>             lines.append(line.strip())
>>>>>>>>
>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>
>>>>>>>>     random.shuffle(lines)
>>>>>>>>
>>>>>>>>     if start_line is None:
>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>     else:
>>>>>>>>         line_count = start_line
>>>>>>>>
>>>>>>>>     if end_line is None:
>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>     else:
>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>
>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>         font_serial = 1
>>>>>>>>         for line in lines:
>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>
>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>
>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>                 output_file.writelines([line])
>>>>>>>>
>>>>>>>>             # Image filename
>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>>>>>             subprocess.run([
>>>>>>>>                 'text2image',
>>>>>>>>                 f'--font={font}',
>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>                 '--max_pages=1',
>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>                 '--leading=36',
>>>>>>>>                 '--xsize=3600',
>>>>>>>>                 '--ysize=350',
>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>                 '--exposure=0',
>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>             ])
>>>>>>>>
>>>>>>>>             line_count += 1
>>>>>>>>             font_serial += 1
>>>>>>>>
>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>         font_serial = 1
>>>>>>>>
>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>
>>>>>>>> if __name__ == "__main__":
>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>     args = parser.parse_args()
>>>>>>>>
>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>
>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>     font_list = FontList()
>>>>>>>>
>>>>>>>>     create_training_data(training_text_file, font_list,
>>>>>>>>                          output_directory, args.start, args.end)
>>>>>>>>
>>>>>>>> *And the training code:*
>>>>>>>>
>>>>>>>> import subprocess
>>>>>>>>
>>>>>>>> # List of font names
>>>>>>>> font_names = ['ben']
>>>>>>>>
>>>>>>>> for font in font_names:
>>>>>>>>     command = (
>>>>>>>>         f"TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>         f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
>>>>>>>>         f"LANG_TYPE=Indic"
>>>>>>>>     )
>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>
>>>>>>>> Any suggestions for identifying the problem would be welcome.
>>>>>>>> Thanks, everyone.
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
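Following shree's suggestion to include the default fonts in the fine-tuning list, one way to check which fonts actually ended up in the ground-truth folder is sketched below. It assumes the file naming used by the newer script in this thread (`<text>_<line>_<Font_Name>.gt.txt`, with no underscore in the `<text>` part); the function names `count_lines_per_font` and `missing_fonts` are hypothetical helpers, not part of tesstrain:

```python
from collections import Counter
from pathlib import Path

GT_SUFFIX = ".gt.txt"


def count_lines_per_font(ground_truth_dir):
    """Count .gt.txt files per font, based on the assumed file naming."""
    counts = Counter()
    for gt_file in Path(ground_truth_dir).glob(f"*{GT_SUFFIX}"):
        base_name = gt_file.name[: -len(GT_SUFFIX)]
        # Expected shape: <text>_<line>_<Font_Name>
        parts = base_name.split("_", 2)
        if len(parts) == 3 and parts[1].isdigit():
            counts[parts[2]] += 1
    return counts


def missing_fonts(ground_truth_dir, expected_fonts):
    """Return the expected fonts that have no ground-truth files at all."""
    counts = count_lines_per_font(ground_truth_dir)
    return [font for font in expected_fonts
            if counts[font.replace(" ", "_")] == 0]


if __name__ == "__main__":
    folder = "tesstrain/data/eng-ground-truth"
    for font, total in sorted(count_lines_per_font(folder).items()):
        print(f"{font}: {total} lines")
```

Passing the `fonts` list from FontList.py (plus the names of the default fonts) to `missing_fonts` would list any font that never made it into the folder that training reads.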