You can use the new script below; it's better than the previous two. It creates the *tif, gt.txt, and .box files* for multiple fonts in one run. It also supports a breakpoint: if VS Code closes, or anything else interrupts the run while the files are being created, you can use --start and --end as a checkpoint to pick up where it stopped.
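If you are not sure where a run stopped, a small sketch like the one below can look it up from the files already in the output folder. It assumes the naming used by the script in this post (text stem, line number, then the font name with spaces turned into underscores, ending in .gt.txt). The function name last_started_line is made up for this example; it is not part of Tesseract or tesstrain. It returns the highest line number it finds rather than the next one, on the assumption that the last line may have been cut off halfway and should be redone.

```python
import pathlib
import re
import tempfile


def last_started_line(output_directory, text_stem, font_name):
    """Return the highest line index that already has a .gt.txt file for
    this font, or 0 if none exists. Hypothetical helper, for illustration."""
    font_tag = font_name.replace(" ", "_")
    # Matches e.g. eng_17_Sagar_Medium.gt.txt and captures the 17
    pattern = re.compile(
        rf"^{re.escape(text_stem)}_(\d+)_{re.escape(font_tag)}\.gt\.txt$"
    )
    started = [
        int(match.group(1))
        for path in pathlib.Path(output_directory).glob("*.gt.txt")
        if (match := pattern.match(path.name))
    ]
    return max(started) if started else 0


if __name__ == "__main__":
    # Small demo on a throwaway folder with fake ground-truth files
    with tempfile.TemporaryDirectory() as tmp:
        for index in (0, 1, 2):
            pathlib.Path(tmp, f"eng_{index}_Sagar_Medium.gt.txt").write_text("x")
        print(last_started_line(tmp, "eng", "Sagar Medium"))
```

The number it reports can then be passed as --start. Since the script loops over fonts on the outside, you may want to check the last font in your list, or each font in turn.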
*Script for creating the tif, gt.txt, and .box files:*

import os
import pathlib
import subprocess
import argparse
from FontList import FontList


def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    with open(training_text_file, 'r', encoding='utf-8') as input_file:
        lines = input_file.readlines()

    # makedirs also creates missing parent folders
    os.makedirs(output_directory, exist_ok=True)

    # Default to the whole file, and keep the end inside the file
    if start_line is None:
        start_line = 0
    if end_line is None:
        end_line = len(lines) - 1
    end_line = min(end_line, len(lines) - 1)

    training_text_file_name = pathlib.Path(training_text_file).stem

    for font_name in font_list.fonts:
        font_tag = font_name.replace(" ", "_")
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            # One base name per line and font, so fonts never overwrite each other
            file_base_name = f'{training_text_file_name}_{line_index:d}_{font_tag}'

            # GT (ground truth) text file
            line_gt_text = os.path.join(output_directory, f'{file_base_name}.gt.txt')
            with open(line_gt_text, 'w', encoding='utf-8') as output_file:
                output_file.write(line)

            # Render the line to .tif and .box with text2image
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory,
                         args.start, args.end)

Then create a file called "FontList.py" in the root directory (next to the script, so the import finds it) and paste this into it:

class FontList:
    def __init__(self):
        # One font name per entry, written as text2image expects it
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the script above.

*For the breakpoint command:*

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint values (--start 0 --end 11) to suit your run.

*And the training checkpoint works as you already know.*

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

> Hi mhalidu,
> the script you posted here seems much more extensive than the one you posted before:
> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>
> I have been using your earlier script. It is magical. How is this one different from the earlier one?
>
> Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts. The official manual is also unclear. Your script helped me get started.
>
> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:
>
>> OK, I will try as you said.
>> One more thing: what role do the trained_text lines play? I have seen that Bengali text has long lines of words, so I want to know how many words or characters per line would be the better choice for training.
>> And should '--xsize=3600','--ysize=350' be set according to the number of words per line?
>>
>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>
>>> Include the default fonts also in your fine-tuning list of fonts and see if that helps.
>>>
>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:
>>>
>>>> I have trained some new fonts by fine-tuning for the Bengali language in Tesseract 5, and I have used the official trained_text, tessdata_best, and the other official files.
>>>> Everything is good, but the problem is that the default font, which was trained before, no longer converts text as well as it did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.
>>>>
>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>
>>>> import os
>>>> import random
>>>> import pathlib
>>>> import subprocess
>>>> import argparse
>>>> from FontList import FontList
>>>>
>>>> def read_line_count():
>>>>     if os.path.exists('line_count.txt'):
>>>>         with open('line_count.txt', 'r') as file:
>>>>             return int(file.read())
>>>>     return 0
>>>>
>>>> def write_line_count(line_count):
>>>>     with open('line_count.txt', 'w') as file:
>>>>         file.write(str(line_count))
>>>>
>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>     lines = []
>>>>     with open(training_text_file, 'r') as input_file:
>>>>         for line in input_file.readlines():
>>>>             lines.append(line.strip())
>>>>
>>>>     if not os.path.exists(output_directory):
>>>>         os.mkdir(output_directory)
>>>>
>>>>     random.shuffle(lines)
>>>>
>>>>     if start_line is None:
>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>     else:
>>>>         line_count = start_line
>>>>
>>>>     if end_line is None:
>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>     else:
>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>
>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>         font_serial = 1
>>>>         for line in lines:
>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>
>>>>             # Generate a unique serial number for each line
>>>>             line_serial = f"{line_count:d}"
>>>>
>>>>             # GT (Ground Truth) text filename
>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>             with open(line_gt_text, 'w') as output_file:
>>>>                 output_file.writelines([line])
>>>>
>>>>             # Image filename
>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>             subprocess.run([
>>>>                 'text2image',
>>>>                 f'--font={font}',
>>>>                 f'--text={line_gt_text}',
>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>                 '--max_pages=1',
>>>>                 '--strip_unrenderable_words',
>>>>                 '--leading=36',
>>>>                 '--xsize=3600',
>>>>                 '--ysize=350',
>>>>                 '--char_spacing=1.0',
>>>>                 '--exposure=0',
>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>             ])
>>>>
>>>>             line_count += 1
>>>>             font_serial += 1
>>>>
>>>>         # Reset font_serial for the next font iteration
>>>>         font_serial = 1
>>>>
>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>
>>>> if __name__ == "__main__":
>>>>     parser = argparse.ArgumentParser()
>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>     args = parser.parse_args()
>>>>
>>>>     training_text_file = 'langdata/ben.training_text'
>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>
>>>>     # Create an instance of the FontList class
>>>>     font_list = FontList()
>>>>
>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>
>>>> *And the training code:*
>>>>
>>>> import subprocess
>>>>
>>>> # List of font names
>>>> font_names = ['ben']
>>>>
>>>> for font in font_names:
>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>     subprocess.run(command, shell=True)
>>>>
>>>> Any suggestions for identifying the problem?
>>>> Thanks, everyone.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
>>>>
>>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c504c4e3-4cd3-4514-b61d-819c77ba933en%40googlegroups.com.