Thank you so much for putting out these brilliant scripts. They make the process much more efficient.
I have one more question on the other script that you use to train.

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same (root)
directory? How do you set up the names of the fonts in that file, if you
don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

> You can use the new script below. It's better than the previous two
> scripts. You can create the tif, gt.txt, and .box files for multiple
> fonts, and if VS Code closes or anything else interrupts the run while
> the files are being created, you can use a checkpoint to resume from
> where it stopped.
>
> *Script for creating the tif, gt.txt, and .box files:*
>
> import os
> import pathlib
> import subprocess
> import argparse
> from FontList import FontList
>
> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>     with open(training_text_file, 'r') as input_file:
>         lines = input_file.readlines()
>
>     # makedirs also creates missing parent directories
>     os.makedirs(output_directory, exist_ok=True)
>
>     if start_line is None:
>         start_line = 0
>
>     # Clamp so that --end cannot run past the last line of the file
>     if end_line is None or end_line > len(lines) - 1:
>         end_line = len(lines) - 1
>
>     for font_name in font_list.fonts:
>         for line_index in range(start_line, end_line + 1):
>             line = lines[line_index].strip()
>
>             training_text_file_name = pathlib.Path(training_text_file).stem
>
>             line_serial = f"{line_index:d}"
>
>             # One ground-truth file per (line, font) pair
>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>
>             with open(line_gt_text, 'w') as output_file:
>                 output_file.writelines([line])
>
>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>             subprocess.run([
>                 'text2image',
>                 f'--font={font_name}',
>                 f'--text={line_gt_text}',
>                 f'--outputbase={output_directory}/{file_base_name}',
>                 '--max_pages=1',
>                 '--strip_unrenderable_words',
>                 '--leading=36',
>                 '--xsize=3600',
>                 '--ysize=330',
>                 '--char_spacing=1.0',
>                 '--exposure=0',
>                 '--unicharset_file=langdata/eng.unicharset',
>             ])
>
> if __name__ == "__main__":
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>     args = parser.parse_args()
>
>     training_text_file = 'langdata/eng.training_text'
>     output_directory = 'tesstrain/data/eng-ground-truth'
>
>     font_list = FontList()
>
>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>
> Then create a file called "FontList.py" in the root directory and paste
> this into it:
>
> class FontList:
>     def __init__(self):
>         # Every entry needs a trailing comma; without it Python joins
>         # adjacent strings into a single font name.
>         self.fonts = [
>             "Gerlick",
>             "Sagar Medium",
>             "Ekushey Lohit Normal",
>             "Charukola Round Head Regular, weight=433",
>             "Charukola Round Head Bold, weight=443",
>             "Ador Orjoma Unicode",
>         ]
>
> Then import it in the script above.
>
> *Command for resuming from a checkpoint:*
>
> sudo python3 split_training_text.py --start 0 --end 11
>
> Change --start 0 --end 11 to the line range you want to process.
>
> *And you already know how the training checkpoint works.*
>
> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:
>
>> Hi mhalidu,
>> the script you posted here seems much more extensive than the one you
>> posted before:
>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>
>> I have been using your earlier script. It is magical. How is this one
>> different from the earlier one?
>>
>> Thank you for posting these scripts, by the way. They have saved me
>> countless hours by running multiple fonts in one sweep.
>> I was not able to find any instructions on how to train for multiple
>> fonts. The official manual is also unclear. Your script helped me to
>> get started.
>>
>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:
>>
>>> OK, I will try as you said.
>>> One more thing: what role do the lines in the training_text play? I
>>> have seen that Bengali text has long lines of words, so I want to
>>> know how many words or characters per line would be the better
>>> choice for training.
>>> And should '--xsize=3600', '--ysize=350' be set according to the
>>> number of words per line?
>>>
>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>
>>>> Include the default fonts also in your fine-tuning list of fonts and
>>>> see if that helps.
>>>>
>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:
>>>>
>>>>> I have trained some new fonts by the fine-tuning method for the
>>>>> Bengali language in Tesseract 5, using the official training_text,
>>>>> tessdata_best, and everything else. Everything works, but the
>>>>> problem is that the default font, which was trained before, no
>>>>> longer converts text as it did previously, while my new fonts work
>>>>> well. I don't understand why this is happening. I am sharing the
>>>>> code so you can see what is going on.
>>>>>
>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>
>>>>> import os
>>>>> import random
>>>>> import pathlib
>>>>> import subprocess
>>>>> import argparse
>>>>> from FontList import FontList
>>>>>
>>>>> def read_line_count():
>>>>>     if os.path.exists('line_count.txt'):
>>>>>         with open('line_count.txt', 'r') as file:
>>>>>             return int(file.read())
>>>>>     return 0
>>>>>
>>>>> def write_line_count(line_count):
>>>>>     with open('line_count.txt', 'w') as file:
>>>>>         file.write(str(line_count))
>>>>>
>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>     lines = []
>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>         for line in input_file.readlines():
>>>>>             lines.append(line.strip())
>>>>>
>>>>>     if not os.path.exists(output_directory):
>>>>>         os.mkdir(output_directory)
>>>>>
>>>>>     random.shuffle(lines)
>>>>>
>>>>>     if start_line is None:
>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>     else:
>>>>>         line_count = start_line
>>>>>
>>>>>     if end_line is None:
>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>     else:
>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>
>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>         font_serial = 1
>>>>>         for line in lines:
>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>
>>>>>             # Generate a unique serial number for each line
>>>>>             line_serial = f"{line_count:d}"
>>>>>
>>>>>             # GT (Ground Truth) text filename
>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>                 output_file.writelines([line])
>>>>>
>>>>>             # Image filename
>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>>             subprocess.run([
>>>>>                 'text2image',
>>>>>                 f'--font={font}',
>>>>>                 f'--text={line_gt_text}',
>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>                 '--max_pages=1',
>>>>>                 '--strip_unrenderable_words',
>>>>>                 '--leading=36',
>>>>>                 '--xsize=3600',
>>>>>                 '--ysize=350',
>>>>>                 '--char_spacing=1.0',
>>>>>                 '--exposure=0',
>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>             ])
>>>>>
>>>>>             line_count += 1
>>>>>             font_serial += 1
>>>>>
>>>>>         # Reset font_serial for the next font iteration
>>>>>         font_serial = 1
>>>>>
>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>
>>>>> if __name__ == "__main__":
>>>>>     parser = argparse.ArgumentParser()
>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>     args = parser.parse_args()
>>>>>
>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>
>>>>>     # Create an instance of the FontList class
>>>>>     font_list = FontList()
>>>>>
>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>
>>>>> *And the training code:*
>>>>>
>>>>> import subprocess
>>>>>
>>>>> # List of font names
>>>>> font_names = ['ben']
>>>>>
>>>>> for font in font_names:
>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>     subprocess.run(command, shell=True)
>>>>>
>>>>> Any suggestions to help identify the problem?
>>>>> Thanks, everyone.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
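
P.S. On the question at the top of this message about where the font names live: the quoted script does `from FontList import FontList`, so Python needs a module named FontList.py next to the script. Below is a minimal sketch of what that file might look like, plus two small helpers. The helper names (list_fonts_command, file_base_name) and the fonts directory are my own assumptions for illustration, not something from the original scripts. The font names are examples taken from this thread.

```python
import pathlib

# Contents of a hypothetical FontList.py, placed next to the generation
# script so that `from FontList import FontList` resolves.
class FontList:
    def __init__(self):
        # One string per font, each followed by a comma. These names are
        # examples from this thread; use fonts installed on your machine.
        self.fonts = [
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Ador Orjoma Unicode",
        ]


def list_fonts_command(fonts_dir):
    """Build the text2image call that prints the font names it can render."""
    return ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"]


def file_base_name(training_text_file, line_index, font_name):
    """Mirror the naming scheme of the quoted script: <stem>_<line>_<font>."""
    stem = pathlib.Path(training_text_file).stem
    return f"{stem}_{line_index}_{font_name.replace(' ', '_')}"


if __name__ == "__main__":
    # /usr/share/fonts is an assumed location; adjust it for your system.
    print(" ".join(list_fonts_command("/usr/share/fonts")))
    for font in FontList().fonts:
        print(file_base_name("langdata/ben.training_text", 0, font))
```

Running the printed text2image command by hand should show the exact font names to copy into the list. A name that text2image does not recognise will probably not render, so it seems worth checking before a long run.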