Hey, I'm new to OCR and I don't really know Python (I mostly work in JavaScript), but I fixed the problem.
I'm sharing the code for what I did:

*1. Replace the contents of your main 'split_training_text.py' file with the code below:*

    import os
    import random
    import pathlib
    import subprocess

    from tesstrain.tesseract_training import run_tesseract_training

    training_text_file = 'langdata/eng.training_text'
    fonts = ['lato', 'roboto']

    # Read every line of the training text
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    output_directory = 'tesstrain/data'
    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    # Keep a small random sample of lines
    random.shuffle(lines)
    count = 5
    lines = lines[:count]

    line_count = 0
    for font in fonts:
        # Each font gets its own '<font>-ground-truth' folder
        font_output_directory = os.path.join(
            output_directory, f'{font}-ground-truth')
        if not os.path.exists(font_output_directory):
            os.mkdir(font_output_directory)

        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem
            line_training_text = os.path.join(
                font_output_directory,
                f'{training_text_file_name}_{line_count}.gt.txt')
            with open(line_training_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'eng_{line_count}'

            # Render the line as an image in the current font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_training_text}',
                f'--outputbase={os.path.join(font_output_directory, file_base_name)}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=32',
                '--xsize=3600',
                '--ysize=480',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset'
            ])

            line_count += 1

        # Train on this font once all of its lines have been rendered
        run_tesseract_training(font)

Then run it with the command: *python3 split_training_text.py*

I only trained with two fonts, and it seemed to work the same as with one font, but I have not tested whether the result is actually correct. You can add more fonts and try it.
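One thing to keep in mind: as far as I can tell, tesstrain reads its data from 'data/<MODEL_NAME>-ground-truth', so a separate folder per font gives you a separate model per font. If the goal is a single model that has seen every font, the lines for all fonts probably need to go into one ground-truth folder, with file names that do not collide between fonts. Below is a rough, untested sketch of just the naming part. The helper names (`font_slug`, `plan_ground_truth`, `text2image_command`) and the model name 'multifont' are made up for illustration; the text2image flags are the same ones used in step 1.

```python
import re
import pathlib


def font_slug(font):
    """Turn a font name such as 'OCRA Medium' into a file-name-safe slug."""
    return re.sub(r'[^a-z0-9]+', '_', font.lower()).strip('_')


def plan_ground_truth(lines, fonts, model_name, data_dir='tesstrain/data'):
    """Return one (font, gt_path, output_base) entry per font and line.

    All entries share a single '<model_name>-ground-truth' folder; the font
    slug in the base name keeps files from different fonts apart.
    """
    ground_truth_dir = pathlib.Path(data_dir) / f'{model_name}-ground-truth'
    plan = []
    for font in fonts:
        for index, _line in enumerate(lines):
            name = f'{model_name}_{font_slug(font)}_{index}'
            base = (ground_truth_dir / name).as_posix()
            plan.append((font, f'{base}.gt.txt', base))
    return plan


def text2image_command(font, gt_path, output_base, unicharset_file):
    """Build the text2image argument list for one planned entry."""
    return [
        'text2image',
        f'--font={font}',
        f'--text={gt_path}',
        f'--outputbase={output_base}',
        '--max_pages=1',
        '--strip_unrenderable_words',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',
        '--char_spacing=1.0',
        '--exposure=0',
        f'--unicharset_file={unicharset_file}',
    ]
```

With a plan like `plan_ground_truth(lines, ['lato', 'roboto'], 'multifont')`, you would write each line to its `gt_path`, pass the command list to `subprocess.run`, and then run the training once with MODEL_NAME=multifont instead of once per font. I have not run this end to end, so treat it as a starting point only.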
*2. Create a file called 'tesseract_training.py' in the 'tesstrain' folder and paste in the code below:*

    import os
    import subprocess

    # List of font names
    font_names = ['lato', 'roboto']


    def run_tesseract_training(font):
        # Trains one model named after the font; make is run from the
        # 'tesstrain' folder (where this file lives), since the relative
        # '../tesseract/tessdata' paths assume that working directory.
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True,
                       cwd=os.path.dirname(os.path.abspath(__file__)))


    if __name__ == '__main__':
        for font in font_names:
            run_tesseract_training(font)

Then run the command: *python3 split_training_text.py*

On Wednesday, 18 January, 2023 at 6:52:24 pm UTC+6 Muhammad Hamza wrote:

> I want to finetune the ell.traineddata with multiple fonts at once can
> anyone tell me the flow of this scenario.
>
> subprocess.run([
>     'text2image',
>     '--font=OCRA Medium',
>     f'--text={line_training_text}',
>     f'--outputbase={output_directory}/{file_base_name}',
>     '--max_pages=1',
>     '--strip_unrenderable_words',
>     '--leading=32',
>     '--xsize=3600',
>     '--ysize=480',
>     '--char_spacing=1.0',
>     '--exposure=0',
>     '--unicharset_file=langdata/bos.unicharset'
> ])
>
> above in -font only one is mention can anyone tell me how i can train with
> multiple fonts at once
> thanks