How is your training going for Bengali? I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a training error rate of 0.51. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?
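
To put a number on how bad the OCR output is, one option is a character error rate (CER) between a ground-truth line and what the model returns for it. The following is only a minimal sketch: the function names are made up for illustration and are not part of Tesseract or tesstrain, and it counts Unicode code points rather than grapheme clusters, which matters for Bengali conjuncts.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance divided by the ground-truth length (hypothetical helper)."""
    if not ground_truth:
        # Convention chosen for this sketch: empty vs. empty is a perfect match.
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # Placeholder strings; in practice these would come from a .gt.txt file
    # and from the OCR output for the matching image.
    print(char_error_rate("abcd", "abxd"))
```

Averaging this over a few hundred held-out lines that were not used for training gives a figure that can be compared with the error rate reported during training.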
On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

> Yes, he doesn't mention all fonts but only one font. That way he didn't
> use *MODEL_NAME* in a separate script file, I think.
>
> Actually, here we teach all the *tif, gt.txt, and .box files* which are
> created by *MODEL_NAME* (I mean the *eng, ben, oro* flag or language code),
> because when we first create the *tif, gt.txt, and .box files*, every file
> name starts with *MODEL_NAME*. This is the *MODEL_NAME* we selected in the
> training script for looping over each of the tif, gt.txt, and .box files
> created with *MODEL_NAME*.
>
> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com
> wrote:
>
>> Yes, I am familiar with the video and have set up the folder structure as
>> you did. Indeed, I have tried a number of fine-tuning runs with a single
>> font following Garcia's video. But your script is much better because it
>> supports multiple fonts. The whole improvement you made is brilliant and
>> very useful. It is all working for me.
>> The only part that I didn't understand is the trick you used in your
>> tesseract_train.py script. You see, I have been doing exactly what you did
>> except for this script.
>>
>> The script seems to have the trick of sending/teaching each of the fonts
>> (iteratively) into the model. The script I have been using (which I got
>> from Garcia) doesn't mention fonts at all.
>>
>> *TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro
>> TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000*
>> Does it mean that my model doesn't train the fonts (even if the fonts have
>> been included in the splitting process, in the other script)?
>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com
>> wrote:
>>
>>> import subprocess
>>>
>>> # List of font names
>>> font_names = ['ben']
>>>
>>> for font in font_names:
>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>     subprocess.run(command, shell=True)
>>>
>>> 1. This command is for training; I saved it as 'tesseract_training.py'
>>> inside the tesstrain folder.
>>> 2. The root directory means your main training folder, which contains the
>>> langdata, tesseract, and tesstrain folders. If you watch this tutorial,
>>> https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the
>>> folder structure better. I only created tesseract_training.py in the
>>> tesstrain folder for training; the FontList.py file is in the main path,
>>> alongside langdata, tesseract, tesstrain, and split_training_text.py.
>>> 3. First of all, you have to put all the fonts in your Linux fonts folder,
>>> /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv
>>>
>>> After that, you have to add the exact font names to the FontList.py file,
>>> as I did.
>>> I have added two pictures of my folder structure. The first is the main
>>> structure and the second is the collapsed tesstrain folder.
>>>
>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11
>>> 135014.png]
>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com
>>> wrote:
>>>
>>>> Thank you so much for putting out these brilliant scripts. They make
>>>> the process much more efficient.
>>>>
>>>> I have one more question on the other script that you use to train.
>>>>
>>>> import subprocess
>>>>
>>>> # List of font names
>>>> font_names = ['ben']
>>>>
>>>> for font in font_names:
>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>     subprocess.run(command, shell=True)
>>>>
>>>> Do you have the names of the fonts listed in a file in the same/root
>>>> directory? How do you set up the names of the fonts in the file, if you
>>>> don't mind sharing it?
>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com
>>>> wrote:
>>>>
>>>>> You can use the new script below. It's better than the previous two
>>>>> scripts. You can create the *tif, gt.txt, and .box files* with multiple
>>>>> fonts, and you can also use a breakpoint: if VS Code closes or anything
>>>>> else interrupts the creation of the *tif, gt.txt, and .box files*, you
>>>>> can use the checkpoint to resume from where it stopped.
>>>>>
>>>>> Code for the *tif, gt.txt, and .box files*:
>>>>>
>>>>>
>>>>> import os
>>>>> import random
>>>>> import pathlib
>>>>> import subprocess
>>>>> import argparse
>>>>> from FontList import FontList
>>>>>
>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>     lines = []
>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>         lines = input_file.readlines()
>>>>>
>>>>>     if not os.path.exists(output_directory):
>>>>>         os.mkdir(output_directory)
>>>>>
>>>>>     if start_line is None:
>>>>>         start_line = 0
>>>>>
>>>>>     if end_line is None:
>>>>>         end_line = len(lines) - 1
>>>>>
>>>>>     for font_name in font_list.fonts:
>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>             line = lines[line_index].strip()
>>>>>
>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>
>>>>>             line_serial = f"{line_index:d}"
>>>>>
>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>
>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>                 output_file.writelines([line])
>>>>>
>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>             subprocess.run([
>>>>>                 'text2image',
>>>>>                 f'--font={font_name}',
>>>>>                 f'--text={line_gt_text}',
>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>                 '--max_pages=1',
>>>>>                 '--strip_unrenderable_words',
>>>>>                 '--leading=36',
>>>>>                 '--xsize=3600',
>>>>>                 '--ysize=330',
>>>>>                 '--char_spacing=1.0',
>>>>>                 '--exposure=0',
>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>             ])
>>>>>
>>>>> if __name__ == "__main__":
>>>>>     parser = argparse.ArgumentParser()
>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>     args = parser.parse_args()
>>>>>
>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>
>>>>>     font_list = FontList()
>>>>>
>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>
>>>>>
>>>>> Then create a file called "FontList.py" in the root directory and paste
>>>>> this into it:
>>>>>
>>>>>
>>>>> class FontList:
>>>>>     def __init__(self):
>>>>>         self.fonts = [
>>>>>             "Gerlick",
>>>>>             "Sagar Medium",
>>>>>             "Ekushey Lohit Normal",
>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>             "Ador Orjoma Unicode",
>>>>>         ]
>>>>>
>>>>>
>>>>> Then import it in the code above.
>>>>>
>>>>>
>>>>> *Breakpoint command:*
>>>>>
>>>>>
>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>
>>>>>
>>>>> Change the checkpoint range (--start 0 --end 11) to suit your needs.
>>>>>
>>>>> *And the training checkpoint, as you already know.*
>>>>>
>>>>>
>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com
>>>>> wrote:
>>>>>
>>>>>> Hi mhalidu,
>>>>>> the script you posted here seems much more extensive than the one you
>>>>>> posted before:
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>> .
>>>>>>
>>>>>> I have been using your earlier script. It is magical. How is this one
>>>>>> different from the earlier one?
>>>>>>
>>>>>> Thank you for posting these scripts, by the way. They have saved me
>>>>>> countless hours by running multiple fonts in one sweep. I was not able
>>>>>> to find any instructions on how to train for multiple fonts. The
>>>>>> official manual is also unclear. Your script helped me to get started.
>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3
>>>>>> mdalihu...@gmail.com wrote:
>>>>>>
>>>>>>> OK, I will try as you said.
>>>>>>> One more thing: what role do the training_text lines play? I have
>>>>>>> seen that Bengali text has long lines of words, so I want to know how
>>>>>>> many words or characters per line would be the better choice for
>>>>>>> training. And should '--xsize=3600', '--ysize=350' be set according
>>>>>>> to the number of words per line?
>>>>>>>
>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>>
>>>>>>>> Include the default fonts also in your fine-tuning list of fonts
>>>>>>>> and see if that helps.
>>>>>>>>
>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I have trained some new fonts with the fine-tuning method for the
>>>>>>>>> Bengali language in Tesseract 5, using the official training_text,
>>>>>>>>> tessdata_best, and everything else. Everything works, but the
>>>>>>>>> problem is that the default font, which was trained before, no
>>>>>>>>> longer converts text as it did previously, while my new fonts work
>>>>>>>>> well.
>>>>>>>>> I don't understand why it's happening. I am sharing the code so
>>>>>>>>> you can see what is going on.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>> import os
>>>>>>>>> import random
>>>>>>>>> import pathlib
>>>>>>>>> import subprocess
>>>>>>>>> import argparse
>>>>>>>>> from FontList import FontList
>>>>>>>>>
>>>>>>>>> def read_line_count():
>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>             return int(file.read())
>>>>>>>>>     return 0
>>>>>>>>>
>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>
>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>     lines = []
>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>
>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>
>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>
>>>>>>>>>     if start_line is None:
>>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>>     else:
>>>>>>>>>         line_count = start_line
>>>>>>>>>
>>>>>>>>>     if end_line is None:
>>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>>     else:
>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>
>>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>>         font_serial = 1
>>>>>>>>>         for line in lines:
>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>
>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>
>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>
>>>>>>>>>             # Image filename
>>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>>>>>>             subprocess.run([
>>>>>>>>>                 'text2image',
>>>>>>>>>                 f'--font={font}',
>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>                 '--leading=36',
>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>                 '--ysize=350',
>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>                 '--exposure=0',
>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>             ])
>>>>>>>>>
>>>>>>>>>             line_count += 1
>>>>>>>>>             font_serial += 1
>>>>>>>>>
>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>         font_serial = 1
>>>>>>>>>
>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>
>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>
>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>
>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>     font_list = FontList()
>>>>>>>>>
>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *And the training code:*
>>>>>>>>>
>>>>>>>>> import subprocess
>>>>>>>>>
>>>>>>>>> # List of font names
>>>>>>>>> font_names = ['ben']
>>>>>>>>>
>>>>>>>>> for font in font_names:
>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any suggestions to help identify the problem?
>>>>>>>>> Thanks, everyone.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>> .