I think EasyOCR is best for ID cards and that kind of image, but for document images like books, Tesseract works better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.
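(If you do want to give EasyOCR a quick try, a minimal sketch looks like this; it assumes the easyocr package is installed, 'page.png' stands in for your own scan, and 'bn' is EasyOCR's Bengali language code.)

import easyocr

# Sketch only: load the Bengali model (downloaded on first run) and OCR one image.
reader = easyocr.Reader(['bn'])
for bbox, text, confidence in reader.readtext('page.png'):
    print(text, confidence)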
I have added dictionary words, but the result is the same. What kind of problem did you face when fine-tuning for a few new characters, as you mentioned ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
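(For the dictionary idea, here is a rough sketch of one way to experiment: unpack the language data, rebuild the word DAWG from a larger wordlist, and pack it back. This is only an outline under assumptions: the Tesseract training tools combine_tessdata and wordlist2dawg are installed, ben.traineddata comes from tessdata_best, and ben.wordlist is a UTF-8, one-word-per-line file you supply. Check each tool's --help, since the exact component names can differ.)

import subprocess

# Sketch: swap a bigger Bengali wordlist into ben.traineddata (file names are assumptions).
subprocess.run(['combine_tessdata', '-u', 'ben.traineddata', 'ben.'], check=True)   # unpack components
subprocess.run(['wordlist2dawg', 'ben.wordlist', 'ben.lstm-word-dawg', 'ben.lstm-unicharset'], check=True)
subprocess.run(['combine_tessdata', '-o', 'ben.traineddata', 'ben.lstm-word-dawg'], check=True)  # repack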
On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

How is your training going for Bengali?
It was nearly good, but I ran into a problem with spaces between words: some words get a space, but most of them don't. I think the problem is in the dataset, but I used the default Bengali training text from Tesseract, so I am confused and need to explore more. By the way, you can try what Lorenzo Blz suggested. Training from scratch is actually harder than fine-tuning, so you could explore different datasets. If you succeed, please let me know how you did the whole process; I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts, only one font; that is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. That MODEL_NAME is what the training script selects when looping over each set of tif, gt.txt, and .box files created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

Your script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; FontList.py sits in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts into your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv. After that, add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the tesstrain folder.
[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script, the one you use to train (the subprocess loop with MODEL_NAME shown above). Do you have the font names listed in a file in the same/root directory? How do you set up the font names in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and you can also use it as a breakpoint: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, the checkpoint lets you pick up where you left off.

Script for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            # Ground-truth text file for this line/font pair
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Render the line image (and .box file) with text2image
            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this into it (note the comma after every font name):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

This class is then imported by the script above.

The breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) to suit your run, and for training the checkpointing works as you already know.
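(As an aside, if a run does get interrupted, a purely illustrative helper like the one below can report roughly where it stopped. It assumes the tesstrain/data/eng-ground-truth output directory and the <text>_<line>_<font> file naming used by the script above, and the highest rendered line index is only a rough guide for the next --start value.)

import re
from pathlib import Path

# Hypothetical resume helper: find the highest line index that already has a rendered .tif.
output_directory = Path('tesstrain/data/eng-ground-truth')
indices = []
for tif in output_directory.glob('*.tif'):
    match = re.search(r'_(\d+)_', tif.stem)   # filenames look like <text>_<line>_<font>.tif
    if match:
        indices.append(int(match.group(1)))
if indices:
    print(f'Highest rendered line index: {max(indices)}; consider resuming with --start {max(indices) + 1}')
else:
    print('No .tif files yet; start from --start 0')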
On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

Hi mdalihu,
the script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com.

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what role does the length of the training_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the length of the lines?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.
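(To illustrate shree's suggestion in terms of the FontList.py approach used above: keep the new fonts and also list standard Bengali fonts that the existing ben model already handles, so fine-tuning still sees them. The "default" font names below are placeholders, not the actual fonts the official model was trained with; use the exact names your system reports.)

# Hypothetical FontList.py mixing new fonts with already-supported Bengali fonts.
class FontList:
    def __init__(self):
        self.fonts = [
            # new fonts being introduced by fine-tuning
            "Sagar Medium",
            "Ekushey Lohit Normal",
            # commonly installed Bengali fonts standing in for the "default" fonts
            # (placeholder names; check what `fc-list :lang=bn` actually reports)
            "Lohit Bengali",
            "Mukti",
        ]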
On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training text, tessdata_best, and everything else. Everything is good except for one problem: the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.

Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.
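(A sketch of one way to narrow this kind of regression down: run the same default-font sample page through both the original tessdata_best ben model and the fine-tuned one, then diff the outputs. The image name and the tessdata directory paths below are placeholders, not paths from this thread.)

import subprocess

# Hypothetical A/B check: OCR the same default-font sample with the original and the fine-tuned model.
for label, tessdata_dir in [('original', '../tesseract/tessdata_best'),
                            ('finetuned', '../tesseract/tessdata')]:
    subprocess.run(['tesseract', '--tessdata-dir', tessdata_dir,
                    'sample_default_font.png', f'out_{label}', '-l', 'ben'], check=True)
# Compare out_original.txt with out_finetuned.txt to see where the default fonts degrade.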