The characters are still getting missed, even after fine-tuning. I never made any progress, although I tried many different approaches: some specific characters are always missing from the OCR result.
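When specific characters keep disappearing after fine-tuning, one cheap sanity check is whether those characters actually occur in the ground-truth text at all (text2image's --strip_unrenderable_words can also silently drop them when the font cannot render them). A minimal sketch, not from this thread; the helper name and sample text are made up:

```python
# Report which target characters never appear in the training text.
# In practice you would read the text from e.g. langdata/ben.training_text.
def missing_chars(target_chars, training_text):
    """Return the target characters that never occur in training_text."""
    seen = set(training_text)
    return [c for c in target_chars if c not in seen]

if __name__ == "__main__":
    sample = "banana bread"  # stand-in for the real training text
    print(missing_chars(["q", "a"], sample))  # prints: ['q']
```

If a character shows up here as missing, no amount of extra iterations will teach it; it has to be added to the training text (and be renderable by at least one font) first.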
On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

I think EasyOCR is best for ID cards and similar image processing, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, mdalihu...@gmail.com wrote:

How is your training going for Bengali? Mine was nearly good, but I ran into spacing problems between words: some words are separated by spaces, but most of them have no space.
I think the problem is in the dataset, but I used the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz suggested. Training from scratch is actually harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process; I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6, desal...@gmail.com wrote:

How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts but only one font. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script to loop over each tif, gt.txt, and .box file created by MODEL_NAME.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video.
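A side note on the gap reported above between a low training error (0.51) and terrible real-world accuracy: the training error only reflects data the trainer has seen, so it helps to measure character error rate (CER) on a held-out set of images the trainer never saw. A minimal, dependency-free sketch (the function names are mine, not from tesstrain; tesstrain itself can report this via lstmeval):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits needed / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("abcd", "abed"))  # prints: 0.25
```

Comparing CER on the held-out set against the reported training error shows whether the model has actually learned or merely memorized the training lines.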
But your script is much better, because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.
The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command, in a script I have named 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file sits in the main path alongside langdata, tesseract, tesstrain, and split_training_text.py.
3.
First of all, you have to put all the fonts in your Linux fonts folder (/usr/share/fonts/), then run sudo apt update followed by sudo fc-cache -fv.

After that, you have to add each font's exact name to the FontList.py file, like mine.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the font names in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while creating the tif, gt.txt, and .box files, you can use the checkpoint to resume from where it stopped.
The script for the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            # GT (ground truth) text filename, unique per line and font
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this into it:

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the script above.

The breakpoint command is:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?
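One pitfall worth knowing when maintaining a font list like FontList.py: if a comma between two entries is missed, Python silently concatenates the adjacent string literals into a single, nonexistent font name, and text2image will then fail to find that font. A quick illustration, using font names from the list above:

```python
# A missing comma merges two adjacent string literals into one entry.
broken = [
    "Gerlick"              # <-- comma missing here
    "Sagar Medium",
    "Ekushey Lohit Normal",
]
fixed = [
    "Gerlick",
    "Sagar Medium",
    "Ekushey Lohit Normal",
]
print(len(broken), repr(broken[0]))  # prints: 2 'GerlickSagar Medium'
print(len(fixed))                    # prints: 3
```

An `assert len(font_list.fonts) == expected_count` at startup is a cheap guard against this.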
Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the length of the lines?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with fine-tuning methods for the Bengali language in Tesseract 5, and I used all the official training_text, tessdata_best, and the other resources as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code below to show what is going on.
Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    # Note: shuffling means the line numbering differs between runs.
    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)
    # Note: end_line_count is computed but not used by the loop below,
    # which always iterates over all lines.

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
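A closing note on the training loop that appears throughout this thread: despite the variable being named font, what it iterates over are MODEL_NAME values (model/language codes such as ben), not fonts; the fonts were already rendered into the tif/gt.txt pairs by split_training_text.py. A small sketch that only assembles the command string, to make that distinction explicit (paths and defaults copied from the thread; the helper name is mine):

```python
def make_training_command(model_name, start_model="ben", max_iterations=10000):
    """Assemble the tesstrain make invocation for one model (not one font)."""
    return (
        "TESSDATA_PREFIX=../tesseract/tessdata make training "
        f"MODEL_NAME={model_name} START_MODEL={start_model} "
        f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS={max_iterations}"
    )

for model in ["ben"]:  # model/language codes, not font names
    cmd = make_training_command(model)
    print(cmd)
    # subprocess.run(cmd, shell=True)  # run from inside the tesstrain folder
```

So adding a new font never requires another entry in this loop; it requires another entry in FontList.py before the data-generation step.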