I think EasyOCR is best for ID cards and that kind of image, but for document images like books, Tesseract works better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.
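(If you do want to give EasyOCR a quick try, a minimal sketch looks like this; it assumes the easyocr package is installed, 'page.png' stands in for your own scan, and 'bn' is EasyOCR's Bengali language code.)

import easyocr

# Sketch only: load the Bengali model (downloaded on first run) and OCR one image.
reader = easyocr.Reader(['bn'])
for bbox, text, confidence in reader.readtext('page.png'):
    print(text, confidence)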
I have added dictionary words, but the result is the same. What kind of problem did you face when fine-tuning for a few new characters, as you mentioned ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
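(For the dictionary idea, here is a rough sketch of one way to experiment: unpack the language data, rebuild the word DAWG from a larger wordlist, and pack it back. This is only an outline under assumptions: the Tesseract training tools combine_tessdata and wordlist2dawg are installed, ben.traineddata comes from tessdata_best, and ben.wordlist is a UTF-8, one-word-per-line file you supply. Check each tool's --help, since the exact component names can differ.)

import subprocess

# Sketch: swap a bigger Bengali wordlist into ben.traineddata (file names are assumptions).
subprocess.run(['combine_tessdata', '-u', 'ben.traineddata', 'ben.'], check=True)   # unpack components
subprocess.run(['wordlist2dawg', 'ben.wordlist', 'ben.lstm-word-dawg', 'ben.lstm-unicharset'], check=True)
subprocess.run(['combine_tessdata', '-o', 'ben.traineddata', 'ben.lstm-word-dawg'], check=True)  # repack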
On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

How is your training going for Bengali?
It was nearly good, but I ran into a problem with spaces between words: some words get a space, but most of them don't. I think the problem is in the dataset, but I used the default Bengali training text from Tesseract, so I am confused and need to explore more. By the way, you can try what Lorenzo Blz suggested. Training from scratch is actually harder than fine-tuning, so you could explore different datasets. If you succeed, please let me know how you did the whole process; I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts, only one font; that is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. That MODEL_NAME is what the training script selects when looping over each set of tif, gt.txt, and .box files created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

Your script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; FontList.py sits in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts into your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv. After that, add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the tesstrain folder.
[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script, the one you use to train (the subprocess loop with MODEL_NAME shown above). Do you have the font names listed in a file in the same/root directory? How do you set up the font names in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and you can also use it as a breakpoint: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, the checkpoint lets you pick up where you left off.

Script for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            # Ground-truth text file for this line/font pair
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Render the line image (and .box file) with text2image
            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this into it (note the comma after every font name):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

This class is then imported by the script above.

The breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) to suit your run, and for training the checkpointing works as you already know.
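(As an aside, if a run does get interrupted, a purely illustrative helper like the one below can report roughly where it stopped. It assumes the tesstrain/data/eng-ground-truth output directory and the <text>_<line>_<font> file naming used by the script above, and the highest rendered line index is only a rough guide for the next --start value.)

import re
from pathlib import Path

# Hypothetical resume helper: find the highest line index that already has a rendered .tif.
output_directory = Path('tesstrain/data/eng-ground-truth')
indices = []
for tif in output_directory.glob('*.tif'):
    match = re.search(r'_(\d+)_', tif.stem)   # filenames look like <text>_<line>_<font>.tif
    if match:
        indices.append(int(match.group(1)))
if indices:
    print(f'Highest rendered line index: {max(indices)}; consider resuming with --start {max(indices) + 1}')
else:
    print('No .tif files yet; start from --start 0')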
On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

Hi mdalihu,
the script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com.

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what role does the length of the training_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the length of the lines?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.
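(To illustrate shree's suggestion in terms of the FontList.py approach used above: keep the new fonts and also list standard Bengali fonts that the existing ben model already handles, so fine-tuning still sees them. The "default" font names below are placeholders, not the actual fonts the official model was trained with; use the exact names your system reports.)

# Hypothetical FontList.py mixing new fonts with already-supported Bengali fonts.
class FontList:
    def __init__(self):
        self.fonts = [
            # new fonts being introduced by fine-tuning
            "Sagar Medium",
            "Ekushey Lohit Normal",
            # commonly installed Bengali fonts standing in for the "default" fonts
            # (placeholder names; check what `fc-list :lang=bn` actually reports)
            "Lohit Bengali",
            "Mukti",
        ]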
On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training text, tessdata_best, and everything else. Everything is good except for one problem: the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.

Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.
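(A sketch of one way to narrow this kind of regression down: run the same default-font sample page through both the original tessdata_best ben model and the fine-tuned one, then diff the outputs. The image name and the tessdata directory paths below are placeholders, not paths from this thread.)

import subprocess

# Hypothetical A/B check: OCR the same default-font sample with the original and the fine-tuned model.
for label, tessdata_dir in [('original', '../tesseract/tessdata_best'),
                            ('finetuned', '../tesseract/tessdata')]:
    subprocess.run(['tesseract', '--tessdata-dir', tessdata_dir,
                    'sample_default_font.png', f'out_{label}', '-l', 'ben'], check=True)
# Compare out_original.txt with out_finetuned.txt to see where the default fonts degrade.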