Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Des Bw Wed, 13 Sep 2023 00:13:48 -0700

How is your training going for Bengali?
I have been trying to train from scratch. I prepared about 64,000 lines of 
text (which produced about 255,000 files in the end) and ran the training 
for 150,000 iterations, reaching a training error rate of 0.51. I was hoping 
to get reasonable accuracy. Unfortunately, when I run OCR with the resulting 
.traineddata, the accuracy is absolutely terrible. Do you think I made some 
mistake, or is that an expected result?
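
To put a number on "terrible", one option is to compute the character error 
rate (CER) of the OCR output against ground truth that was not used for 
training. The error rate printed during training is measured on the training 
lines, so it may not reflect accuracy on real images. Below is a minimal 
sketch in plain Python, assuming the OCR text and the matching ground-truth 
text are already available as strings. The function names are my own and are 
not part of Tesseract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, truth_text: str) -> float:
    """Edit distance divided by the ground-truth length, in code points."""
    if not truth_text:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, truth_text) / len(truth_text)
```

Averaging this over a few hundred held-out lines per font should show whether 
the problem is general or limited to particular fonts.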


On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] 
wrote:

> Yes, he doesn't mention all the fonts, only one. That is why he didn't 
> need to set MODEL_NAME in a separate script file, I think.
>
> Here we actually train on all the tif, gt.txt, and .box files created 
> under MODEL_NAME, by which I mean the language code such as eng, ben, or 
> oro. When we first create the tif, gt.txt, and .box files, every file name 
> starts with MODEL_NAME. The MODEL_NAME we select in the training script is 
> what makes it loop over each tif, gt.txt, and .box file created for that 
> MODEL_NAME.
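
As far as I understand the tesstrain Makefile, by default it takes its 
samples from data/MODEL_NAME-ground-truth (for example 
tesstrain/data/ben-ground-truth in this thread), so it is worth checking that 
every sample in that folder is complete. Here is a small sketch that reports 
base names lacking one of the three files; the function name is hypothetical.

```python
from pathlib import Path


def incomplete_samples(ground_truth_dir):
    """Return base names that lack a .tif, .gt.txt or .box file."""
    suffixes = (".tif", ".gt.txt", ".box")
    found = {}
    for path in Path(ground_truth_dir).iterdir():
        for suffix in suffixes:
            if path.name.endswith(suffix):
                base = path.name[: -len(suffix)]
                found.setdefault(base, set()).add(suffix)
                break
    # A sample is complete only when all three suffixes were seen
    return sorted(base for base, have in found.items()
                  if have != set(suffixes))
```

An empty list means every sample has its image, ground-truth text, and box 
file.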
>
> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] 
> wrote:
>
>> Yes, I am familiar with the video and have set up the folder structure as 
>> you did. Indeed, I have tried a number of fine-tuning runs with a single 
>> font, following Garcia's video. But your script is much better because it 
>> supports multiple fonts. The whole improvement you made is brilliant and 
>> very useful. It is all working for me. 
>> The only part that I didn't understand is the trick you used in your 
>> tesseract_train.py script. You see, I have been doing exactly what you 
>> did, except for this script. 
>>
>> The script seems to have a trick for sending/teaching each of the fonts 
>> (iteratively) to the model. The command I have been using (which I got 
>> from Garcia) doesn't mention fonts at all: 
>>
>> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro \
>>     TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>>
>> Does that mean my model doesn't train on the fonts (even though the fonts 
>> were included in the splitting process, in the other script)?
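
One way to see which fonts actually ended up in the training data is to count 
the ground-truth files per font. This sketch assumes the naming used by the 
split_training_text.py script in this thread, i.e. 
<text stem>_<line number>_<font name with underscores>.gt.txt, and that the 
text stem itself contains no underscore. The function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def samples_per_font(ground_truth_dir):
    """Count .gt.txt files per font, based on the file-name suffix."""
    counts = Counter()
    for path in Path(ground_truth_dir).glob("*.gt.txt"):
        base = path.name[: -len(".gt.txt")]
        parts = base.split("_", 2)  # text stem, line number, font name
        if len(parts) == 3:
            counts[parts[2]] += 1
    return counts
```

If every font shows up with a similar count, the fonts are represented in the 
files that the make command trains on.
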
>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] 
>> wrote:
>>
>>> import subprocess
>>> # List of font names
>>> font_names = ['ben']
>>> for font in font_names:
>>>     command = (
>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>     )
>>>     subprocess.run(command, shell=True)
>>>
>>> 1. This is the training command. I saved it as 'tesseract_training.py' 
>>> inside the tesstrain folder.
>>> 2. The root directory means your main training folder, the one that 
>>> contains the langdata, tesseract, and tesstrain folders. If you watch 
>>> this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8 , you will 
>>> understand the folder structure better. I only created 
>>> tesseract_training.py in the tesstrain folder for training; FontList.py 
>>> is in the main folder, alongside langdata, tesseract, tesstrain, and 
>>> split_training_text.py.
>>> 3. First of all, you have to put all the fonts in your Linux fonts 
>>> folder, /usr/share/fonts/, then run: sudo apt update, then 
>>> sudo fc-cache -fv
>>>
>>> After that, you have to add the exact font names to the FontList.py file, 
>>> as I did.
>>> I have attached two pictures of my folder structure. The first is the 
>>> main structure and the second is the collapsed tesstrain folder.
>>>
>>> [image: Screenshot 2023-09-11 134947.png] 
>>> [image: Screenshot 2023-09-11 135014.png] 
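
Before generating images it may help to confirm that every name in 
FontList.py is a font that text2image can actually find. text2image has a 
--list_available_fonts option (used together with --fonts_dir). The sketch 
below parses such a listing; I am assuming each line looks like 
"  0: Sagar Medium", which may differ between versions, and the function 
names are hypothetical.

```python
def parse_font_listing(listing):
    """Extract font names from lines that look like '  0: Sagar Medium'."""
    names = []
    for line in listing.splitlines():
        index, sep, name = line.partition(":")
        # Keep only lines of the form '<number>: <font name>'
        if sep and index.strip().isdigit() and name.strip():
            names.append(name.strip())
    return names


def missing_fonts(wanted, listing):
    """Return the wanted font names that do not appear in the listing."""
    available = set(parse_font_listing(listing))
    return [name for name in wanted if name not in available]
```

The listing text could come from running 
text2image --list_available_fonts --fonts_dir=/usr/share/fonts and capturing 
its output. Any name reported as missing would need to be installed or 
corrected before running the splitting script.
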
>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] 
>>> wrote:
>>>
>>>> Thank you so much for putting out these brilliant scripts. They make 
>>>> the process  much more efficient.
>>>>
>>>> I have one more question on the other script that you use to train. 
>>>>
>>>> import subprocess
>>>> # List of font names
>>>> font_names = ['ben']
>>>> for font in font_names:
>>>>     command = (
>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>     )
>>>>     subprocess.run(command, shell=True)
>>>>
>>>> Do you keep the font names listed in a file in the same/root directory?
>>>> How do you set up the font names in that file, if you don't mind 
>>>> sharing it?
>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] 
>>>> wrote:
>>>>
>>>>> You can use the new script below. It's better than the previous two 
>>>>> scripts. You can create tif, gt.txt, and .box files with multiple 
>>>>> fonts, and it also supports breakpoints: if VS Code closes or anything 
>>>>> else interrupts the creation of the tif, gt.txt, and .box files, you 
>>>>> can resume from the point where it stopped.
>>>>>
>>>>> Script for creating the tif, gt.txt, and .box files:
>>>>>
>>>>>
>>>>> import os
>>>>> import random
>>>>> import pathlib
>>>>> import subprocess
>>>>> import argparse
>>>>> from FontList import FontList
>>>>>
>>>>> def create_training_data(training_text_file, font_list, 
>>>>> output_directory, start_line=None, end_line=None):
>>>>>     lines = []
>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>         lines = input_file.readlines()
>>>>>
>>>>>     if not os.path.exists(output_directory):
>>>>>         os.mkdir(output_directory)
>>>>>
>>>>>     if start_line is None:
>>>>>         start_line = 0
>>>>>
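>>>>>     if end_line is None:
>>>>>         end_line = len(lines) - 1
>>>>>     # Clamp so that an --end beyond the last line cannot raise IndexError
>>>>>     end_line = min(end_line, len(lines) - 1)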
>>>>>
>>>>>     for font_name in font_list.fonts:
>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>             line = lines[line_index].strip()
>>>>>
>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>
>>>>>             line_serial = f"{line_index:d}"
>>>>>
>>>>>             # Base name shared by the .gt.txt, .tif and .box files
>>>>>             file_base_name = (f'{training_text_file_name}_{line_serial}_'
>>>>>                               f'{font_name.replace(" ", "_")}')
>>>>>             line_gt_text = os.path.join(output_directory,
>>>>>                                         f'{file_base_name}.gt.txt')
>>>>>
>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>                 output_file.writelines([line])
>>>>>
>>>>>             subprocess.run([
>>>>>                 'text2image',
>>>>>                 f'--font={font_name}',
>>>>>                 f'--text={line_gt_text}',
>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>                 '--max_pages=1',
>>>>>                 '--strip_unrenderable_words',
>>>>>                 '--leading=36',
>>>>>                 '--xsize=3600',
>>>>>                 '--ysize=330',
>>>>>                 '--char_spacing=1.0',
>>>>>                 '--exposure=0',
>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>             ])
>>>>>
>>>>> if __name__ == "__main__":
>>>>>     parser = argparse.ArgumentParser()
>>>>>     parser.add_argument('--start', type=int,
>>>>>                         help='Starting line count (inclusive)')
>>>>>     parser.add_argument('--end', type=int,
>>>>>                         help='Ending line count (inclusive)')
>>>>>     args = parser.parse_args()
>>>>>
>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>
>>>>>     font_list = FontList()
>>>>>
>>>>>     create_training_data(training_text_file, font_list, 
>>>>> output_directory, args.start, args.end)
>>>>>
>>>>>
>>>>>
>>>>> Then create a file called "FontList.py" in the root directory and paste 
>>>>> the following into it.
>>>>>
>>>>>
>>>>>
>>>>> class FontList:
>>>>>     def __init__(self):
>>>>>         # Exact font names as installed on the system
>>>>>         self.fonts = [
>>>>>             "Gerlick",
>>>>>             "Sagar Medium",
>>>>>             "Ekushey Lohit Normal",
>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>             "Ador Orjoma Unicode",
>>>>>         ]
>>>>>
>>>>>
>>>>>
>>>>> The script above imports it with "from FontList import FontList".
>>>>>
>>>>>
>>>>> Command for running with breakpoints:
>>>>>
>>>>>
>>>>> sudo python3 split_training_text.py --start 0  --end 11
>>>>>
>>>>>
>>>>>
>>>>> Change the --start 0 --end 11 values to suit your own checkpoint.
>>>>>
>>>>> The training checkpoints work as you already know.
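
As an alternative to tracking --start and --end by hand, a resumed run could 
skip every line whose image already exists. This is only a sketch, assuming 
the file naming of the script above (<text stem>_<line index>_<font>.tif) and 
that text2image writes <outputbase>.tif. The function name is hypothetical, 
and a .tif left half-written by an interrupted run would wrongly count as 
done.

```python
import os


def pending_line_indices(output_directory, text_stem, font_name, line_count):
    """Return indices whose .tif has not been written yet for this font."""
    font_tag = font_name.replace(" ", "_")
    pending = []
    for index in range(line_count):
        tif_path = os.path.join(output_directory,
                                f"{text_stem}_{index}_{font_tag}.tif")
        if not os.path.exists(tif_path):
            pending.append(index)
    return pending
```

The inner loop of create_training_data could then iterate over this list 
instead of range(start_line, end_line + 1).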
>>>>>
>>>>>
>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] 
>>>>> wrote:
>>>>>
>>>>>> Hi mhalidu, 
>>>>>> the script you posted here seems much more extensive than the one you 
>>>>>> posted before: 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>> .
>>>>>>
>>>>>> I have been using your earlier script. It is magical. How is this one 
>>>>>> different from the earlier one?
>>>>>>
>>>>>> Thank you for posting these scripts, by the way. They have saved me 
>>>>>> countless hours by running multiple fonts in one sweep. I was not able 
>>>>>> to find any instructions on how to train for multiple fonts. The 
>>>>>> official manual is also unclear. Your script helped me get started. 
>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>> [email protected] wrote:
>>>>>>
>>>>>>> OK, I will try as you said.
>>>>>>> One more thing: what role do the lines of the training text play? I 
>>>>>>> have seen that the Bengali training lines are long, with many words, 
>>>>>>> so I want to know how many words or characters per line would be the 
>>>>>>> better choice for training. 
>>>>>>> And should '--xsize=3600', '--ysize=350' be set according to the 
>>>>>>> number of words per line?
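
I am not aware of an official ideal line length. One practical option is to 
cap the number of words per line so that each line renders within the chosen 
--xsize. Here is a sketch of that preprocessing step. The function name and 
the default of 10 words are arbitrary assumptions, not values from the 
Tesseract documentation.

```python
def split_long_lines(lines, max_words=10):
    """Break each line into chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks = []
    for line in lines:
        words = line.split()
        # Empty lines produce no words and are dropped
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

The result can be written back to the training text file before running the 
splitting script.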
>>>>>>>
>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>>
>>>>>>>> Include the default fonts also in your fine-tuning list of fonts 
>>>>>>>> and see if that helps.
>>>>>>>>
>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I have trained some new fonts for the Bengali language in 
>>>>>>>>> Tesseract 5 using the fine-tuning method, with the official 
>>>>>>>>> training text, tessdata_best, and so on. Everything works, except 
>>>>>>>>> that text in the default font, which the original model was 
>>>>>>>>> trained on, is no longer recognized as well as before, while my 
>>>>>>>>> new fonts work well. I don't understand why this is happening. I 
>>>>>>>>> am sharing my code so you can see what is going on.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Code for creating the tif, gt.txt, and .box files:
>>>>>>>>> import os
>>>>>>>>> import random
>>>>>>>>> import pathlib
>>>>>>>>> import subprocess
>>>>>>>>> import argparse
>>>>>>>>> from FontList import FontList
>>>>>>>>>
>>>>>>>>> def read_line_count():
>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>             return int(file.read())
>>>>>>>>>     return 0
>>>>>>>>>
>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>
>>>>>>>>> def create_training_data(training_text_file, font_list, 
>>>>>>>>> output_directory, start_line=None, end_line=None):
>>>>>>>>>     lines = []
>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>     
>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>     
>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>     
>>>>>>>>>     if start_line is None:
>>>>>>>>>         # Set the starting line_count from the file
>>>>>>>>>         line_count = read_line_count()
>>>>>>>>>     else:
>>>>>>>>>         line_count = start_line
>>>>>>>>>     
>>>>>>>>>     if end_line is None:
>>>>>>>>>         # Set the ending line_count
>>>>>>>>>         end_line_count = len(lines) - 1
>>>>>>>>>     else:
>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>     
>>>>>>>>>     # Iterate through all the fonts in the font_list
>>>>>>>>>     for font in font_list.fonts:
>>>>>>>>>         font_serial = 1
>>>>>>>>>         for line in lines:
>>>>>>>>>             training_text_file_name = pathlib.Path(
>>>>>>>>> training_text_file).stem
>>>>>>>>>             
>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>             
>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>                 output_directory,
>>>>>>>>>                 f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>             
>>>>>>>>>             # Image filename (unique filename for each font)
>>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>>             subprocess.run([
>>>>>>>>>                 'text2image',
>>>>>>>>>                 f'--font={font}',
>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>                 '--leading=36',
>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>                 '--ysize=350',
>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>                 '--exposure=0',
>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>             ])
>>>>>>>>>             
>>>>>>>>>             line_count += 1
>>>>>>>>>             font_serial += 1
>>>>>>>>>         
>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>         font_serial = 1
>>>>>>>>>     
>>>>>>>>>     # Update the line_count in the file
>>>>>>>>>     write_line_count(line_count)
>>>>>>>>>
>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>     
>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>     
>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>     font_list = FontList()
>>>>>>>>>      
>>>>>>>>>     create_training_data(training_text_file, font_list, 
>>>>>>>>> output_directory, args.start, args.end)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And the training code:
>>>>>>>>>
>>>>>>>>> import subprocess
>>>>>>>>>
>>>>>>>>> # List of font names
>>>>>>>>> font_names = ['ben']
>>>>>>>>>
>>>>>>>>> for font in font_names:
>>>>>>>>>     command = (
>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
>>>>>>>>>         "LANG_TYPE=Indic"
>>>>>>>>>     )
>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>>> Thanks, everyone.
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>
