Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Des Bw Wed, 13 Sep 2023 02:17:36 -0700

Dear Lorenzo, 
Yes, the accuracy goes up and down, and the training finally selects the 
best (lowest) error rate. I thought 150,000 iterations should actually cover 
the whole dataset because, if I remember correctly, Tesseract trains on the 
.lstmf files. 
I think the process is: text lines --> images --> boxes. The boxes contain 
information about both the text and its position (coordinates) in the 
image. Those boxes are then converted to .lstmf files to be used by 
Tesseract. There are 64,000 .lstmf files, so I was assuming that each file 
was run at least twice. 
That was my understanding (assumption). But I think you are right that 
running every line multiple times might be the way to go. I will try 
running more iterations, then: I will raise it to 200,000 and see whether 
the results improve.
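
As a rough check of the arithmetic above, here is a minimal sketch. It 
assumes, as Lorenzo suggests, that one iteration consumes exactly one line 
image. The function names are hypothetical and are not part of Tesseract or 
tesstrain.

```python
import math


def passes_over_dataset(iterations: int, num_line_images: int) -> float:
    """Average number of times each line image is seen.

    Assumes one training iteration consumes exactly one line image.
    """
    if num_line_images <= 0:
        raise ValueError("num_line_images must be positive")
    return iterations / num_line_images


def iterations_for_passes(passes: float, num_line_images: int) -> int:
    """Iterations needed to see every line image `passes` times."""
    return math.ceil(passes * num_line_images)


if __name__ == "__main__":
    num_line_images = 64_000  # the file count mentioned in this message
    for max_iterations in (150_000, 200_000):
        passes = passes_over_dataset(max_iterations, num_line_images)
        print(f"{max_iterations} iterations ~ {passes:.2f} passes")
    # Hypothetical target of 10 passes, chosen only for illustration
    print(iterations_for_passes(10, num_line_images))
```

Under that assumption, 150,000 iterations over 64,000 line images is a bit 
more than two passes. If the count that matters were the roughly 255,000 
files instead, the same 150,000 iterations would be less than one pass.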


On Wednesday, September 13, 2023 at 10:34:23 AM UTC+3 Lorenzo Blz wrote:

> I'm not 100% sure but, if I remember correctly, one iteration, in this 
> context, means one image so with 150000 iterations you did not even use the 
> whole dataset.
>
> Also, especially when training from scratch, you likely need to pass over 
> the whole dataset multiple times.
>
> You should let the training run until you see that the accuracy score 
> stops improving (you should use a separate dataset for this, but let's keep 
> it simple). The accuracy will go up and down a little but you should see a 
> constant improvement over time. If this does not happen there is a problem.
>
> It may take 24 hours or more depending on the hardware, dataset, etc.
>
> The training process should save intermediate models so you should be able 
> to stop it and resume it later from the last saved model.
>
>
> Lorenzo
>
> On Wed, 13 Sep 2023 at 09:13, Des Bw <[email protected]> wrote:
>
>> How is your training going for Bengali?
>> I have been trying to train from scratch. I made about 64,000 lines of 
>> text (which produced about 255,000 files in the end) and ran the training 
>> for 150,000 iterations, getting a 0.51 training error rate. I was hoping to 
>> get reasonable accuracy. Unfortunately, when I run OCR using the resulting 
>> .traineddata, the accuracy is absolutely terrible. Do you think I made 
>> some mistakes, or is that an expected result?
>>
>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] 
>> wrote:
>>
>>> Yes, he doesn't mention all the fonts, only one font. That is why, I 
>>> think, he didn't use MODEL_NAME in a separate script file.
>>>
>>> Actually, here we train on all the tif, gt.txt, and .box files that were 
>>> created under MODEL_NAME (I mean the language code, such as eng, ben, or 
>>> oro), because when we first create the tif, gt.txt, and .box files, every 
>>> file name starts with MODEL_NAME. The training script uses the MODEL_NAME 
>>> we selected to loop over every tif, gt.txt, and .box file created under 
>>> that name.
>>>
>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] 
>>> wrote:
>>>
>>>> Yes, I am familiar with the video and have set up the folder structure 
>>>> as you did. Indeed, I have tried a number of fine-tuning runs with a 
>>>> single font, following Garcia's video. But your script is much better 
>>>> because it supports multiple fonts. The whole improvement you made is 
>>>> brilliant and very useful. It is all working for me. 
>>>> The only part that I didn't understand is the trick you used in your 
>>>> tesseract_train.py script. You see, I have been doing exactly what you 
>>>> did, except for this script. 
>>>>
>>>> The script seems to have the trick of sending/teaching each of the 
>>>> fonts (iteratively) to the model. The script I have been using (which I 
>>>> got from Garcia) doesn't mention fonts at all: 
>>>>
>>>> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro 
>>>> TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>>>>
>>>> Does it mean that my model doesn't train on the fonts (even though the 
>>>> fonts were included in the splitting process, in the other script)?
>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] 
>>>> wrote:
>>>>
>>>>> import subprocess
>>>>>
>>>>> # List of font names
>>>>> font_names = ['ben']
>>>>>
>>>>> for font in font_names:
>>>>>     command = (
>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>     )
>>>>>     subprocess.run(command, shell=True)
>>>>>
>>>>> 1. This is the training command, which I saved as 
>>>>> 'tesseract_training.py' inside the tesstrain folder.
>>>>> 2. The root directory means your main training folder, which contains 
>>>>> the langdata, tesseract, and tesstrain folders. If you watch this 
>>>>> tutorial, 
>>>>> https://www.youtube.com/watch?v=KE4xEzFGSU8 
>>>>> you will understand the folder structure better. I only created 
>>>>> tesseract_training.py in the tesstrain folder for training; the 
>>>>> FontList.py file is in the main path, alongside langdata, tesseract, 
>>>>> tesstrain, and split_training_text.py.
>>>>> 3. First of all, you have to put all the fonts in your Linux fonts 
>>>>> folder, /usr/share/fonts/, then run: sudo apt update, then sudo 
>>>>> fc-cache -fv
>>>>>
>>>>> After that, you have to add the exact font names to the FontList.py 
>>>>> file, as I did.
>>>>> I have added two pictures of my folder structure. The first is the 
>>>>> main structure and the second is the collapsed tesstrain folder.
>>>>>
>>>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 
>>>>> 2023-09-11 135014.png] 
>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] 
>>>>> wrote:
>>>>>
>>>>>> Thank you so much for putting out these brilliant scripts. They make 
>>>>>> the process  much more efficient.
>>>>>>
>>>>>> I have one more question, about the other script that you use to 
>>>>>> train: 
>>>>>>
>>>>>> import subprocess
>>>>>>
>>>>>> # List of font names
>>>>>> font_names = ['ben']
>>>>>>
>>>>>> for font in font_names:
>>>>>>     command = (
>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>     )
>>>>>>     subprocess.run(command, shell=True)
>>>>>>
>>>>>> Do you have the names of the fonts listed in a file in the same/root 
>>>>>> directory?
>>>>>> How do you set up the names of the fonts in the file, if you don't 
>>>>>> mind sharing it?
>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>> [email protected] wrote:
>>>>>>
>>>>>>> You can use the new script below. It's better than the previous two 
>>>>>>> scripts. You can create tif, gt.txt, and .box files with multiple 
>>>>>>> fonts, and it also supports a breakpoint: if VS Code closes or 
>>>>>>> anything else interrupts the run while the tif, gt.txt, and .box 
>>>>>>> files are being created, you can use the checkpoint to resume from 
>>>>>>> where you stopped.
>>>>>>>
>>>>>>> Code for creating the tif, gt.txt, and .box files:
>>>>>>>
>>>>>>>
>>>>>>> import os
>>>>>>> import pathlib
>>>>>>> import subprocess
>>>>>>> import argparse
>>>>>>> from FontList import FontList
>>>>>>>
>>>>>>> def create_training_data(training_text_file, font_list,
>>>>>>>                          output_directory, start_line=None,
>>>>>>>                          end_line=None):
>>>>>>>     # Read as UTF-8 so non-ASCII training text is handled consistently
>>>>>>>     with open(training_text_file, 'r', encoding='utf-8') as input_file:
>>>>>>>         lines = input_file.readlines()
>>>>>>>
>>>>>>>     # makedirs also creates missing parent folders
>>>>>>>     os.makedirs(output_directory, exist_ok=True)
>>>>>>>
>>>>>>>     if start_line is None:
>>>>>>>         start_line = 0
>>>>>>>
>>>>>>>     # Clamp the end so an --end past the last line cannot raise IndexError
>>>>>>>     if end_line is None:
>>>>>>>         end_line = len(lines) - 1
>>>>>>>     end_line = min(end_line, len(lines) - 1)
>>>>>>>
>>>>>>>     training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>
>>>>>>>     for font_name in font_list.fonts:
>>>>>>>         font_tag = font_name.replace(" ", "_")
>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>             line = lines[line_index].strip()
>>>>>>>
>>>>>>>             # One base name per (line, font) pair, shared by all outputs
>>>>>>>             file_base_name = (
>>>>>>>                 f'{training_text_file_name}_{line_index:d}_{font_tag}'
>>>>>>>             )
>>>>>>>             line_gt_text = os.path.join(
>>>>>>>                 output_directory, f'{file_base_name}.gt.txt')
>>>>>>>
>>>>>>>             with open(line_gt_text, 'w', encoding='utf-8') as output_file:
>>>>>>>                 output_file.writelines([line])
>>>>>>>
>>>>>>>             subprocess.run([
>>>>>>>                 'text2image',
>>>>>>>                 f'--font={font_name}',
>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>                 '--max_pages=1',
>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>                 '--leading=36',
>>>>>>>                 '--xsize=3600',
>>>>>>>                 '--ysize=330',
>>>>>>>                 '--char_spacing=1.0',
>>>>>>>                 '--exposure=0',
>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>             ])
>>>>>>>
>>>>>>> if __name__ == "__main__":
>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>     args = parser.parse_args()
>>>>>>>
>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>
>>>>>>>     font_list = FontList()
>>>>>>>
>>>>>>>     create_training_data(training_text_file, font_list,
>>>>>>>                          output_directory, args.start, args.end)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Then create a file called "FontList.py" in the root directory and 
>>>>>>> paste this into it:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> class FontList:
>>>>>>>     def __init__(self):
>>>>>>>         # Exact font names as installed on the system
>>>>>>>         self.fonts = [
>>>>>>>             "Gerlick",
>>>>>>>             "Sagar Medium",
>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>         ]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Then import it in the script above.
>>>>>>>
>>>>>>> Command for running with a breakpoint:
>>>>>>>
>>>>>>> sudo python3 split_training_text.py --start 0  --end 11
>>>>>>>
>>>>>>> Change the checkpoint (--start 0 --end 11) to suit your needs.
>>>>>>>
>>>>>>> The training checkpoint works as you already know.
>>>>>>>
>>>>>>>
>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi mhalidu, 
>>>>>>>> the script you posted here seems much more extensive than the one 
>>>>>>>> you posted before: 
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>> .
>>>>>>>>
>>>>>>>> I have been using your earlier script. It is magical. How is this 
>>>>>>>> one different from the earlier one?
>>>>>>>>
>>>>>>>> Thank you for posting these scripts, by the way. They have saved me 
>>>>>>>> countless hours by running multiple fonts in one sweep. I was not 
>>>>>>>> able to find any instructions on how to train for multiple fonts. 
>>>>>>>> The official manual is also unclear. Your script helped me to get 
>>>>>>>> started. 
>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>> [email protected] wrote:
>>>>>>>>
>>>>>>>>> OK, I will try what you suggested.
>>>>>>>>> One more thing: what role do the lines of the training_text play? 
>>>>>>>>> I have seen that the Bengali text has long lines of words, so I 
>>>>>>>>> want to know how many words or characters per line would be the 
>>>>>>>>> better choice for training. 
>>>>>>>>> And should '--xsize=3600', '--ysize=350' be set according to the 
>>>>>>>>> number of words per line?
>>>>>>>>>
>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>>>>
>>>>>>>>>> Include the default fonts also in your fine-tuning list of fonts 
>>>>>>>>>> and see if that helps.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have trained some new fonts using the fine-tune method for the 
>>>>>>>>>>> Bengali language in Tesseract 5, using the official 
>>>>>>>>>>> training_text, tessdata_best, and everything else. Everything is 
>>>>>>>>>>> good, but the problem is that the default font, which was 
>>>>>>>>>>> trained before, is no longer converted to text as well as it 
>>>>>>>>>>> was previously, while my new fonts work well. I don't understand 
>>>>>>>>>>> why this is happening. I am sharing the code so you can see what 
>>>>>>>>>>> is going on.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Code for creating the tif, gt.txt, and .box files:
>>>>>>>>>>> import os
>>>>>>>>>>> import random
>>>>>>>>>>> import pathlib
>>>>>>>>>>> import subprocess
>>>>>>>>>>> import argparse
>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>
>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>     return 0
>>>>>>>>>>>
>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>
>>>>>>>>>>> def create_training_data(training_text_file, font_list, 
>>>>>>>>>>> output_directory, start_line=None, end_line=None):
>>>>>>>>>>>     lines = []
>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>     
>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>     
>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>     
>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>         # Set the starting line_count from the file
>>>>>>>>>>>         line_count = read_line_count()
>>>>>>>>>>>     else:
>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>     
>>>>>>>>>>>     # Set the ending line_count
>>>>>>>>>>>     # NOTE: end_line_count is computed but never used by the loop below
>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>         end_line_count = len(lines) - 1
>>>>>>>>>>>     else:
>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>     
>>>>>>>>>>>     # Iterate through all the fonts in the font_list
>>>>>>>>>>>     for font in font_list.fonts:
>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>             training_text_file_name = pathlib.Path(
>>>>>>>>>>> training_text_file).stem
>>>>>>>>>>>             
>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>             
>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>>>                 output_directory,
>>>>>>>>>>>                 f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>             
>>>>>>>>>>>             # Image filename (unique filename for each font)
>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>             ])
>>>>>>>>>>>             
>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>         
>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>     
>>>>>>>>>>>     # Update the line_count in the file
>>>>>>>>>>>     write_line_count(line_count)
>>>>>>>>>>>
>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>     
>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>     
>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>      
>>>>>>>>>>>     create_training_data(training_text_file, font_list, 
>>>>>>>>>>> output_directory, args.start, args.end)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And the training code:
>>>>>>>>>>>
>>>>>>>>>>> import subprocess
>>>>>>>>>>>
>>>>>>>>>>> # List of font names
>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>
>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>     command = (
>>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
>>>>>>>>>>>         "LANG_TYPE=Indic"
>>>>>>>>>>>     )
>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>  
>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>> .
>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a8ae7109-9f56-45bb-85d6-92cdd61a08bbn%40googlegroups.com.
