Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Ali hussain Tue, 12 Sep 2023 13:11:39 -0700

Yes, he doesn't mention all the fonts, only one font. That is why he didn't 
use *MODEL_NAME* that way.


Actually, here we train on all the *tif, gt.txt, and .box files* that were 
created under *MODEL_NAME* (I mean the *eng, ben, or oro* flag, i.e. the 
language code), because when we first create the *tif, gt.txt, and .box 
files*, every file name starts with *MODEL_NAME*. The *MODEL_NAME* we select 
in the training script is what makes it loop over every tif, gt.txt, and 
.box file created under that name.
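
A minimal sketch of that selection, for anyone who wants to check it before training. It assumes tesstrain's default layout, where the ground truth for a model sits in data/MODEL_NAME-ground-truth; the helper names ground_truth_dir and check_ground_truth are made up for this illustration and are not part of tesstrain.

```python
from pathlib import Path


def ground_truth_dir(model_name, data_dir="data"):
    """Directory that training is assumed to read for this MODEL_NAME."""
    return Path(data_dir) / f"{model_name}-ground-truth"


def check_ground_truth(model_name, data_dir="data"):
    """Split line samples into complete and incomplete sets.

    A sample counts as complete when its .gt.txt file has a matching
    .tif image and .box file next to it.
    """
    gt_dir = ground_truth_dir(model_name, data_dir)
    complete, incomplete = [], []
    for gt_file in sorted(gt_dir.glob("*.gt.txt")):
        base = gt_file.name[: -len(".gt.txt")]
        has_tif = (gt_dir / f"{base}.tif").exists()
        has_box = (gt_dir / f"{base}.box").exists()
        (complete if has_tif and has_box else incomplete).append(base)
    return complete, incomplete
```

Running check_ground_truth("ben") from inside the tesstrain folder would then list which samples a MODEL_NAME=ben run can pick up, whichever font produced them.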
On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

> Yes, I am familiar with the video and have set up the folder structure as 
> you did. Indeed, I have tried a number of fine-tuning runs with a single 
> font, following Garcia's video. But your script is much better because it 
> supports multiple fonts. The whole improvement you made is brilliant and 
> very useful. It is all working for me. 
> The only part that I didn't understand is the trick you used in your 
> tesseract_train.py script. You see, I have been doing exactly what you did 
> except for this script. 
>
> The script seems to have the trick of sending/teaching each of the fonts 
> (iteratively) to the model. The script I have been using (which I got 
> from Garcia) doesn't mention fonts at all. 
>
> *TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro 
> TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000*
> Does it mean that my model doesn't train on the fonts (even though the 
> fonts were included in the splitting process, in the other script)?
> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] 
> wrote:
>
>> import subprocess
>>
>> # List of font names
>> font_names = ['ben']
>>
>> for font in font_names:
>>     command = (
>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>         f"MODEL_NAME={font} START_MODEL=ben "
>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>     )
>>     subprocess.run(command, shell=True)
>>
>> 1. This command is for training; I saved it as 'tesseract_training.py' 
>> inside the tesstrain folder.
>> 2. The root directory means your main training folder, which contains 
>> the langdata, tesseract, and tesstrain folders. If you watch this tutorial
>> https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the 
>> folder structure better. I only created tesseract_training.py in the 
>> tesstrain folder for training; the FontList.py file is in the main path, 
>> alongside langdata, tesseract, tesstrain, and split_training_text.py.
>> 3. First of all, you have to put all the fonts in your Linux fonts folder, 
>> /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv
>>
>> After that, you have to add the exact font names to the FontList.py file, 
>> as I did.
>> I have attached two pictures of my folder structure. The first is the 
>> main structure and the second is the collapsed tesstrain folder.
>>
>> [image: Screenshot 2023-09-11 134947.png]
>> [image: Screenshot 2023-09-11 135014.png]
>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] 
>> wrote:
>>
>>> Thank you so much for putting out these brilliant scripts. They make the 
>>> process  much more efficient.
>>>
>>> I have one more question on the other script that you use to train. 
>>>
>>> import subprocess
>>>
>>> # List of font names
>>> font_names = ['ben']
>>>
>>> for font in font_names:
>>>     command = (
>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>     )
>>>     subprocess.run(command, shell=True)
>>>
>>> Do you have the font names listed in a file in the same/root directory?
>>> How do you set up the names of the fonts in the file, if you don't mind 
>>> sharing it?
>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] 
>>> wrote:
>>>
>>>> You can use the new script below. It's better than the previous two 
>>>> scripts. You can create the *tif, gt.txt, and .box files* for multiple 
>>>> fonts, and you can also resume from a breakpoint: if VS Code closes or 
>>>> anything else interrupts the creation of the *tif, gt.txt, and .box 
>>>> files*, you can restart from the line where it stopped.
>>>>
>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>
>>>>
>>>> import os
>>>> import random
>>>> import pathlib
>>>> import subprocess
>>>> import argparse
>>>> from FontList import FontList
>>>>
>>>> def create_training_data(training_text_file, font_list, 
>>>> output_directory, start_line=None, end_line=None):
>>>>     lines = []
>>>>     with open(training_text_file, 'r') as input_file:
>>>>         lines = input_file.readlines()
>>>>
>>>>     if not os.path.exists(output_directory):
>>>>         os.mkdir(output_directory)
>>>>
>>>>     if start_line is None:
>>>>         start_line = 0
>>>>
>>>>     if end_line is None:
>>>>         end_line = len(lines) - 1
>>>>
>>>>     for font_name in font_list.fonts:
>>>>         for line_index in range(start_line, end_line + 1):
>>>>             line = lines[line_index].strip()
>>>>
>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>
>>>>             line_serial = f"{line_index:d}"
>>>>             font_tag = font_name.replace(" ", "_")
>>>>
>>>>             # One base name per (line, font) pair, shared by the
>>>>             # .gt.txt, .tif, and .box files.
>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_tag}'
>>>>             line_gt_text = os.path.join(output_directory, f'{file_base_name}.gt.txt')
>>>>
>>>>             with open(line_gt_text, 'w') as output_file:
>>>>                 output_file.writelines([line])
>>>>
>>>>             subprocess.run([
>>>>                 'text2image',
>>>>                 f'--font={font_name}',
>>>>                 f'--text={line_gt_text}',
>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>                 '--max_pages=1',
>>>>                 '--strip_unrenderable_words',
>>>>                 '--leading=36',
>>>>                 '--xsize=3600',
>>>>                 '--ysize=330',
>>>>                 '--char_spacing=1.0',
>>>>                 '--exposure=0',
>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>             ])
>>>>
>>>> if __name__ == "__main__":
>>>>     parser = argparse.ArgumentParser()
>>>>     parser.add_argument('--start', type=int,
>>>>                         help='Starting line count (inclusive)')
>>>>     parser.add_argument('--end', type=int,
>>>>                         help='Ending line count (inclusive)')
>>>>     args = parser.parse_args()
>>>>
>>>>     training_text_file = 'langdata/eng.training_text'
>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>
>>>>     font_list = FontList()
>>>>
>>>>     create_training_data(training_text_file, font_list, 
>>>> output_directory, args.start, args.end)
>>>>
>>>>
>>>>
>>>> Then create a file called "FontList.py" in the root directory and paste 
>>>> this into it:
>>>>
>>>>
>>>>
>>>> class FontList:
>>>>     def __init__(self):
>>>>         # Exact font names, as installed on the system
>>>>         self.fonts = [
>>>>             "Gerlick",
>>>>             "Sagar Medium",
>>>>             "Ekushey Lohit Normal",
>>>>             "Charukola Round Head Regular, weight=433",
>>>>             "Charukola Round Head Bold, weight=443",
>>>>             "Ador Orjoma Unicode",
>>>>         ]
>>>>
>>>>
>>>>
>>>> Then import it in the code above.
>>>>
>>>>
>>>> *Command with a breakpoint:*
>>>>
>>>>
>>>> sudo python3 split_training_text.py --start 0  --end 11
>>>>
>>>>
>>>>
>>>> Change the --start 0 --end 11 range to suit your own checkpoint.
>>>>
>>>> *And you already know how the training checkpoint works.*
>>>>
>>>>
>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] 
>>>> wrote:
>>>>
>>>>> Hi mhalidu, 
>>>>> the script you posted here seems much more extensive than the one you 
>>>>> posted before: 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>> .
>>>>>
>>>>> I have been using your earlier script. It is magical. How is this one 
>>>>> different from the earlier one?
>>>>>
>>>>> Thank you for posting these scripts, by the way. They have saved me 
>>>>> countless hours by running multiple fonts in one sweep. I was not able 
>>>>> to find any instructions on how to train for multiple fonts. The 
>>>>> official manual is also unclear. Your script helped me get started. 
>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] 
>>>>> wrote:
>>>>>
>>>>>> OK, I will try what you said.
>>>>>> One more thing: what role do the lines of the training_text play? I 
>>>>>> have seen that Bengali text lines contain long words, so I want to 
>>>>>> know how many words or characters per line would be the better choice 
>>>>>> for training. 
>>>>>> And should '--xsize=3600','--ysize=350' be set according to the 
>>>>>> number of words per line?
>>>>>>
>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>
>>>>>>> Include the default fonts also in your fine-tuning list of fonts and 
>>>>>>> see if that helps.
>>>>>>>
>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have trained some new fonts using the fine-tuning method for the 
>>>>>>>> Bengali language in Tesseract 5, using the official training_text, 
>>>>>>>> tessdata_best, and everything else. Everything works, but the 
>>>>>>>> problem is that the default font, which was trained before, no 
>>>>>>>> longer converts text as well as it did previously, while my new 
>>>>>>>> fonts work well. I don't understand why this is happening. I am 
>>>>>>>> sharing the code so you can see what is going on.
>>>>>>>>
>>>>>>>>
>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>> import os
>>>>>>>> import random
>>>>>>>> import pathlib
>>>>>>>> import subprocess
>>>>>>>> import argparse
>>>>>>>> from FontList import FontList
>>>>>>>>
>>>>>>>> def read_line_count():
>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>             return int(file.read())
>>>>>>>>     return 0
>>>>>>>>
>>>>>>>> def write_line_count(line_count):
>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>         file.write(str(line_count))
>>>>>>>>
>>>>>>>> def create_training_data(training_text_file, font_list, 
>>>>>>>> output_directory, start_line=None, end_line=None):
>>>>>>>>     lines = []
>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>         for line in input_file.readlines():
>>>>>>>>             lines.append(line.strip())
>>>>>>>>     
>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>     
>>>>>>>>     random.shuffle(lines)
>>>>>>>>     
>>>>>>>>     if start_line is None:
>>>>>>>>         # Set the starting line_count from the file
>>>>>>>>         line_count = read_line_count()
>>>>>>>>     else:
>>>>>>>>         line_count = start_line
>>>>>>>>     
>>>>>>>>     if end_line is None:
>>>>>>>>         # Set the ending line_count
>>>>>>>>         end_line_count = len(lines) - 1
>>>>>>>>     else:
>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>     
>>>>>>>>     # Iterate through all the fonts in the font_list
>>>>>>>>     for font in font_list.fonts:
>>>>>>>>         font_serial = 1
>>>>>>>>         for line in lines:
>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>             
>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>             
>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>             gt_file_name = f'{training_text_file_name}_{line_serial}.gt.txt'
>>>>>>>>             line_gt_text = os.path.join(output_directory, gt_file_name)
>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>                 output_file.writelines([line])
>>>>>>>>             
>>>>>>>>             # Image filename: unique filename for each font
>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>             subprocess.run([
>>>>>>>>                 'text2image',
>>>>>>>>                 f'--font={font}',
>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>                 '--max_pages=1',
>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>                 '--leading=36',
>>>>>>>>                 '--xsize=3600',
>>>>>>>>                 '--ysize=350',
>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>                 '--exposure=0',
>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>             ])
>>>>>>>>             
>>>>>>>>             line_count += 1
>>>>>>>>             font_serial += 1
>>>>>>>>         
>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>         font_serial = 1
>>>>>>>>     
>>>>>>>>     # Update the line_count in the file
>>>>>>>>     write_line_count(line_count)
>>>>>>>>
>>>>>>>> if __name__ == "__main__":
>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>     args = parser.parse_args()
>>>>>>>>     
>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>     
>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>     font_list = FontList()
>>>>>>>>      
>>>>>>>>     create_training_data(training_text_file, font_list, 
>>>>>>>> output_directory, args.start, args.end)
>>>>>>>>
>>>>>>>>
>>>>>>>> *and for training code:*
>>>>>>>>
>>>>>>>> import subprocess
>>>>>>>>
>>>>>>>> # List of font names
>>>>>>>> font_names = ['ben']
>>>>>>>>
>>>>>>>> for font in font_names:
>>>>>>>>     command = (
>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>     )
>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>
>>>>>>>>
>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>> Thanks, everyone.
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
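
To check how many samples each font contributed to a ground-truth folder (for example, to confirm that the default fonts are represented next to the new ones, as suggested earlier in the thread), something like the sketch below may help. It assumes the '<stem>_<line>_<font>.gt.txt' naming produced by the newer split script quoted above; the function names are hypothetical, and files from the first script (named like 'ben_5.gt.txt') carry no font and are skipped.

```python
from collections import Counter
from pathlib import Path


def font_from_gt_name(file_name):
    """Return the font part of '<stem>_<line>_<font>.gt.txt', or None."""
    if not file_name.endswith(".gt.txt"):
        return None
    base = file_name[: -len(".gt.txt")]
    parts = base.split("_", 2)
    # Expect three parts, with a numeric line index in the middle.
    if len(parts) < 3 or not parts[1].isdigit():
        return None
    # The split script replaced spaces with underscores; undo that here.
    # A font name containing a real underscore would be restored wrongly.
    return parts[2].replace("_", " ")


def count_samples_per_font(ground_truth_dir):
    """Count .gt.txt samples per font in one ground-truth folder."""
    counts = Counter()
    for path in Path(ground_truth_dir).glob("*.gt.txt"):
        font = font_from_gt_name(path.name)
        if font is not None:
            counts[font] += 1
    return counts
```

For instance, count_samples_per_font('tesstrain/data/eng-ground-truth') would return a Counter keyed by font name, which makes a font that was left out or under-sampled easy to spot.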

