Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Ali hussain Sun, 10 Sep 2023 18:27:33 -0700

You can use the new script below. It is better than the previous two 
scripts. It creates the *.tif, .gt.txt, and .box files* for multiple fonts, 
and it also supports a breakpoint: if VS Code closes or anything else 
interrupts the run while the files are being created, you can use a 
checkpoint to resume from where it stopped.


Script for creating the *.tif, .gt.txt, and .box files*:


import os
import pathlib
import subprocess
import argparse
from FontList import FontList


def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    # Read the training text; each line becomes one training sample.
    with open(training_text_file, 'r', encoding='utf-8') as input_file:
        lines = input_file.readlines()

    # makedirs also creates missing parent directories and tolerates reruns.
    os.makedirs(output_directory, exist_ok=True)

    # Default to the whole file; clamp --end so it cannot run past the last line.
    if start_line is None:
        start_line = 0
    if end_line is None:
        end_line = len(lines) - 1
    end_line = min(end_line, len(lines) - 1)

    training_text_file_name = pathlib.Path(training_text_file).stem

    for font_name in font_list.fonts:
        font_tag = font_name.replace(" ", "_")
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            # One base name per (line, font) pair, so fonts never overwrite
            # each other's files.
            file_base_name = f'{training_text_file_name}_{line_index}_{font_tag}'

            # Ground-truth text for this sample.
            line_gt_text = os.path.join(output_directory,
                                        f'{file_base_name}.gt.txt')
            with open(line_gt_text, 'w', encoding='utf-8') as output_file:
                output_file.write(line)

            # text2image renders the line to <outputbase>.tif and <outputbase>.box.
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int,
                        help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int,
                        help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory,
                         args.start, args.end)
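
After a run, every line/font pair should have three files sharing one base name: the .tif and .box written by text2image and the .gt.txt written by the script. The following is only a rough sketch for spotting samples where one of the three is missing (for example because a font could not render a line). The function name incomplete_samples and the EXPECTED tuple are my own for illustration; they are not part of tesstrain.

```python
import os
from collections import defaultdict

# Extensions each training sample is expected to have (assumption based on
# what the script and text2image write).
EXPECTED = (".tif", ".gt.txt", ".box")


def incomplete_samples(output_directory):
    """Map each sample base name to the expected extensions it is missing."""
    found = defaultdict(set)
    for name in os.listdir(output_directory):
        for ext in EXPECTED:
            if name.endswith(ext):
                # Strip the extension to recover the shared base name.
                found[name[: -len(ext)]].add(ext)
                break

    report = {}
    for base, exts in found.items():
        missing = sorted(set(EXPECTED) - exts)
        if missing:
            report[base] = missing
    return report
```

Calling incomplete_samples('tesstrain/data/eng-ground-truth') returns an empty dict when every sample is complete; any entries it does return are candidates to regenerate or delete before training.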



Then create a file called "FontList.py" in the root directory and paste the 
following into it:



class FontList:
    def __init__(self):
        # Font names exactly as text2image expects them.
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]
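
One thing to watch for when editing this list: Python joins adjacent string literals when no comma separates them, so a single missing comma silently merges two font names into one that text2image will not recognise. A small illustration, using two of the font names from the list (the variable names are only for the demo):

```python
# Adjacent string literals with no comma between them are concatenated.
fonts_missing_comma = [
    "Gerlick"
    "Sagar Medium",
]

# With the comma in place, the list holds two separate font names.
fonts_with_comma = [
    "Gerlick",
    "Sagar Medium",
]

print(fonts_missing_comma)    # ['GerlickSagar Medium']
print(len(fonts_with_comma))  # 2
```

Printing len(FontList().fonts) after every edit is a cheap way to catch this.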



Then import it in the script above (the line "from FontList import FontList" 
already does this).


*Command for the breakpoint:*


sudo python3 split_training_text.py --start 0  --end 11



Change the checkpoint range (--start 0 --end 11) to suit your own run.

*The training checkpoint works as you already know.*
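
If you are not sure where an interrupted run stopped, one option is to look at which images already exist instead of guessing the --start value. This is only a sketch under my own assumptions: pending_jobs is a hypothetical helper, and it assumes the file naming used by the script above (text stem, line index, font name with spaces replaced by underscores) and that text2image writes a .tif next to each .gt.txt.

```python
import os


def pending_jobs(output_directory, text_stem, fonts, start_line, end_line):
    """Return (line_index, font_name) pairs whose .tif is not on disk yet."""
    pending = []
    for font_name in fonts:
        font_tag = font_name.replace(" ", "_")
        for line_index in range(start_line, end_line + 1):
            base = f"{text_stem}_{line_index}_{font_tag}"
            tif_path = os.path.join(output_directory, base + ".tif")
            if not os.path.exists(tif_path):
                pending.append((line_index, font_name))
    return pending
```

For example, pending_jobs('tesstrain/data/eng-ground-truth', 'eng', FontList().fonts, 0, 11) lists the line/font pairs still to be rendered; the smallest line index in that list is a safe value for --start.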


On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

> Hi mhalidu, 
> the script you posted here seems much more extensive than the one you 
> posted before: 
> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
> .
>
> I have been using your earlier script. It is magical. How is this one 
> different from the earlier one?
>
> Thank you for posting these scripts, by the way. They have saved me 
> countless hours by running multiple fonts in one sweep. I was not able to 
> find any instructions on how to train for multiple fonts. The official 
> manual is also unclear. Your script helped me get started. 
> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com 
> wrote:
>
>> OK, I will try as you said.
>> One more thing: what role do the lines of the training text play? I have 
>> seen that the lines in the Bengali training text are long, so I want to 
>> know how many words or characters per line would be the better choice for 
>> training. 
>> And should '--xsize=3600', '--ysize=350' be set according to the length of 
>> the lines?
>>
>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>
>>> Include the default fonts also in your fine-tuning list of fonts and see 
>>> if that helps.
>>>
>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:
>>>
>>>> I have fine-tuned Tesseract 5 on some new fonts for the Bengali 
>>>> language, using the official training text, tessdata_best, and the 
>>>> other standard resources. Everything works, except that the default 
>>>> font the model was trained on before is no longer converted to text as 
>>>> well as it was previously, while my new fonts work well. I don't 
>>>> understand why this is happening. I am sharing my code so you can see 
>>>> what is going on.
>>>>
>>>>
>>>> *Code for creating the .tif, .gt.txt, and .box files:*
>>>> import os
>>>> import random
>>>> import pathlib
>>>> import subprocess
>>>> import argparse
>>>> from FontList import FontList
>>>>
>>>> def read_line_count():
>>>>     if os.path.exists('line_count.txt'):
>>>>         with open('line_count.txt', 'r') as file:
>>>>             return int(file.read())
>>>>     return 0
>>>>
>>>> def write_line_count(line_count):
>>>>     with open('line_count.txt', 'w') as file:
>>>>         file.write(str(line_count))
>>>>
>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>     lines = []
>>>>     with open(training_text_file, 'r') as input_file:
>>>>         for line in input_file.readlines():
>>>>             lines.append(line.strip())
>>>>     
>>>>     if not os.path.exists(output_directory):
>>>>         os.mkdir(output_directory)
>>>>     
>>>>     random.shuffle(lines)
>>>>     
>>>>     if start_line is None:
>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>     else:
>>>>         line_count = start_line
>>>>     
>>>>     if end_line is None:
>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>     else:
>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>     
>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>         font_serial = 1
>>>>         for line in lines:
>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>             
>>>>             # Generate a unique serial number for each line
>>>>             line_serial = f"{line_count:d}"
>>>>             
>>>>             # GT (Ground Truth) text filename
>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>             with open(line_gt_text, 'w') as output_file:
>>>>                 output_file.writelines([line])
>>>>             
>>>>             # Image filename
>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>             subprocess.run([
>>>>                 'text2image',
>>>>                 f'--font={font}',
>>>>                 f'--text={line_gt_text}',
>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>                 '--max_pages=1',
>>>>                 '--strip_unrenderable_words',
>>>>                 '--leading=36',
>>>>                 '--xsize=3600',
>>>>                 '--ysize=350',
>>>>                 '--char_spacing=1.0',
>>>>                 '--exposure=0',
>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>             ])
>>>>             
>>>>             line_count += 1
>>>>             font_serial += 1
>>>>         
>>>>         # Reset font_serial for the next font iteration
>>>>         font_serial = 1
>>>>     
>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>
>>>> if __name__ == "__main__":
>>>>     parser = argparse.ArgumentParser()
>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>     args = parser.parse_args()
>>>>     
>>>>     training_text_file = 'langdata/ben.training_text'
>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>     
>>>>     # Create an instance of the FontList class
>>>>     font_list = FontList()
>>>>      
>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>
>>>>
>>>> *And the training code:*
>>>>
>>>> import subprocess
>>>>
>>>> # List of font names
>>>> font_names = ['ben']
>>>>
>>>> for font in font_names:
>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>     subprocess.run(command, shell=True)
>>>>
>>>>
>>>> Any suggestions for identifying the problem?
>>>> Thanks, everyone.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>> .
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c504c4e3-4cd3-4514-b61d-819c77ba933en%40googlegroups.com.
