The problem for regex is that Tesseract is not consistent in its 
replacements. 
Suppose the original English training data doesn't contain the letter /u/. 
What does Tesseract do when it encounters /u/ in actual processing?
In some cases, it replaces it with a closely similar letter such as /v/ or 
/w/. In other cases, it removes it entirely. That is what is happening in 
my case: those characters are sometimes removed completely; other times, 
they are replaced by closely resembling characters. Because of this 
inconsistency, applying regex is very difficult. 
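
The best I can come up with is to encode both failure modes (substitution 
and deletion) in a single pattern and match the OCR output against a 
wordlist with it. Below is a minimal sketch in Python; the confusion table 
is hypothetical and would have to be filled with the characters that 
actually get mangled:

import re

# Hypothetical confusion table: each fragile character may be replaced by a
# look-alike or dropped entirely ('').
CONFUSIONS = {"u": ["v", "w", ""]}

def variant_pattern(word):
    """Build a regex matching any OCR variant of `word` under CONFUSIONS."""
    parts = []
    for ch in word:
        alternatives = [re.escape(ch)] + [re.escape(v) for v in CONFUSIONS.get(ch, []) if v]
        piece = "(?:" + "|".join(alternatives) + ")"
        if "" in CONFUSIONS.get(ch, []):
            piece += "?"  # the character may be missing altogether
        parts.append(piece)
    return re.compile("".join(parts))

pat = variant_pattern("but")
print(bool(pat.fullmatch("bvt")))  # True: look-alike substitution
print(bool(pat.fullmatch("bt")))   # True: character removed
print(bool(pat.fullmatch("but")))  # True: intact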

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com 
wrote:

> If some specific characters or words are always missing from the OCR 
> result, then you can apply logic with regular expressions in your 
> application. After OCR, these specific characters or words can be replaced 
> by the correct characters or words that you define in your application 
> with regular expressions. That can solve some major problems.
>
> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com 
> wrote:
>
>> The characters are still getting missed, even after fine-tuning. 
>> I never made any progress, even though I tried many different ways. Some 
>> specific characters are always missing from the OCR result.  
>>
>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>> mdalihu...@gmail.com wrote:
>>
>>> I think EasyOCR is best for ID cards and that kind of image processing, 
>>> but for document images like books, Tesseract is better than EasyOCR. I 
>>> haven't actually used EasyOCR myself; you can try it.
>>>
>>> I have added dictionary words, but the result is the same. 
>>>
>>> What kind of problem did you face when fine-tuning with a few new 
>>> characters, as you said (*but, I failed in every possible way to introduce 
>>> a few new characters into the database*)?
>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com 
>>> wrote:
>>>
>>>> Yes, we are new to this. I find the instructions (the manual) very hard 
>>>> to follow. The video you linked above was really helpful for getting 
>>>> started. My plan at the beginning was to fine-tune the existing 
>>>> .traineddata, but I failed in every possible way to introduce a few new 
>>>> characters into the database. That is why I started from scratch. 
>>>>
>>>> Sure, I will follow Lorenzo's suggestion: I will run more iterations and 
>>>> see if I can improve. 
>>>>
>>>> Another area we need to explore is the use of dictionaries. Maybe adding 
>>>> millions of words to the dictionary could help Tesseract. I don't have 
>>>> millions of words, but I am looking into some corpora to get more words 
>>>> into the dictionary; one possible way to pack them in is sketched below. 
>>>>
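>>>> A rough sketch of how a bigger wordlist could be packed into the 
>>>> traineddata is below. I haven't verified it end to end; the component 
>>>> name oro.lstm-word-dawg and the exact tool arguments are my assumptions 
>>>> from the training documentation.
>>>>
>>>> import subprocess
>>>>
>>>> LANG = "oro"  # placeholder language code
>>>>
>>>> # 1. Unpack the existing traineddata into its components (oro.lstm, oro.unicharset, ...).
>>>> subprocess.run(["combine_tessdata", "-u", f"{LANG}.traineddata", f"{LANG}."], check=True)
>>>>
>>>> # 2. Build a DAWG from a plain one-word-per-line wordlist, using the unpacked unicharset.
>>>> subprocess.run(["wordlist2dawg", f"{LANG}.wordlist",
>>>>                 f"{LANG}.lstm-word-dawg", f"{LANG}.unicharset"], check=True)
>>>>
>>>> # 3. Overwrite just that component inside the traineddata.
>>>> subprocess.run(["combine_tessdata", "-o", f"{LANG}.traineddata",
>>>>                 f"{LANG}.lstm-word-dawg"], check=True)
>>>>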
>>>> If this all fails, EasyOCR (and probably other similar open-source 
>>>> packages) is probably our next option to try. Sure, sharing our 
>>>> experiences will be helpful. I will let you know if I make good progress 
>>>> with any of these options. 
>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>> mdalihu...@gmail.com wrote:
>>>>
>>>>> How is your training going for Bengali?  It was nearly good, but I 
>>>>> faced a spacing problem between words: some words get a space, but most 
>>>>> of them have no space between them. I think the problem is in the 
>>>>> dataset, but I used the default Bengali training dataset from Tesseract, 
>>>>> so I am confused and have to explore more. By the way, you can try what 
>>>>> Lorenzo Blz said. Training from scratch is actually harder than 
>>>>> fine-tuning, so you can explore different datasets. If you succeed, 
>>>>> please let me know how you did the whole process. I'm also new to this 
>>>>> field.
>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>> desal...@gmail.com wrote:
>>>>>
>>>>>> How is your training going for Bengali?
>>>>>> I have been trying to train from scratch. I made about 64,000 lines of 
>>>>>> text (which produced about 255,000 files in the end) and ran the 
>>>>>> training for 150,000 iterations, getting a 0.51 training error rate. I 
>>>>>> was hoping to get reasonable accuracy. Unfortunately, when I run OCR 
>>>>>> using the resulting .traineddata, the accuracy is absolutely terrible. 
>>>>>> Do you think I made some mistakes, or is that an expected result?
>>>>>>
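>>>>>> One check I still want to try is lstmeval on the final checkpoint 
>>>>>> against a held-out list, to see whether the model itself is weak or 
>>>>>> whether something went wrong when the .traineddata was packed. A rough 
>>>>>> sketch, with placeholder paths for my setup:
>>>>>>
>>>>>> import subprocess
>>>>>>
>>>>>> # Evaluate a checkpoint on a held-out list of .lstmf files (placeholder paths).
>>>>>> subprocess.run([
>>>>>>     "lstmeval",
>>>>>>     "--model", "data/oro/checkpoints/oro_checkpoint",
>>>>>>     "--traineddata", "data/oro/oro.traineddata",
>>>>>>     "--eval_listfile", "data/oro/list.eval",
>>>>>> ], check=True)
>>>>>>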
>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>> mdalihu...@gmail.com wrote:
>>>>>>
>>>>>>> Yes, he doesn't mention all the fonts, only one font. That is why he 
>>>>>>> didn't use *MODEL_NAME* in a separate script file, I think.
>>>>>>>
>>>>>>> Actually, training consumes all the *tif, gt.txt, and .box files* that 
>>>>>>> were created under the *MODEL_NAME* (I mean the *eng, ben, or oro* 
>>>>>>> language code), because when we first create the *tif, gt.txt, and 
>>>>>>> .box files*, every file name starts with the *MODEL_NAME*. The same 
>>>>>>> *MODEL_NAME* is selected in the training script, which loops over each 
>>>>>>> tif, gt.txt, and .box file created under it.
>>>>>>>
>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>> desal...@gmail.com wrote:
>>>>>>>
>>>>>>>> Yes, I am familiar with the video and have set up the folder 
>>>>>>>> structure as you did. Indeed, I have done a number of fine-tuning 
>>>>>>>> runs with a single font following Garcia's video. But your script is 
>>>>>>>> much better because it supports multiple fonts. The whole improvement 
>>>>>>>> you made is brilliant and very useful, and it is all working for me. 
>>>>>>>> The only part that I didn't understand is the trick you used in your 
>>>>>>>> tesseract_training.py script. You see, I have been doing exactly what 
>>>>>>>> you did, except for this script. 
>>>>>>>>
>>>>>>>> The script seems to have the trick of sending/teaching each of the 
>>>>>>>> fonts (iteratively) into the model. The script I have been using 
>>>>>>>> (which I got from Garcia) doesn't mention fonts at all: 
>>>>>>>>
>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro 
>>>>>>>> TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000*
>>>>>>>>
>>>>>>>> Does it mean that my model doesn't train on the fonts (even though 
>>>>>>>> the fonts were included in the splitting process, in the other 
>>>>>>>> script)?
>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> import subprocess
>>>>>>>>>
>>>>>>>>> # List of font names
>>>>>>>>> font_names = ['ben']
>>>>>>>>>
>>>>>>>>> for font in font_names:
>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>
>>>>>>>>> 1. This is the training script, which I have named 
>>>>>>>>> tesseract_training.py, inside the tesstrain folder.
>>>>>>>>> 2. The root directory means your main training folder, which 
>>>>>>>>> contains the langdata, tesseract, and tesstrain folders. If you 
>>>>>>>>> watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, 
>>>>>>>>> you will understand the folder structure better. I only created 
>>>>>>>>> tesseract_training.py inside the tesstrain folder for training; the 
>>>>>>>>> FontList.py file lives in the main path, alongside langdata, 
>>>>>>>>> tesseract, tesstrain, and split_training_text.py.
>>>>>>>>> 3. First of all, you have to put all the fonts in your Linux fonts 
>>>>>>>>> folder, /usr/share/fonts/, then run sudo apt update and then sudo 
>>>>>>>>> fc-cache -fv.
>>>>>>>>>
>>>>>>>>> After that, you have to add the exact font names to the FontList.py 
>>>>>>>>> file, like I did; a quick way to check the exact names is shown below.
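>>>>>>>>>
>>>>>>>>> For example, this prints the family names Fontconfig actually knows, 
>>>>>>>>> so the strings in FontList.py can be copied verbatim (this helper is 
>>>>>>>>> my own addition, not part of the training scripts):
>>>>>>>>>
>>>>>>>>> import subprocess
>>>>>>>>>
>>>>>>>>> # List every installed font family, sorted and de-duplicated;
>>>>>>>>> # text2image is picky about exact family names.
>>>>>>>>> subprocess.run("fc-list : family | sort -u", shell=True)
>>>>>>>>>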
>>>>>>>>> I have attached two pictures of my folder structure: the first shows 
>>>>>>>>> the main structure and the second shows the tesstrain folder.
>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 
>>>>>>>>> 2023-09-11 135014.png] 
>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> Thank you so much for putting out these brilliant scripts. They 
>>>>>>>>>> make the process  much more efficient.
>>>>>>>>>>
>>>>>>>>>> I have one more question on the other script that you use to 
>>>>>>>>>> train. 
>>>>>>>>>>
>>>>>>>>>> import subprocess
>>>>>>>>>>
>>>>>>>>>> # List of font names
>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>
>>>>>>>>>> for font in font_names:
>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>
>>>>>>>>>> Do you have the names of the fonts listed in a file in the 
>>>>>>>>>> same/root directory?
>>>>>>>>>> How do you set up the names of the fonts in that file, if you don't 
>>>>>>>>>> mind sharing it?
>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> You can use the new script below; it's better than the previous 
>>>>>>>>>>> two scripts. You can create the *tif, gt.txt, and .box files* with 
>>>>>>>>>>> multiple fonts, and it also supports a checkpoint: if VS Code 
>>>>>>>>>>> closes, or anything else interrupts the creation of the *tif, 
>>>>>>>>>>> gt.txt, and .box files*, you can use the checkpoint to resume from 
>>>>>>>>>>> where you stopped.
>>>>>>>>>>>
>>>>>>>>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> import os
>>>>>>>>>>> import random
>>>>>>>>>>> import pathlib
>>>>>>>>>>> import subprocess
>>>>>>>>>>> import argparse
>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>
>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>     lines = []
>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>
>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>
>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>
>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>
>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>
>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>
>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>
>>>>>>>>>>>             # Ground-truth text file, one per line and font.
>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>
>>>>>>>>>>>             # Render the matching .tif and .box with text2image.
>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>             ])
>>>>>>>>>>>
>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>
>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>
>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>
>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Then create a file called FontList.py in the root directory and 
>>>>>>>>>>> paste this into it:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> class FontList:
>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>         ]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It is then imported in the script above (from FontList import FontList).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Command to run with a start/end checkpoint:*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> sudo python3 split_training_text.py --start 0  --end 11
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Change the checkpoint range (--start 0 --end 11) according to 
>>>>>>>>>>> where you stopped.
>>>>>>>>>>>
>>>>>>>>>>> *And the training checkpoint works as you already know.*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi mdalihu, 
>>>>>>>>>>>> The script you posted here seems much more extensive than the one 
>>>>>>>>>>>> you posted before: 
>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>
>>>>>>>>>>>> I have been using your earlier script. It is magical. How is this 
>>>>>>>>>>>> one different from the earlier one?
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have saved 
>>>>>>>>>>>> me countless hours by running multiple fonts in one sweep. I was 
>>>>>>>>>>>> not able to find any instructions on how to train for multiple 
>>>>>>>>>>>> fonts, and the official manual is also unclear. Your script helped 
>>>>>>>>>>>> me get started. 
>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>> One more thing: what role do the training_text lines play? I 
>>>>>>>>>>>>> have seen that Bengali text has long lines of words, so I want 
>>>>>>>>>>>>> to know how many words or characters per line would be the 
>>>>>>>>>>>>> better choice for training. And should '--xsize=3600', 
>>>>>>>>>>>>> '--ysize=350' be set according to the words per line?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Include the default fonts also in your fine-tuning list of 
>>>>>>>>>>>>>> fonts and see if that helps.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have trained some new fonts with the fine-tuning method for 
>>>>>>>>>>>>>>> the Bengali language in Tesseract 5, using the official 
>>>>>>>>>>>>>>> training_text, tessdata_best, and the other required files. 
>>>>>>>>>>>>>>> Everything is good, but the problem is that the default fonts 
>>>>>>>>>>>>>>> trained before no longer convert text as well as they did 
>>>>>>>>>>>>>>> previously, while my new fonts work well. I don't understand 
>>>>>>>>>>>>>>> why this is happening. I am sharing the code below to show 
>>>>>>>>>>>>>>> what is going on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *And the training code:*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any suggestions to help identify the problem?
>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
