Are you training from Tesseract's default text data or your own collected text data?
On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com 
wrote:

> I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The 
> result is absolutely trash: nowhere close to the default/Ray's training. 
>
> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com 
> wrote:
>
>>
>> After Tesseract recognizes text from the images, you can apply regex to 
>> replace the wrong words with the correct ones.
>> I'm not familiar with PaddleOCR or ScanTailor either.
>>
>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com 
>> wrote:
>>
>>> At what stage are you doing the regex replacement?
>>> My process has been: Scan (tif)--> ScanTailor --> Tesseract --> pdf
>>>
>>> > EasyOCR, I think, is best for ID cards and similar images, but for 
>>> document images like books, Tesseract is better than EasyOCR.
>>>
>>> How about PaddleOCR? Are you familiar with it?
>>>
>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>> mdalihu...@gmail.com wrote:
>>>
>>>> I know what you mean, but in some cases it helps me. I have found that 
>>>> specific characters and words are consistently not recognized by Tesseract, 
>>>> so I use these regex replacements to correct those characters and words 
>>>> when they come out wrong.
>>>>
>>>> See what I have done:
>>>>
>>>>     " ী": "ী",
>>>>     " ্": " ",
>>>>     " ে": " ",
>>>>     "জ্া": "জা",
>>>>     "  ": " ",
>>>>     "   ": " ",
>>>>     "    ": " ",
>>>>     "্প": " ",
>>>>     " য": "র্য",
>>>>     "য": "য",
>>>>     " া": "া",
>>>>     "আা": "আ",
>>>>     "ম্ি": "মি",
>>>>     "স্ু": "সু",
>>>>     "হূ ": "হূ",
>>>>     " ণ": "ণ",
>>>>     "র্্": "র",
>>>>     "চিন্ত ": "চিন্তা ",
>>>>     "ন্া": "না",
>>>>     "সম ূর্ন": "সম্পূর্ণ",
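>>>> In code, a table like this can be applied to the raw OCR output in one pass. A minimal sketch (the fix_ocr_text helper and the sample pairs below are illustrative placeholders, not the actual Bengali table above):

```python
# Sketch: post-OCR cleanup with a replacement table.
# The pairs here are placeholders; in practice they would be the
# confusion pairs listed above.
replacements = {
    "vvord": "word",  # example character-confusion pair
    "  ": " ",        # collapse double spaces
}

def fix_ocr_text(text, table):
    # Apply longer patterns first so multi-character fixes are not
    # pre-empted by shorter ones.
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text

print(fix_ocr_text("a  vvord", replacements))
```

>>>> Plain str.replace is enough when the left-hand sides are fixed strings; re.sub is only needed when they are real patterns.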
>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com 
>>>> wrote:
>>>>
>>>>> The problem with regex is that Tesseract is not consistent in its 
>>>>> substitutions.
>>>>> Suppose the original English training data doesn't contain the letter 
>>>>> /u/. What does Tesseract do when it faces /u/ in actual processing?
>>>>> In some cases it replaces it with closely similar letters such as /v/ 
>>>>> or /w/; in other cases it removes it completely. That is what is 
>>>>> happening in my case: those characters are sometimes completely removed, 
>>>>> and other times they are replaced by closely resembling characters. 
>>>>> Because of this inconsistency, applying regex is very difficult.
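>>>>> One way to express this inconsistency in a pattern is to let each unreliable character match its look-alikes or nothing at all. A hedged sketch; the confusion_pattern helper and the {"u": "vw"} mapping are illustrative, not a real Tesseract confusion table:

```python
import re

def confusion_pattern(word, confusions):
    # Build a regex for `word` in which each character with known
    # look-alikes may appear as itself, as a look-alike, or be
    # dropped entirely (the removed-character case).
    parts = []
    for ch in word:
        alts = confusions.get(ch, "")
        if alts:
            parts.append("[" + re.escape(ch + alts) + "]?")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

# /u/ sometimes read as /v/ or /w/, sometimes dropped:
pattern = confusion_pattern("queue", {"u": "vw"})
print(re.sub(pattern, "queue", "the qvewe and the qee"))
```

>>>>> In real text the pattern should also be anchored with word boundaries (\b) so it does not rewrite substrings of unrelated words.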
>>>>>
>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>> mdalihu...@gmail.com wrote:
>>>>>
>>>>>> If some specific characters or words are always missing from the OCR 
>>>>>> result, you can apply regular-expression logic in your application. 
>>>>>> After OCR, those specific characters or words are replaced by the 
>>>>>> correct characters or words that you define in your application with 
>>>>>> regular expressions. This can fix some major problems.
>>>>>>
>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>> desal...@gmail.com wrote:
>>>>>>
>>>>>>> The characters are still getting missed, even after fine-tuning. 
>>>>>>> I never made any progress. I tried many different ways. Some 
>>>>>>> specific characters are always missing from the OCR result.
>>>>>>>
>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>
>>>>>>>> EasyOCR, I think, is best for ID cards and similar images, but for 
>>>>>>>> document images like books, Tesseract is better than EasyOCR. I 
>>>>>>>> haven't used EasyOCR myself; you can try it.
>>>>>>>>
>>>>>>>> I have added dictionary words, but the result is the same. 
>>>>>>>>
>>>>>>>> What kind of problem did you face when fine-tuning with a few new 
>>>>>>>> characters, as you said (*but, I failed in every possible way to 
>>>>>>>> introduce a few new characters into the database*)?
>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 
>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> Yes, we are new to this. I find the instructions (the manual) very 
>>>>>>>>> hard to follow. The video you linked above was really helpful for 
>>>>>>>>> getting started. My plan at the beginning was to fine-tune the 
>>>>>>>>> existing .traineddata, but I failed in every possible way to 
>>>>>>>>> introduce a few new characters into the database. That is why I 
>>>>>>>>> started from scratch.
>>>>>>>>>
>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more 
>>>>>>>>> iterations and see if I can improve. 
>>>>>>>>>
>>>>>>>>> Another area we need to explore is the use of dictionaries. 
>>>>>>>>> Maybe adding millions of words to the dictionary could help 
>>>>>>>>> Tesseract. I don't have millions of words, but I am looking into 
>>>>>>>>> some corpora to get more words into the dictionary.
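>>>>>>>>> For reference, the Tesseract training tools include wordlist2dawg for exactly this step: it compresses a plain word list (one word per line) into the DAWG form that gets packed into .traineddata. A minimal sketch in the style of the scripts in this thread; the file names oro.words.txt, oro.lstm-word-dawg, and oro.unicharset are placeholders:

```python
import subprocess

def dawg_command(wordlist, dawg_out, unicharset):
    # wordlist2dawg takes the plain word list, the output DAWG path,
    # and the model's unicharset.
    return ["wordlist2dawg", wordlist, dawg_out, unicharset]

cmd = dawg_command("oro.words.txt", "oro.lstm-word-dawg", "oro.unicharset")
# Uncomment once the Tesseract training tools are on PATH:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```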
>>>>>>>>>
>>>>>>>>> If this all fails, EasyOCR (and probably other similar open-source 
>>>>>>>>> packages) is probably our next option to try. Sure, sharing our 
>>>>>>>>> experiences will be helpful. I will let you know if I make good 
>>>>>>>>> progress with any of these options. 
>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> How is your training going for Bengali? Mine was nearly good, but 
>>>>>>>>>> I faced spacing problems between words: some words have spaces, 
>>>>>>>>>> but most of them have none. I think the problem is in the dataset, 
>>>>>>>>>> but I used the default Bengali training dataset from Tesseract, so 
>>>>>>>>>> I am confused and have to explore more. By the way, you can try 
>>>>>>>>>> what Lorenzo Blz said. Training from scratch is actually harder 
>>>>>>>>>> than fine-tuning, so you can try different datasets. If you 
>>>>>>>>>> succeed, please let me know how you did the whole process. I'm 
>>>>>>>>>> also new to this field.
>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>> I have been trying to train from scratch. I made about 64,000 
>>>>>>>>>>> lines of text (which produced about 255,000 files in the end) and 
>>>>>>>>>>> ran the training for 150,000 iterations, getting a 0.51 training 
>>>>>>>>>>> error rate. I was hoping to get reasonable accuracy. 
>>>>>>>>>>> Unfortunately, when I run OCR using the resulting .traineddata, 
>>>>>>>>>>> the accuracy is absolutely terrible. Do you think I made some 
>>>>>>>>>>> mistakes, or is that an expected result?
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, he doesn't mention all fonts, only one font. That is why 
>>>>>>>>>>>> he didn't use MODEL_NAME in a separate script file, I think.
>>>>>>>>>>>>
>>>>>>>>>>>> Actually, here we train on all the tif, gt.txt, and .box files 
>>>>>>>>>>>> that are created under MODEL_NAME (I mean the eng, ben, or oro 
>>>>>>>>>>>> language code), because when we first create the tif, gt.txt, 
>>>>>>>>>>>> and .box files, every file name starts with MODEL_NAME. This 
>>>>>>>>>>>> MODEL_NAME is what we select in the training script for looping 
>>>>>>>>>>>> over each tif, gt.txt, and .box file created for it.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the folder 
>>>>>>>>>>>>> structure as you did. Indeed, I have tried a number of 
>>>>>>>>>>>>> fine-tuning runs with a single font following Garcia's video. 
>>>>>>>>>>>>> But your script is much better because it supports multiple 
>>>>>>>>>>>>> fonts. The whole improvement you made is brilliant and very 
>>>>>>>>>>>>> useful. It is all working for me. 
>>>>>>>>>>>>> The only part that I didn't understand is the trick you used 
>>>>>>>>>>>>> in your tesseract_train.py script. You see, I have been doing 
>>>>>>>>>>>>> exactly what you did except for this script. 
>>>>>>>>>>>>>
>>>>>>>>>>>>> The script seems to have the trick of feeding each of the 
>>>>>>>>>>>>> fonts (iteratively) into the model. The script I have been 
>>>>>>>>>>>>> using (which I got from Garcia) doesn't mention fonts at all:
>>>>>>>>>>>>>
>>>>>>>>>>>>> TESSDATA_PREFIX=../tesseract/tessdata make training 
>>>>>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata 
>>>>>>>>>>>>> MAX_ITERATIONS=10000
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does it mean that my model doesn't train on the fonts (even if 
>>>>>>>>>>>>> the fonts were included in the splitting process, in the other 
>>>>>>>>>>>>> script)?
>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. This is the training script, which I have named 
>>>>>>>>>>>>>> 'tesseract_training.py' inside the tesstrain folder.
>>>>>>>>>>>>>> 2. The root directory means your main training folder, which 
>>>>>>>>>>>>>> contains the langdata, tesseract, and tesstrain folders. If 
>>>>>>>>>>>>>> you watch this tutorial 
>>>>>>>>>>>>>> https://www.youtube.com/watch?v=KE4xEzFGSU8 you will 
>>>>>>>>>>>>>> understand the folder structure better. I only created 
>>>>>>>>>>>>>> tesseract_training.py in the tesstrain folder for training; 
>>>>>>>>>>>>>> the FontList.py file is in the main path, alongside langdata, 
>>>>>>>>>>>>>> tesseract, tesstrain, and split_training_text.py.
>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your Linux 
>>>>>>>>>>>>>> fonts folder (/usr/share/fonts/), then run sudo apt update 
>>>>>>>>>>>>>> and then sudo fc-cache -fv.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After that, you have to add the exact font names to the 
>>>>>>>>>>>>>> FontList.py file, as I did.
>>>>>>>>>>>>>> I have attached two pictures of my folder structure: the 
>>>>>>>>>>>>>> first is the main structure and the second is the expanded 
>>>>>>>>>>>>>> tesstrain folder.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 
>>>>>>>>>>>>>> 2023-09-11 135014.png] 
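>>>>>>>>>>>>>> Before generating images, it can be worth confirming that the fonts were actually picked up; text2image has a --list_available_fonts flag for this. A small sketch (the font_available helper and the sample listing below are illustrative):

```python
import subprocess

def font_available(font_name, listing):
    # `listing` is the stdout of:
    #   text2image --list_available_fonts --fonts_dir=/usr/share/fonts
    return any(font_name in line for line in listing.splitlines())

# listing = subprocess.run(
#     ["text2image", "--list_available_fonts",
#      "--fonts_dir=/usr/share/fonts"],
#     capture_output=True, text=True).stdout
listing = "0: Sagar Medium\n1: Ekushey Lohit Normal"  # sample output
print(font_available("Sagar Medium", listing))
```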
>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you so much for putting out these brilliant scripts. 
>>>>>>>>>>>>>>> They make the process much more efficient.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have one more question about the other script that you use 
>>>>>>>>>>>>>>> to train. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in the 
>>>>>>>>>>>>>>> same/root directory?
>>>>>>>>>>>>>>> How do you set up the names of the fonts in that file, if 
>>>>>>>>>>>>>>> you don't mind sharing it?
>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can use the new script below; it's better than the 
>>>>>>>>>>>>>>>> previous two scripts. You can create the *tif, gt.txt, and 
>>>>>>>>>>>>>>>> .box files *with multiple fonts, and it also supports a 
>>>>>>>>>>>>>>>> checkpoint: if VS Code closes (or anything else interrupts) 
>>>>>>>>>>>>>>>> while creating the *tif, gt.txt, and .box files, *you can 
>>>>>>>>>>>>>>>> use the checkpoint to resume from where you stopped.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Script for creating the *tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then create a file called FontList.py in the root directory 
>>>>>>>>>>>>>>>> and paste the following into it:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then import it in the script above.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Command for resuming from a checkpoint:*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Change --start 0 --end 11 according to your checkpoint.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *And the training checkpoint you know already.*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi mdalihu, 
>>>>>>>>>>>>>>>>> the script you posted here seems much more extensive than 
>>>>>>>>>>>>>>>>> the one you posted before: 
>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have been using your earlier script. It is magical. How 
>>>>>>>>>>>>>>>>> is this one different from the earlier one?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have 
>>>>>>>>>>>>>>>>> saved me countless hours by running multiple fonts in one 
>>>>>>>>>>>>>>>>> sweep. I was not able to find any instructions on how to 
>>>>>>>>>>>>>>>>> train for multiple fonts, and the official manual is also 
>>>>>>>>>>>>>>>>> unclear. Your script helped me get started. 
>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK, I will try what you said.
>>>>>>>>>>>>>>>>>> One more thing: what should the training_text lines look 
>>>>>>>>>>>>>>>>>> like? I have seen that Bengali texts have long lines of 
>>>>>>>>>>>>>>>>>> words, so I want to know how many words or characters per 
>>>>>>>>>>>>>>>>>> line would be the better choice for training. And should 
>>>>>>>>>>>>>>>>>> '--xsize=3600', '--ysize=350' be set according to the 
>>>>>>>>>>>>>>>>>> number of words per line?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Include the default fonts also in your fine-tuning list 
>>>>>>>>>>>>>>>>>>> of fonts and see if that helps.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have trained some new fonts with the fine-tuning 
>>>>>>>>>>>>>>>>>>>> method for the Bengali language in Tesseract 5, using 
>>>>>>>>>>>>>>>>>>>> the official training_text, tessdata_best, and 
>>>>>>>>>>>>>>>>>>>> everything else. Everything is good, but the problem is 
>>>>>>>>>>>>>>>>>>>> that the default fonts that were trained before no 
>>>>>>>>>>>>>>>>>>>> longer convert text as well as they did, while my new 
>>>>>>>>>>>>>>>>>>>> fonts work well. I don't understand why this is 
>>>>>>>>>>>>>>>>>>>> happening. I'm sharing the code below to help 
>>>>>>>>>>>>>>>>>>>> understand what is going on.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each line
>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *And the training code:*
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/d8c16644-b52a-426c-86a6-b1e797f3e5a2n%40googlegroups.com.
