Have you faced the "Can't encode transcription" error? If so, how did you solve it?


On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com 
wrote:

> I was using my own text
>
> On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:
>
>> Are you training from Tesseract's default text data or your own collected text data?
>> On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com 
>> wrote:
>>
>>> I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.
>>>
>>> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 
>>> mdalihu...@gmail.com wrote:
>>>
>>>>
>>>> After Tesseract recognizes text from images, you can apply regex to replace the wrong words with the correct ones.
>>>> I'm not familiar with PaddleOCR or ScanTailor either.
>>>>
>>>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com 
>>>> wrote:
>>>>
>>>>> At what stage are you doing the regex replacement?
>>>>> My process has been: Scan (tif)--> ScanTailor --> Tesseract --> pdf
>>>>>
>>>>> >EasyOCR, I think, is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR.
>>>>>
>>>>> How about PaddleOCR? Are you familiar with it?
>>>>>
>>>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>>>> mdalihu...@gmail.com wrote:
>>>>>
>>>>>> I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regex replacements to fix those characters and words when they come out wrong.
>>>>>>
>>>>>> See what I have done:
>>>>>>
>>>>>>     " ী": "ী",
>>>>>>     " ্": " ",
>>>>>>     " ে": " ",
>>>>>>     "জ্া": "জা",
>>>>>>     "  ": " ",
>>>>>>     "   ": " ",
>>>>>>     "    ": " ",
>>>>>>     "্প": " ",
>>>>>>     " য": "র্য",
>>>>>>     "য": "য",
>>>>>>     " া": "া",
>>>>>>     "আা": "আ",
>>>>>>     "ম্ি": "মি",
>>>>>>     "স্ু": "সু",
>>>>>>     "হূ ": "হূ",
>>>>>>     " ণ": "ণ",
>>>>>>     "র্্": "র",
>>>>>>     "চিন্ত ": "চিন্তা ",
>>>>>>     "ন্া": "না",
>>>>>>     "সম ূর্ন": "সম্পূর্ণ",
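Applied in code, a map like this becomes a single post-processing pass over the OCR output. A minimal sketch in Python; the entries here are Latin placeholders standing in for the Bengali corrections above:

```python
# Sketch: apply a fixed replacement map to OCR output in one pass.
# These entries are illustrative placeholders; in practice they would be
# the character sequences Tesseract gets wrong for your fonts.
replacements = {
    "teh": "the",
    "  ": " ",   # collapse a double space
    "vv": "w",
}

def clean_ocr_text(text: str) -> str:
    # Apply longer patterns first so a short key never splits a longer one.
    for wrong in sorted(replacements, key=len, reverse=True):
        text = text.replace(wrong, replacements[wrong])
    return text

print(clean_ocr_text("teh  vvord"))  # -> "the word"
```

Replacement order matters when keys overlap, which is why the sketch applies longer keys first.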
>>>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 
>>>>>> desal...@gmail.com wrote:
>>>>>>
>>>>>>> The problem with regex is that Tesseract is not consistent in its replacements.
>>>>>>> Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it encounters /u/ in actual processing?
>>>>>>> In some cases, it replaces it with closely similar letters such as /v/ and /w/. In other cases, it removes it completely. That is what is happening in my case: those characters are sometimes completely removed; other times, they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.
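When one missing character surfaces as several different substitutes, one option is a regex alternation that folds every observed variant back to the intended character. A hedged sketch with invented Latin stand-ins for the real confusions; it only works where the variant characters cannot themselves occur legitimately, and it cannot recover characters that were dropped entirely:

```python
import re

# Sketch: normalize several observed OCR substitutions back to one intended
# character. The variant lists are invented stand-ins; a real mapping would
# come from comparing OCR output against ground-truth text.
VARIANTS = {
    "u": ["v", "w"],   # OCR sometimes emits v or w where u was intended
}

def normalize(text: str) -> str:
    for target, seen in VARIANTS.items():
        # Build one alternation per target, e.g. (v|w) -> u
        pattern = "|".join(re.escape(v) for v in seen)
        text = re.sub(pattern, target, text)
    return text

print(normalize("wnder the vmbrella"))  # -> "under the umbrella"
```

The deletion case has no pattern to match, which is exactly why inconsistency makes pure regex post-processing so hard.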
>>>>>>>
>>>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>
>>>>>>>> If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application. After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can solve some major problems.
>>>>>>>>
>>>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> The characters are getting missed, even after fine-tuning.
>>>>>>>>> I never made any progress. I tried many different ways, but some specific characters are always missing from the OCR result.
>>>>>>>>>
>>>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> EasyOCR, I think, is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.
>>>>>>>>>>
>>>>>>>>>> I have added dictionary words, but the result is the same.
>>>>>>>>>>
>>>>>>>>>> What kind of problem did you face when fine-tuning with a few new characters, as you said (*but I failed in every possible way to introduce a few new characters into the database*)?
>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 
>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.
>>>>>>>>>>>
>>>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
>>>>>>>>>>>
>>>>>>>>>>> Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.
>>>>>>>>>>>
>>>>>>>>>>> If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How is your training going for Bengali? Mine was nearly good, but I faced spacing problems between words: some words have spaces, but most of them have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract that is used for ben, which is why I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can explore different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.
>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>>>> I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, he doesn't mention all the fonts, only one font. That's why he didn't use *MODEL_NAME* in a separate script file, I think.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Actually, here we feed in all the *tif, gt.txt, and .box files* that were created under *MODEL_NAME* (I mean the *eng, ben, or oro* language code), because when we first create the *tif, gt.txt, and .box files*, every file name starts with *MODEL_NAME*. This *MODEL_NAME* is what we select in the training script, which loops over each tif, gt.txt, and .box file created under it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.
>>>>>>>>>>>>>>> The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?
>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. This is the training command, in a file I have named 'tesseract_training.py' inside the tesstrain folder.
>>>>>>>>>>>>>>>> 2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file sits in the main path alongside langdata, tesseract, tesstrain, and split_training_text.py.
>>>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your Linux fonts folder (/usr/share/fonts/), then run sudo apt update followed by sudo fc-cache -fv.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After that, you have to add the exact font names to the FontList.py file as I did.
>>>>>>>>>>>>>>>> I have attached two pictures of my folder structure: the first is the main structure and the second is the collapsed tesstrain folder.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]
>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you so much for putting out these brilliant scripts. They make the process much more efficient.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have one more question on the other script that you use 
>>>>>>>>>>>>>>>>> to train. 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in the same/root directory?
>>>>>>>>>>>>>>>>> How do you set up the names of the fonts in the file, if you don't mind sharing it?
>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can use the new script below; it's better than the previous two scripts. You can create the *tif, gt.txt, and .box files* with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while creating the *tif, gt.txt, and .box files*, you can use the checkpoint to resume from where VS Code closed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Command for the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>>>         # Spaces are not safe in file names, so replace them once per font
>>>>>>>>>>>>>>>>>>         safe_font_name = font_name.replace(' ', '_')
>>>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>>>             line_serial = f'{line_index:d}'
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{safe_font_name}'
>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{file_base_name}.gt.txt')
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then create a file called FontList.py in the root directory and paste this into it:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then it is imported in the code above.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *For the breakpoint, the command is:*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Change the checkpoint values (--start 0 --end 11) as needed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *And the training checkpoint works as you already know.*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi mhalidu,
>>>>>>>>>>>>>>>>>>> the script you posted here seems much more extensive than the one you posted before:
>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have been using your earlier script. It is magical. How is this one different from the earlier one?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.
>>>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>>>>>>>>> One more thing: what should the lines of the training text look like? I have seen that Bengali text has long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?
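One way to sanity-check --xsize before rendering is to measure the longest line of the training text and compare it against a rough pixels-per-character estimate. A small sketch; PIXELS_PER_CHAR is an assumed figure that varies with the font and point size:

```python
# Sketch: estimate whether text2image's --xsize will fit the longest line.
# PIXELS_PER_CHAR is a rough assumption; measure your own font to calibrate.
PIXELS_PER_CHAR = 20

def max_line_chars(lines):
    # Longest line length in characters, ignoring trailing newlines.
    return max((len(line.rstrip("\n")) for line in lines), default=0)

def fits_xsize(lines, xsize=3600, pixels_per_char=PIXELS_PER_CHAR):
    # True when the widest rendered line should fit within xsize pixels.
    return max_line_chars(lines) * pixels_per_char <= xsize

lines = ["short line", "a" * 150]
print(max_line_chars(lines))  # -> 150
print(fits_xsize(lines))      # 150 * 20 = 3000 <= 3600 -> True
```

Lines that overflow the image are silently cut off, so erring on the side of shorter training lines is the safer choice.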
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Include the default fonts also in your fine-tuning 
>>>>>>>>>>>>>>>>>>>>> list of fonts and see if that helps.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have fine-tuned some new fonts for the Bengali language in Tesseract 5, using the official training text, tessdata_best, and everything else. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as they did, while my new fonts work well. I don't understand why this is happening. I'm sharing the code so you can see what is going on.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>>>>>>>             line_serial = f'{line_count:d}'
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>             # Image filename; line_count keeps it unique across fonts
>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> *And the training code:*
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Any suggestions to help identify the problem?
>>>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>> You received this message because you are subscribed 
>>>>>>>>>>>>>>>>>>>>>> to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving 
>>>>>>>>>>>>>>>>>>>>>> emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>>
> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/d8c16644-b52a-426c-86a6-b1e797f3e5a2n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/d8c16644-b52a-426c-86a6-b1e797f3e5a2n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
