Just saw this paper: https://osf.io/b8h7q
On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:

> I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:

> I also faced that issue on Windows. Apparently the issue is related to Unicode; you can try your luck by changing "r" so that the script opens files with UTF-8 encoding. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:

> Did you face this error, "Can't encode transcription"? If so, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:

> I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:

> Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:

> I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:

> After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:

> At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf
>
> > EasyOCR I think is best for ID cards or similar images, but for document images like books Tesseract is better than EasyOCR.
>
> How about PaddleOCR? Are you familiar with it?
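The "change 'r' to utf8" suggestion above boils down to passing an explicit encoding to Python's open(): on Windows the default is the locale code page (often cp1252), which cannot handle Bengali text and triggers errors like "Can't encode transcription". A minimal sketch, with an illustrative file path:

```python
import os
import tempfile

# A sample Bengali ground-truth line; the path below is illustrative.
line = "সম্পূর্ণ"
path = os.path.join(tempfile.mkdtemp(), "ben.gt.txt")

# Write and read with an explicit encoding instead of relying on the
# platform default (the platform default is what breaks on Windows).
with open(path, "w", encoding="utf-8") as f:
    f.write(line)

with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(restored == line)  # True
```

The same `encoding="utf-8"` argument applies to every open() call in the training scripts quoted later in this thread.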
On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:

> I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regex rules to replace those characters and words when they come out wrong.
>
> Here is what I have done:
>
> " ী": "ী",
> " ্": " ",
> " ে": " ",
> "জ্া": "জা",
> " ": " ",
> " ": " ",
> " ": " ",
> "্প": " ",
> " য": "র্য",
> "য": "য",
> " া": "া",
> "আা": "আ",
> "ম্ি": "মি",
> "স্ু": "সু",
> "হূ ": "হূ",
> " ণ": "ণ",
> "র্্": "র",
> "চিন্ত ": "চিন্তা ",
> "ন্া": "না",
> "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:

> The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes completely removed, and other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:

> If some specific characters or words are always missing from the OCR result,
> then you can apply logic with regular expressions in your application: after OCR, those specific characters or words are replaced by the correct ones that you defined. That can solve some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:

> The characters are getting missed even after fine-tuning. I never made any progress, though I tried many different approaches. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:

> EasyOCR, I think, is best for ID cards and similar images, but for document images like books Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.
>
> I have added dictionary words, but the result is the same.
>
> What kind of problem did you face when fine-tuning for a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

> Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.
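The regex post-processing discussed above can be sketched as a single substitution pass over the OCR output. The correction table here reuses two entries from the mapping posted earlier in the thread and is purely illustrative:

```python
import re

# Illustrative correction table: each key is a sequence Tesseract
# consistently gets wrong, each value the intended replacement.
REPLACEMENTS = {
    "আা": "আ",
    "সম ূর্ন": "সম্পূর্ণ",
}

# One compiled alternation, longest keys first, so multi-character
# sequences are matched before any shorter overlapping key.
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True)))

def post_correct(text: str) -> str:
    # Replace every occurrence of a known-bad sequence in one pass.
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(0)], text)
```

As noted in the thread, this only helps for errors Tesseract makes consistently; when a character is sometimes dropped and sometimes substituted, no fixed table can recover it.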
> Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
>
> Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.
>
> If all this fails, EasyOCR (and probably other similar open-source packages) is our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

> > How is your training going for Bengali?
>
> It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have none. I think the problem is in the dataset, but I used the default Tesseract training dataset for Bengali (ben), so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can experiment with different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

> How is your training going for Bengali?
> I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

> Yes, he mentions only one font, not all fonts; that is why he didn't use MODEL_NAME in a separate script file, I think.
>
> Actually, here we train on all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

> Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts.
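A note on the gap reported above between the training error rate (0.46–0.51) and the "absolutely terrible" real results: the figure printed during training is not a measure of accuracy on real scanned pages. A quick way to put a number on the latter is to compute the character error rate (edit distance over reference length) between a hand-checked transcript and the OCR output of a held-out image. A minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edits needed, normalised by reference length.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

Tracking CER on a few held-out pages after each training run gives a comparable number, independently of what the training log reports.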
> The whole improvement you made is brilliant and very useful. It is all working for me.
>
> The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.
>
> The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
>
> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>
> Does this mean that my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

> import subprocess
>
> # List of font names
> font_names = ['ben']
>
> for font in font_names:
>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>     subprocess.run(command, shell=True)
>
> 1. This is the training command, in the file I have named 'tesseract_training.py' inside the tesstrain folder.
> 2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file sits in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
> 3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.
>
> After that, you have to add the exact font names to the FontList.py file, as I did.
>
> I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.
>
> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

> Thank you so much for putting out these brilliant scripts. They make the process much more efficient.
>
> I have one more question about the other script that you use to train.
> import subprocess
>
> # List of font names
> font_names = ['ben']
>
> for font in font_names:
>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>     subprocess.run(command, shell=True)
>
> Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the font names in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

> You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.
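The breakpoint/checkpoint idea described above comes down to persisting the index of the last processed line so an interrupted run can pick up where it stopped. A stripped-down sketch (the checkpoint file name and the stand-in processing step are illustrative):

```python
import os

COUNT_FILE = "line_count.txt"  # illustrative checkpoint file

def read_line_count(path=COUNT_FILE):
    # Return the last saved line index, or 0 on a fresh run.
    if os.path.exists(path):
        with open(path, "r") as f:
            return int(f.read())
    return 0

def write_line_count(n, path=COUNT_FILE):
    # Persist progress so an interrupted run can resume.
    with open(path, "w") as f:
        f.write(str(n))

lines = [f"line {i}" for i in range(100)]  # stand-in for the training text
done = read_line_count()                   # skip work already completed
for i in range(done, len(lines)):
    # ... generate the tif/gt.txt/.box files for lines[i] here ...
    write_line_count(i + 1)                # checkpoint after each line
```

Writing the counter after every line keeps the checkpoint at most one line stale, at the cost of one small file write per iteration.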
> The script for creating the tif, gt.txt, and .box files:
>
> import os
> import random
> import pathlib
> import subprocess
> import argparse
> from FontList import FontList
>
> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>     with open(training_text_file, 'r') as input_file:
>         lines = input_file.readlines()
>
>     if not os.path.exists(output_directory):
>         os.mkdir(output_directory)
>
>     if start_line is None:
>         start_line = 0
>
>     if end_line is None:
>         end_line = len(lines) - 1
>
>     for font_name in font_list.fonts:
>         for line_index in range(start_line, end_line + 1):
>             line = lines[line_index].strip()
>
>             training_text_file_name = pathlib.Path(training_text_file).stem
>             line_serial = f"{line_index:d}"
>
>             line_gt_text = os.path.join(
>                 output_directory,
>                 f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>             with open(line_gt_text, 'w') as output_file:
>                 output_file.writelines([line])
>
>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>             subprocess.run([
>                 'text2image',
>                 f'--font={font_name}',
>                 f'--text={line_gt_text}',
>                 f'--outputbase={output_directory}/{file_base_name}',
>                 '--max_pages=1',
>                 '--strip_unrenderable_words',
>                 '--leading=36',
>                 '--xsize=3600',
>                 '--ysize=330',
>                 '--char_spacing=1.0',
>                 '--exposure=0',
>                 '--unicharset_file=langdata/eng.unicharset',
>             ])
>
> if __name__ == "__main__":
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>     args = parser.parse_args()
>
>     training_text_file = 'langdata/eng.training_text'
>     output_directory = 'tesstrain/data/eng-ground-truth'
>
>     font_list = FontList()
>
>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>
> Then create a file called "FontList.py" in the root directory and paste the following into it.
> class FontList:
>     def __init__(self):
>         self.fonts = [
>             "Gerlick",
>             "Sagar Medium",
>             "Ekushey Lohit Normal",
>             "Charukola Round Head Regular, weight=433",
>             "Charukola Round Head Bold, weight=443",
>             "Ador Orjoma Unicode",
>         ]
>
> Then it is imported in the script above.
>
> The breakpoint command:
>
> sudo python3 split_training_text.py --start 0 --end 11
>
> Change the checkpoint range (--start 0 --end 11) as needed. The training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

> Hi mdalihu,
>
> The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>
> I have been using your earlier script. It is magical. How is this one different from the earlier one?
>
> Thank you for posting these scripts, by the way.
> It has saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

> OK, I will try as you said.
>
> One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

> Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

> I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. Everything is good, but the problem is that the default fonts, which were trained before, no longer convert text as well as previously, while my new fonts work well.
> I don't understand why it's happening. I'm sharing the code below to help show what is going on.
>
> Code for creating the tif, gt.txt, and .box files:
>
> import os
> import random
> import pathlib
> import subprocess
> import argparse
> from FontList import FontList
>
> def read_line_count():
>     if os.path.exists('line_count.txt'):
>         with open('line_count.txt', 'r') as file:
>             return int(file.read())
>     return 0
>
> def write_line_count(line_count):
>     with open('line_count.txt', 'w') as file:
>         file.write(str(line_count))
>
> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>     lines = []
>     with open(training_text_file, 'r') as input_file:
>         for line in input_file.readlines():
>             lines.append(line.strip())
>
>     if not os.path.exists(output_directory):
>         os.mkdir(output_directory)
>
>     random.shuffle(lines)
>
>     if start_line is None:
>         line_count = read_line_count()  # Resume from the line_count saved in the file
>     else:
>         line_count = start_line
>
>     if end_line is None:
>         end_line_count = len(lines) - 1  # Default ending line_count
>     else:
>         end_line_count = min(end_line, len(lines) - 1)
>
>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>         font_serial = 1
>         for line in lines:
>             training_text_file_name = pathlib.Path(training_text_file).stem
>
>             # Generate a unique serial number for each line
>             line_serial = f"{line_count:d}"
>
>             # GT (Ground Truth) text filename
>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>             with open(line_gt_text, 'w') as output_file:
>                 output_file.writelines([line])
>
>             # Image filename (unique for each font)
>             file_base_name = f'ben_{line_serial}'
>             subprocess.run([
>                 'text2image',
>                 f'--font={font}',
>                 f'--text={line_gt_text}',
>                 f'--outputbase={output_directory}/{file_base_name}',
>                 '--max_pages=1',
>                 '--strip_unrenderable_words',
>                 '--leading=36',
>                 '--xsize=3600',
>                 '--ysize=350',
>                 '--char_spacing=1.0',
>                 '--exposure=0',
>                 '--unicharset_file=langdata/ben.unicharset',
>             ])
>
>             line_count += 1
>             font_serial += 1
>
>         # Reset font_serial for the next font iteration
>         font_serial = 1
>
>     write_line_count(line_count)  # Update the line_count in the file
>
> if __name__ == "__main__":
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>     args = parser.parse_args()
>
>     training_text_file = 'langdata/ben.training_text'
>     output_directory = 'tesstrain/data/ben-ground-truth'
>
>     # Create an instance of the FontList class
>     font_list = FontList()
>
>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>
> And the training code:
>
> import subprocess
>
> # List of font names
> font_names = ['ben']
>
> for font in font_names:
>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>     subprocess.run(command, shell=True)
>
> Any suggestions to help pin down the problem? Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com