I now get to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.
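For context, the post-OCR substitution approach discussed further down this thread can be sketched in a few lines of Python. This is only a minimal sketch: the two mapping entries are taken from the correction table posted later in the thread, and a real table would be much larger.

```python
# Minimal sketch of the post-OCR replacement approach discussed in this
# thread: map misrecognized sequences to their corrections and apply
# them to Tesseract's output. The two entries below come from the
# mapping posted later in the thread; the rest is project-specific.
corrections = {
    "চিন্ত ": "চিন্তা ",
    "সম ূর্ন": "সম্পূর্ণ",
}

def fix_ocr_text(text: str) -> str:
    # Longest keys first, so a short substring never clobbers a longer match.
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text
```

As the discussion below notes, this only works when Tesseract's mistakes are consistent; when the same character is sometimes dropped and sometimes substituted, a fixed table cannot catch every case.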
On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:

After Tesseract recognizes text from images, you can then apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently misrecognized by Tesseract, so I use these regex rules to replace those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacement. Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it meets /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case. Those characters are sometimes completely removed; other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application: after OCR, those specific characters or words are replaced by the correct characters or words you defined. That can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:

The characters are getting missed even after fine-tuning. I never made any progress, and I tried many different ways. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:

EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning the few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata.
But I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries, actually. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have no space. I think the problem is in the dataset, but I used the default Tesseract training dataset for Bengali (ben), so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can explore different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

Yes, he doesn't mention all the fonts, only one font; that is why I think he didn't use MODEL_NAME in a separate script file.

Actually, here we teach all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. That MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.
The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names in the FontList.py file, like me.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[Screenshots: Screenshot 2023-09-11 134947.png, Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while creating the tif, gt.txt, and .box files, you can use the checkpoint to resume from where you stopped.

Script for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this in (note that every font name needs a trailing comma):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the script above.

Breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

Hi mdalihu,
the script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.
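One practical pitfall with the FontList.py approach above: if a listed font name doesn't exactly match what fontconfig reports, text2image may fall back to a different face without an obvious error. A small helper can cross-check the list. This is a sketch with assumptions: text2image is on PATH, and its --list_available_fonts output is one "index: font name" line per font (the exact format may vary across Tesseract versions).

```python
# Sketch: cross-check FontList.py entries against the fonts that
# text2image can actually see. Assumptions: text2image is on PATH and
# --list_available_fonts prints one "index: font name" line per font.
import subprocess

def parse_font_list(output: str) -> list:
    """Extract font names from "index: name" lines."""
    names = []
    for line in output.splitlines():
        if ":" in line:
            names.append(line.split(":", 1)[1].strip())
    return names

def missing_fonts(wanted, fonts_dir="/usr/share/fonts"):
    """Return the names in `wanted` that text2image does not report."""
    result = subprocess.run(
        ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"],
        capture_output=True, text=True,
    )
    available = set(parse_font_list(result.stdout))
    return [name for name in wanted if name not in available]
```

Running missing_fonts(FontList().fonts) before a long generation run would flag any name that needs correcting.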
On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line is the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and the other pieces as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as they did, while my new fonts work well. I don't understand why this is happening. I am sharing the code to help figure out what is going on.
Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Resume from the saved line count
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        for line in lines[:end_line_count + 1]:  # honor the --end bound
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (ground truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename, unique for each line/font pair
            file_base_name = f'ben_{line_serial}'
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1

    write_line_count(line_count)  # Save the line_count for resuming later

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
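The complaints in this thread ("absolutely terrible", "does not convert text like before") are hard to compare without a number. A rough character error rate between a ground-truth line and the OCR output can be computed with only the standard library; this is a minimal sketch, not part of the thread's scripts, and a proper evaluation would use tesstrain's lstmeval on a held-out eval set.

```python
# Minimal sketch: rough character error rate (CER) between a ground-truth
# string and OCR output, using difflib from the standard library.
import difflib

def char_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Return 1 - (matched characters / ground-truth length)."""
    matcher = difflib.SequenceMatcher(None, ground_truth, ocr_output)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(ground_truth), 1)
```

Comparing the CER of the base ben.traineddata against a fine-tuned checkpoint on the same page images would make the regression described in the original post measurable.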