I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihussain...@gmail.com> wrote:
Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September 2023 at 12:19:53 am UTC+6, desal...@gmail.com wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, mdalihu...@gmail.com wrote:

After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September 2023 at 5:06:12 pm UTC+6, desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently misrecognized by Tesseract, so I use regex to replace those characters and words when they come out wrong.

Here is what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    "জ্া": "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    "য": "য",
    " া": "া",
    "আা": "আ",
    "ম্ি": "মি",
    "স্ু": "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    "র্্": "র",
    "চিন্ত ": "চিন্তা ",
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it meets /u/ in actual processing? In some cases it replaces it with a closely similar letter such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application. After OCR, those specific characters or words are replaced by the correct characters or words that you define with regular expressions. It can solve some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, desal...@gmail.com wrote:

The characters are still getting missed, even after fine-tuning. I never made any progress, and I tried many different ways. Some specific characters are always missing from the OCR result.
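A post-OCR replacement pass like the one described above can be sketched in a few lines of Python. The pairs below are hypothetical Latin-script placeholders; in practice the map would hold the Bengali wrong-to-correct pairs listed in the thread.

```python
# Sketch of the post-OCR replacement step discussed above.
# The pairs here are hypothetical placeholders; in practice the map
# would contain the Bengali wrong -> correct pairs from the thread.
REPLACEMENTS = {
    "teh": "the",
    "recieve": "receive",
}

def fix_ocr_text(text, replacements):
    """Apply each wrong -> correct substitution to the OCR output."""
    for wrong, correct in replacements.items():
        text = text.replace(wrong, correct)
    return text

print(fix_ocr_text("teh dog will recieve mail", REPLACEMENTS))
# -> the dog will receive mail
```

Because the thread notes that Tesseract's mistakes are inconsistent, a fixed map like this only catches the stable errors; anything that varies by context would need real regular expressions (the `re` module) with some anchoring around the pattern.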
On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning the few new characters, as you said ("But, I failed in every possible way to introduce a few new characters into the database.")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow; the video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably the next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
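The 0.46 and 0.51 figures quoted in this thread are training character error rates; a quick way to measure real accuracy on a page is to compare the OCR output against a hand-corrected transcript. A minimal character-error-rate sketch (my own helper, not part of Tesseract or tesstrain):

```python
# Minimal character error rate (CER) check for OCR output.
# CER = edit_distance(reference, hypothesis) / len(reference).
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against `reference`."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars -> 0.5
```

Running this on a few real page lines gives a more honest number than the training error, since a model can reach a low training error and still generalize poorly, as reported above.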
On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get a space, but most of them get none. I think the problem is in the dataset, but I used the default Tesseract training dataset for Bengali, which confuses me, so I have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch really is harder than fine-tuning, so you can experiment with different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts, only one font. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (i.e. the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. The MODEL_NAME we select in the training script is then used to loop over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Garcia's video, but your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me. The only part I didn't understand is the trick you used in your tesseract_train.py script; you see, I have been doing exactly what you did except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

    import subprocess

    # List of model names to train
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question, about the other script, the one you use to train:

    import subprocess

    # List of model names to train
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
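The `make training` invocation that recurs throughout this thread can be wrapped in a small helper so that MODEL_NAME, START_MODEL, and MAX_ITERATIONS stay in one place. This is a sketch of my own, not code from the thread; the variable names follow the tesstrain Makefile as used above, and `run_training` is a hypothetical wrapper.

```python
import subprocess

def build_training_command(model_name,
                           start_model=None,
                           max_iterations=10000,
                           tessdata="../tesseract/tessdata"):
    """Assemble the tesstrain `make training` command line used in this thread."""
    parts = [
        f"TESSDATA_PREFIX={tessdata}",
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    if start_model is not None:  # omitted when training from scratch
        parts.append(f"START_MODEL={start_model}")
    return " ".join(parts)

def run_training(model_name, **kwargs):
    """Hypothetical wrapper that actually launches the training run."""
    subprocess.run(build_training_command(model_name, **kwargs),
                   shell=True, check=True)

print(build_training_command("ben", start_model="ben"))
```

Building the command string in one place also makes it easy to add extra make variables, such as the LANG_TYPE=Indic flag used later in this thread for Bengali, without editing several scripts.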
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where you stopped.

Script for creating the tif, gt.txt, and .box files:

    import os
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory,
                             start_line=None, end_line=None):
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0
        if end_line is None:
            end_line = len(lines) - 1

        training_text_file_name = pathlib.Path(training_text_file).stem

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()
                line_serial = f"{line_index:d}"
                file_base_name = (f'{training_text_file_name}_{line_serial}_'
                                  f'{font_name.replace(" ", "_")}')

                # Ground-truth text file for this line
                line_gt_text = os.path.join(output_directory,
                                            f'{file_base_name}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Render the matching tif/box pair with text2image
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int,
                            help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int,
                            help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()
        create_training_data(training_text_file, font_list, output_directory,
                             args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste the following into it.
    class FontList:
        def __init__(self):
            self.fonts = [
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

Then import it in the script above.

Command for resuming from a breakpoint:

    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint values (--start 0 --end 11) as needed. The training checkpoint works as you already know.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mdalihu,

The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try as you said. One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line is the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts in your fine-tuning list of fonts as well, and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. It all works, but the problem is that the default fonts, which were trained before, no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.
Code for creating the tif, gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def read_line_count():
        # Resume point saved by a previous run
        if os.path.exists('line_count.txt'):
            with open('line_count.txt', 'r') as file:
                return int(file.read())
        return 0

    def write_line_count(line_count):
        with open('line_count.txt', 'w') as file:
            file.write(str(line_count))

    def create_training_data(training_text_file, font_list, output_directory,
                             start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            for line in input_file.readlines():
                lines.append(line.strip())

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        random.shuffle(lines)

        if start_line is None:
            line_count = read_line_count()  # resume from the saved line count
        else:
            line_count = start_line

        if end_line is None:
            end_line_count = len(lines) - 1
        else:
            end_line_count = min(end_line, len(lines) - 1)

        training_text_file_name = pathlib.Path(training_text_file).stem

        for font in font_list.fonts:  # iterate through all the fonts in the list
            for line in lines[:end_line_count + 1]:
                line_serial = f"{line_count:d}"  # unique serial for each line

                # GT (ground truth) text file
                line_gt_text = os.path.join(
                    output_directory,
                    f'{training_text_file_name}_{line_serial}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Image file (the serial keeps each line/font pair unique)
                file_base_name = f'ben_{line_serial}'
                subprocess.run([
                    'text2image',
                    f'--font={font}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=350',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/ben.unicharset',
                ])

                line_count += 1

        write_line_count(line_count)  # save the count so the next run can resume

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int,
                            help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int,
                            help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/ben.training_text'
        output_directory = 'tesstrain/data/ben-ground-truth'

        # Create an instance of the FontList class
        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory,
                             args.start, args.end)

And the training code:

    import subprocess

    # List of model names to train (passed to MODEL_NAME)
    font_names = ['ben']

    for font in font_names:
        command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
                   f"MODEL_NAME={font} START_MODEL=ben "
                   f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
                   f"LANG_TYPE=Indic")
        subprocess.run(command, shell=True)

Any suggestions for identifying the problem? Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com