Not a good result. That's why I have stopped training for now. The default 
traineddata is better overall than training from scratch.
On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com wrote:

> Hi Ali, 
> How is your training going?
> Did you get good results with training from scratch?
>
> On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:
>
>> Yes, I saw that two months ago when I started to learn OCR. It was very 
>> helpful at the beginning.
>> On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com 
>> wrote:
>>
>>> Just saw this paper: https://osf.io/b8h7q
>>>
>>> On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com 
>>> wrote:
>>>
>>>> I will try some changes. thx
>>>>
>>>> On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com 
>>>> wrote:
>>>>
>>>>> I also faced that issue on Windows. Apparently, the issue is 
>>>>> related to Unicode. You can try your luck by adding encoding="utf8" 
>>>>> where the script opens files in "r" mode.
>>>>> I ended up installing Ubuntu because I was having too many errors on 
>>>>> Windows.
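A minimal sketch of the suggested fix (the helper name is hypothetical; the point is the explicit encoding= argument, which avoids the platform-default codec on Windows that cannot represent Bengali text):

```python
# Reading ground-truth text with an explicit UTF-8 encoding.
# Without encoding="utf-8", Python falls back to the platform default
# (e.g. cp1252 on Windows), which fails on non-Latin scripts and can
# surface as "Can't encode transcription" during training-data prep.
def read_training_lines(path):
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]
```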
>>>>>
>>>>> On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Did you face this error: "Can't encode transcription"? If you did, how 
>>>>>> did you solve it?
>>>>>>
>>>>>> On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 
>>>>>> elvi...@gmail.com wrote:
>>>>>>
>>>>>>> I was using my own text
>>>>>>>
>>>>>>> On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Are you training from Tesseract's default text data or your own 
>>>>>>>> collected text data?
>>>>>>>> On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 
>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> I have now reached 200,000 iterations, and the error rate is stuck at 
>>>>>>>>> 0.46. The result is absolute trash: nowhere close to the default/Ray's 
>>>>>>>>> training.
>>>>>>>>>
>>>>>>>>> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 
>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> After Tesseract recognizes text from the images, you can apply 
>>>>>>>>>> regex to replace the wrong words with the correct ones.
>>>>>>>>>> I'm not familiar with PaddleOCR or ScanTailor either.
>>>>>>>>>>
>>>>>>>>>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 
>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> At what stage are you doing the regex replacement?
>>>>>>>>>>> My process has been: Scan (tif)--> ScanTailor --> Tesseract --> 
>>>>>>>>>>> pdf
>>>>>>>>>>>
>>>>>>>>>>> >EasyOCR I think is best for ID cards or something like that 
>>>>>>>>>>> image process. but document images like books, here Tesseract is 
>>>>>>>>>>> better 
>>>>>>>>>>> than EasyOCR.
>>>>>>>>>>>
>>>>>>>>>>> How about PaddleOCR? Are you familiar with it?
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I know what you mean, but in some cases it helps me. I have 
>>>>>>>>>>>> found that specific characters and words are consistently 
>>>>>>>>>>>> misrecognized by Tesseract. So I use these regex rules to 
>>>>>>>>>>>> replace those characters and words when they come out wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> See what I have done:
>>>>>>>>>>>>
>>>>>>>>>>>>     " ী": "ী",
>>>>>>>>>>>>     " ্": " ",
>>>>>>>>>>>>     " ে": " ",
>>>>>>>>>>>>     "জ্া": "জা",
>>>>>>>>>>>>     "  ": " ",
>>>>>>>>>>>>     "   ": " ",
>>>>>>>>>>>>     "    ": " ",
>>>>>>>>>>>>     "্প": " ",
>>>>>>>>>>>>     " য": "র্য",
>>>>>>>>>>>>     "য": "য",
>>>>>>>>>>>>     " া": "া",
>>>>>>>>>>>>     "আা": "আ",
>>>>>>>>>>>>     "ম্ি": "মি",
>>>>>>>>>>>>     "স্ু": "সু",
>>>>>>>>>>>>     "হূ ": "হূ",
>>>>>>>>>>>>     " ণ": "ণ",
>>>>>>>>>>>>     "র্্": "র",
>>>>>>>>>>>>     "চিন্ত ": "চিন্তা ",
>>>>>>>>>>>>     "ন্া": "না",
>>>>>>>>>>>>     "সম ূর্ন": "সম্পূর্ণ",
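A sketch of how such a substitution table can be applied to Tesseract's output (the table below is a small illustrative subset, and the function name is my own; applying longer patterns first keeps a broad single-character rule from clobbering a multi-character fix):

```python
# Post-OCR cleanup: plain string substitution over the recognized text.
# Longer patterns are applied first so multi-character corrections run
# before single-character ones.
REPLACEMENTS = {
    "আা": "আ",   # collapse a duplicated vowel sign
    "   ": " ",   # squeeze triple spaces
    "  ": " ",    # squeeze double spaces
}

def clean_ocr_text(text, table=REPLACEMENTS):
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text
```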
>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 
>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The problem for regex is that Tesseract is not consistent in 
>>>>>>>>>>>>> its replacement. 
>>>>>>>>>>>>> Suppose the original English training data doesn't contain 
>>>>>>>>>>>>> the letter /u/. What does Tesseract do when it faces /u/ in actual 
>>>>>>>>>>>>> processing?
>>>>>>>>>>>>> In some cases, it replaces it with closely similar letters 
>>>>>>>>>>>>> such as /v/ and /w/. In other cases, it completely removes it. 
>>>>>>>>>>>>> That is what is happening in my case. Those characters are sometimes 
>>>>>>>>>>>>> completely removed; other times, they are replaced by closely 
>>>>>>>>>>>>> resembling characters. Because of this inconsistency, applying 
>>>>>>>>>>>>> regex is very difficult.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If some specific characters or words are always missing 
>>>>>>>>>>>>>> from the OCR result, then you can apply regular-expression 
>>>>>>>>>>>>>> logic in your application. After OCR, those specific 
>>>>>>>>>>>>>> characters or words will be replaced by the correct characters 
>>>>>>>>>>>>>> or words that you defined in your application with regular 
>>>>>>>>>>>>>> expressions. That can fix some major problems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The characters are still getting missed, even after fine-tuning. 
>>>>>>>>>>>>>>> I never made any progress. I tried many different 
>>>>>>>>>>>>>>> ways. Some specific characters are always missing from the OCR 
>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think EasyOCR is best for ID cards and that kind of 
>>>>>>>>>>>>>>>> image processing, but for document images like books, 
>>>>>>>>>>>>>>>> Tesseract is better than EasyOCR. I haven't used EasyOCR 
>>>>>>>>>>>>>>>> myself, though; you can try it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have added dictionary words, but the result is the same.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What kind of problem did you face when fine-tuning with a few 
>>>>>>>>>>>>>>>> new characters, as you said (*but, I failed in every 
>>>>>>>>>>>>>>>> possible way to introduce a few new characters into the 
>>>>>>>>>>>>>>>> database*)?
>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 
>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, we are new to this. I find the instructions (the 
>>>>>>>>>>>>>>>>> manual) very hard to follow. The video you linked above was 
>>>>>>>>>>>>>>>>> really helpful  
>>>>>>>>>>>>>>>>> to get started. My plan at the beginning was to fine-tune the 
>>>>>>>>>>>>>>>>> existing .traineddata. But I failed in every possible way to 
>>>>>>>>>>>>>>>>> introduce a few new characters into the database. That is why 
>>>>>>>>>>>>>>>>> I started from scratch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more 
>>>>>>>>>>>>>>>>> iterations and see if I can improve.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Another area we need to explore is the use of dictionaries. 
>>>>>>>>>>>>>>>>> Maybe adding millions of words to the dictionary could help 
>>>>>>>>>>>>>>>>> Tesseract. I don't have millions of words, but I am looking 
>>>>>>>>>>>>>>>>> into some corpora to get more words into the dictionary.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If this all fails, EasyOCR (and probably other similar 
>>>>>>>>>>>>>>>>> open-source packages) is probably our next option to try. 
>>>>>>>>>>>>>>>>> Sure, sharing our experiences will be helpful. I will let 
>>>>>>>>>>>>>>>>> you know if I make good progress with any of these options.
>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> How is your training going for Bengali? Mine was nearly 
>>>>>>>>>>>>>>>>>> good, but I faced spacing problems between words: some 
>>>>>>>>>>>>>>>>>> words get spaces, but most of them have none. I think the 
>>>>>>>>>>>>>>>>>> problem is in the dataset, but I used the default training 
>>>>>>>>>>>>>>>>>> dataset from Tesseract, the one used for ben, so I am 
>>>>>>>>>>>>>>>>>> confused and have to explore more. By the way, you can try 
>>>>>>>>>>>>>>>>>> what Lorenzo Blz said. Training from scratch is actually 
>>>>>>>>>>>>>>>>>> harder than fine-tuning, so you can try different datasets 
>>>>>>>>>>>>>>>>>> to explore. If you succeed, please let me know how you did 
>>>>>>>>>>>>>>>>>> the whole process. I'm also new to this field.
>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>>>>>>>>>> I have been trying to train from scratch. I made about 
>>>>>>>>>>>>>>>>>>> 64,000 lines of text (which produced about 255,000 files, 
>>>>>>>>>>>>>>>>>>> in the end) and 
>>>>>>>>>>>>>>>>>>> ran the training for 150,000 iterations, getting a 0.51 
>>>>>>>>>>>>>>>>>>> training error rate. 
>>>>>>>>>>>>>>>>>>> I was hoping to get reasonable accuracy. Unfortunately, 
>>>>>>>>>>>>>>>>>>> when I run the OCR using the .traineddata, the accuracy is 
>>>>>>>>>>>>>>>>>>> absolutely terrible. Do you think I made some mistakes, or 
>>>>>>>>>>>>>>>>>>> is that an expected result?
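For what it's worth, a quick way to compare a freshly built model against a stock one, in the same subprocess style as the scripts in this thread (a sketch: the model name oro, the stock model ben, the image name, and the paths are examples and must match your setup):

```python
import subprocess

# OCR the same page with the new model and a stock model, writing
# out_new.txt and out_default.txt for a side-by-side diff.
# TESSDATA_PREFIX must point at the folder holding oro.traineddata.
def compare_models(image, new_model="oro", stock_model="ben",
                   tessdata="./tesstrain/data"):
    commands = [
        f"TESSDATA_PREFIX={tessdata} tesseract {image} out_new -l {new_model}",
        f"tesseract {image} out_default -l {stock_model}",
    ]
    for command in commands:
        subprocess.run(command, shell=True)
    return commands
```

Diffing out_new.txt against out_default.txt makes it obvious whether the new model is close to, or far behind, the stock training.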
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, he doesn't mention all fonts but only one font. 
>>>>>>>>>>>>>>>>>>>> That is why he didn't use *MODEL_NAME* in a separate 
>>>>>>>>>>>>>>>>>>>> script file, I think.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Actually, here we feed in all the *tif, gt.txt, and .box 
>>>>>>>>>>>>>>>>>>>> files* which are created with *MODEL_NAME* (I mean the 
>>>>>>>>>>>>>>>>>>>> *eng, ben, oro* flag or language code), because when we 
>>>>>>>>>>>>>>>>>>>> first create the *tif, gt.txt, and .box files*, every 
>>>>>>>>>>>>>>>>>>>> filename starts with *MODEL_NAME*. This *MODEL_NAME* is 
>>>>>>>>>>>>>>>>>>>> what we selected in the training script for looping over 
>>>>>>>>>>>>>>>>>>>> each tif, gt.txt, and .box file created with *MODEL_NAME*.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the 
>>>>>>>>>>>>>>>>>>>>> folder structure as you did. Indeed, I have tried a 
>>>>>>>>>>>>>>>>>>>>> number of fine-tuning runs with a single font following 
>>>>>>>>>>>>>>>>>>>>> Garcia's video. But your script is much better because 
>>>>>>>>>>>>>>>>>>>>> it supports multiple fonts. The whole improvement you 
>>>>>>>>>>>>>>>>>>>>> made is brilliant, and very useful. It is all working for me.
>>>>>>>>>>>>>>>>>>>>> The only part that I didn't understand is the trick 
>>>>>>>>>>>>>>>>>>>>> you used in your tesseract_train.py script. You see, I 
>>>>>>>>>>>>>>>>>>>>> have been doing exactly what you did except for this script.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The script seems to have the trick of 
>>>>>>>>>>>>>>>>>>>>> sending/teaching each of the fonts (iteratively) into the 
>>>>>>>>>>>>>>>>>>>>> model. The script I have been using (which I got from 
>>>>>>>>>>>>>>>>>>>>> Garcia) doesn't mention fonts at all.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training 
>>>>>>>>>>>>>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata 
>>>>>>>>>>>>>>>>>>>>> MAX_ITERATIONS=10000*
>>>>>>>>>>>>>>>>>>>>> Does it mean that my model doesn't train on the fonts 
>>>>>>>>>>>>>>>>>>>>> (even if the fonts have been included in the splitting 
>>>>>>>>>>>>>>>>>>>>> process, in the other script)?
>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1. This is the training script; I have named it 
>>>>>>>>>>>>>>>>>>>>>> 'tesseract_training.py' inside the tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>> 2. The root directory means your main training folder, 
>>>>>>>>>>>>>>>>>>>>>> with the langdata, tesseract, and tesstrain folders 
>>>>>>>>>>>>>>>>>>>>>> inside it. If you watch this tutorial, 
>>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=KE4xEzFGSU8 , you 
>>>>>>>>>>>>>>>>>>>>>> will understand the folder structure better. I only 
>>>>>>>>>>>>>>>>>>>>>> created tesseract_training.py in the tesstrain folder 
>>>>>>>>>>>>>>>>>>>>>> for training; the FontList.py file is in the main path, 
>>>>>>>>>>>>>>>>>>>>>> alongside langdata, tesseract, tesstrain, and 
>>>>>>>>>>>>>>>>>>>>>> split_training_text.py.
>>>>>>>>>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your 
>>>>>>>>>>>>>>>>>>>>>> Linux fonts folder (/usr/share/fonts/), then run 
>>>>>>>>>>>>>>>>>>>>>> sudo apt update and then sudo fc-cache -fv.
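That font step can be scripted the same way as the training commands in this thread (a sketch: the ./fonts source path is an assumption, and fc-list is only there to print the exact family names that FontList.py and text2image need, character for character):

```python
import subprocess

# Install training fonts system-wide, rebuild the fontconfig cache, and
# list the registered font names. execute=False by default so the
# commands can be inspected before actually running them with sudo.
def install_fonts(font_dir="./fonts", execute=False):
    commands = [
        f"sudo cp {font_dir}/*.ttf /usr/share/fonts/",
        "sudo fc-cache -fv",
        "fc-list",  # copy the exact family names from this output
    ]
    if execute:
        for command in commands:
            subprocess.run(command, shell=True)
    return commands
```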
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> After that, you have to add the exact font names to 
>>>>>>>>>>>>>>>>>>>>>> the FontList.py file as I did.
>>>>>>>>>>>>>>>>>>>>>> I have attached two pictures of my folder structure: 
>>>>>>>>>>>>>>>>>>>>>> the first is the main structure and the second is the 
>>>>>>>>>>>>>>>>>>>>>> collapsed tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: 
>>>>>>>>>>>>>>>>>>>>>> Screenshot 2023-09-11 135014.png]
>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thank you so much for putting out these brilliant 
>>>>>>>>>>>>>>>>>>>>>>> scripts. They make the process  much more efficient.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I have one more question on the other script that 
>>>>>>>>>>>>>>>>>>>>>>> you use to train. 
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in 
>>>>>>>>>>>>>>>>>>>>>>> the same/root directory?
>>>>>>>>>>>>>>>>>>>>>>> How do you set up the names of the fonts in that file, 
>>>>>>>>>>>>>>>>>>>>>>> if you don't mind sharing?
>>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> You can use the new script below; it's better than 
>>>>>>>>>>>>>>>>>>>>>>>> the previous two scripts. You can create the *tif, 
>>>>>>>>>>>>>>>>>>>>>>>> gt.txt, and .box files* with multiple fonts, and if 
>>>>>>>>>>>>>>>>>>>>>>>> VS Code closes or anything else happens while the 
>>>>>>>>>>>>>>>>>>>>>>>> *tif, gt.txt, and .box files* are being created, you 
>>>>>>>>>>>>>>>>>>>>>>>> can use the checkpoint to resume from where it stopped.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The script for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Then create a file called "FontList.py" in the root 
>>>>>>>>>>>>>>>>>>>>>>>> directory and paste this into it:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Then it is imported in the script above.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> *For the checkpoint command:*
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0  --end 
>>>>>>>>>>>>>>>>>>>>>>>> 11
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Change the checkpoint range (--start 0 --end 11) as needed.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> *And the training checkpoint works as you already know.*
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi mhalidu, 
>>>>>>>>>>>>>>>>>>>>>>>>> the script you posted here seems much more 
>>>>>>>>>>>>>>>>>>>>>>>>> extensive than the one you posted before: 
>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I have been using your earlier script. It is 
>>>>>>>>>>>>>>>>>>>>>>>>> magical. How is this one different from the 
>>>>>>>>>>>>>>>>>>>>>>>>> earlier one?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. 
>>>>>>>>>>>>>>>>>>>>>>>>> They have saved me countless hours by running 
>>>>>>>>>>>>>>>>>>>>>>>>> multiple fonts in one sweep. I was not able to find 
>>>>>>>>>>>>>>>>>>>>>>>>> any instructions on how to train for multiple fonts. 
>>>>>>>>>>>>>>>>>>>>>>>>> The official manual is also unclear. Your script 
>>>>>>>>>>>>>>>>>>>>>>>>> helped me get started.
>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>>>>>>>>>>>>>>> One more thing: what is the role of the 
>>>>>>>>>>>>>>>>>>>>>>>>>> training_text lines? I have seen that Bengali text 
>>>>>>>>>>>>>>>>>>>>>>>>>> has long lines of words, so I want to know how many 
>>>>>>>>>>>>>>>>>>>>>>>>>> words or characters per line would be the better 
>>>>>>>>>>>>>>>>>>>>>>>>>> choice for training. And should 
>>>>>>>>>>>>>>>>>>>>>>>>>> '--xsize=3600', '--ysize=350' be set according 
>>>>>>>>>>>>>>>>>>>>>>>>>> to the number of words per line?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 
>>>>>>>>>>>>>>>>>>>>>>>>>> shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Include the default fonts also in your 
>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning list of fonts and see if that helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have trained some new fonts with the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning method for the Bengali language in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Tesseract 5, using the official training_text, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> tessdata_best, and everything else. Everything is 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> good, but the problem is that the default fonts 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> trained before no longer convert text as well as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> they used to, while my new fonts work well. I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> don't understand why this is happening. I am 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sharing the code so you can see what is going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_list, output_directory, start_line=None, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the starting line_count from the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # Set 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the ending line_count
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - 1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> through all the fonts in the font_list
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # Generate a unique serial number 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for each line
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_directory, f'{training_text_file_name}_{
>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial
>>>>>>>>>>>>>>>>>>>>>>>>>>>> }'  # Unique filename for each font
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={
>>>>>>>>>>>>>>>>>>>>>>>>>>>> output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '
>>>>>>>>>>>>>>>>>>>>>>>>>>>> --unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         # Reset font_serial for the next font 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> iteration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_count in the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>      
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *and for the training code:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
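Not necessarily the cause of the problem, but one hardening of the loop above: passing TESSDATA_PREFIX through subprocess's env= argument avoids going through the shell entirely. A sketch, using the same paths and make variables as the snippet (adjust to your layout):

```python
import os
import subprocess

def build_training_cmd(model_name="ben"):
    """Build the same `make training` invocation as the loop above,
    but hand TESSDATA_PREFIX to make via the environment instead of
    embedding it in a shell string."""
    env = dict(os.environ, TESSDATA_PREFIX="../tesseract/tessdata")
    cmd = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        "START_MODEL=ben",
        "TESSDATA=../tesseract/tessdata",
        "MAX_ITERATIONS=10000",
        "LANG_TYPE=Indic",
    ]
    return cmd, env

# To actually run it (from inside the tesstrain directory):
# cmd, env = build_training_cmd("ben")
# subprocess.run(cmd, env=env, check=True)
```

With a list argument and no shell=True, filenames with spaces or quotes cannot silently break the command line.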
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any suggestions on how to identify the problem?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
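One way to narrow down a "Can't encode transcription" error is to check whether every character in the ground-truth text actually appears in the unicharset being trained against. A minimal sketch; the .unicharset layout assumed here is the standard Tesseract one (count on the first line, then one unichar per line followed by its properties), and the inline toy data stands in for your real langdata/ben.unicharset and .gt.txt files:

```python
def load_unicharset(lines):
    """Parse .unicharset lines: skip the count on the first line, then
    take the first whitespace-separated token of each line (the unichar)."""
    chars = set()
    for line in lines[1:]:
        if line.strip():
            chars.add(line.split()[0])
    return chars

def missing_chars(text, unicharset):
    """Return the characters of `text` (whitespace excluded) that the
    unicharset cannot encode -- the usual trigger for
    "Can't encode transcription"."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in unicharset})

# Toy usage with inline data; in practice, read langdata/ben.unicharset
# and each ground-truth line file instead.
toy_unicharset = load_unicharset(["4", "NULL 0", "a 1", "b 1", "c 1"])
print(missing_chars("a bad cab", toy_unicharset))  # -> ['d']
```

Running this over every ground-truth line should point directly at the offending characters (often invisible ones such as zero-width joiners in Indic text).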
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you are 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> subscribed to the Google Groups "tesseract-ocr" 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> receiving emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
