Yeah, that is what I am getting as well. I was able to add the missing 
letter, but the overall accuracy became lower than the default model. 

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com 
wrote:

> Not a good result; that's why I have stopped training for now. The default 
> traineddata is overall better than training from scratch.
> On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com 
> wrote:
>
>> Hi Ali, 
>> How is your training going?
>> Do you get good results with training from scratch?
>>
>> On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:
>>
>>> Yes, I saw that two months ago when I started to learn OCR. It was very 
>>> helpful at the beginning.
>>> On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com 
>>> wrote:
>>>
>>>> Just saw this paper: https://osf.io/b8h7q
>>>>
>>>> On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 
>>>> mdalihu...@gmail.com wrote:
>>>>
>>>>> I will try some changes. Thanks.
>>>>>
>>>>> On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> I also faced that issue on Windows. Apparently, the issue is 
>>>>>> related to Unicode. You can try your luck by opening the file with "utf8" 
>>>>>> encoding instead of the plain "r" mode in the script.
>>>>>> I ended up installing Ubuntu because I was having too many errors on 
>>>>>> Windows.
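A minimal sketch of that change, assuming the failing line is a plain `open(path, 'r')` somewhere in the split script (the function name here is illustrative, not from the actual script):

```python
# Read the training text explicitly as UTF-8, so non-ASCII characters
# (e.g. Bengali script) decode correctly on Windows, where the default
# locale encoding is often not UTF-8.
def read_training_lines(path):
    with open(path, 'r', encoding='utf-8') as input_file:
        return [line.strip() for line in input_file]
```

Passing `encoding='utf-8'` (rather than relying on the platform default) is usually enough to make the same script behave identically on Windows and Ubuntu.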
>>>>>>
>>>>>> On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> 
>>>>>> wrote:
>>>>>>
>>>>>>> Have you faced this error: "Can't encode transcription"? If you have, 
>>>>>>> how did you solve it?
>>>>>>>
>>>>>>> On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 
>>>>>>> elvi...@gmail.com wrote:
>>>>>>>
>>>>>>>> I was using my own text
>>>>>>>>
>>>>>>>> On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Are you training from Tesseract's default text data or your own 
>>>>>>>>> collected text data?
>>>>>>>>> On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 
>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> I now get to 200000 iterations; and the error rate is stuck at 
>>>>>>>>>> 0.46. The result is absolutely trash: nowhere close to the 
>>>>>>>>>> default/Ray's 
>>>>>>>>>> training. 
>>>>>>>>>>
>>>>>>>>>> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 
>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> After Tesseract recognizes text from the images, you can apply 
>>>>>>>>>>> regex to replace the wrong words with the correct ones.
>>>>>>>>>>> I'm not familiar with PaddleOCR or ScanTailor either.
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 
>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> At what stage are you doing the regex replacement?
>>>>>>>>>>>> My process has been: Scan (tif)--> ScanTailor --> Tesseract --> 
>>>>>>>>>>>> pdf
>>>>>>>>>>>>
>>>>>>>>>>>> >EasyOCR I think is best for ID cards or something like that 
>>>>>>>>>>>> image process. but document images like books, here Tesseract is 
>>>>>>>>>>>> better 
>>>>>>>>>>>> than EasyOCR.
>>>>>>>>>>>>
>>>>>>>>>>>> How about PaddleOCR? Are you familiar with it?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I know what you mean, but in some cases it helps me. I have 
>>>>>>>>>>>>> found that specific characters and words are consistently missed by 
>>>>>>>>>>>>> Tesseract. That is why I use these regex replacements to fix those 
>>>>>>>>>>>>> characters and words when they come out incorrectly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> see what I have done: 
>>>>>>>>>>>>>
>>>>>>>>>>>>>    " ী": "ী",
>>>>>>>>>>>>>     " ্": " ",
>>>>>>>>>>>>>     " ে": " ",
>>>>>>>>>>>>>     জ্া: "জা",
>>>>>>>>>>>>>     "  ": " ",
>>>>>>>>>>>>>     "   ": " ",
>>>>>>>>>>>>>     "    ": " ",
>>>>>>>>>>>>>     "্প": " ",
>>>>>>>>>>>>>     " য": "র্য",
>>>>>>>>>>>>>     য: "য",
>>>>>>>>>>>>>     " া": "া",
>>>>>>>>>>>>>     আা: "আ",
>>>>>>>>>>>>>     ম্ি: "মি",
>>>>>>>>>>>>>     স্ু: "সু",
>>>>>>>>>>>>>     "হূ ": "হূ",
>>>>>>>>>>>>>     " ণ": "ণ",
>>>>>>>>>>>>>     র্্: "র",
>>>>>>>>>>>>>     "চিন্ত ": "চিন্তা ",
>>>>>>>>>>>>>     ন্া: "না",
>>>>>>>>>>>>>     "সম ূর্ন": "সম্পূর্ণ",
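A table like that can be applied after OCR with a small helper; this is only an illustrative sketch (the sample pairs below are placeholders, not the Bengali mappings above), applying longer keys first so overlapping fixes don't clobber each other:

```python
def post_correct(text, replacements):
    # Replace longer keys first so a short mapping (e.g. a single
    # space) cannot break a longer multi-character mapping.
    for wrong in sorted(replacements, key=len, reverse=True):
        text = text.replace(wrong, replacements[wrong])
    return text

# Placeholder mappings in the same spirit as the table above.
fixes = {"teh": "the", "   ": " ", "  ": " "}
```

The same function works unchanged with the Bengali pairs, since `str.replace` operates on Unicode strings.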
>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 
>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem for regex is that Tesseract is not consistent in 
>>>>>>>>>>>>>> its substitutions. 
>>>>>>>>>>>>>> Suppose the original English training data doesn't contain the 
>>>>>>>>>>>>>> letter /u/. What does Tesseract do when it faces /u/ in actual 
>>>>>>>>>>>>>> processing?
>>>>>>>>>>>>>> In some cases, it replaces it with closely similar letters 
>>>>>>>>>>>>>> such as /v/ and /w/. In other cases, it removes it completely. 
>>>>>>>>>>>>>> That is what is happening in my case: those characters are 
>>>>>>>>>>>>>> sometimes removed entirely; other times, they are replaced by 
>>>>>>>>>>>>>> closely resembling characters. 
>>>>>>>>>>>>>> Because of this inconsistency, applying regex is very difficult. 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If some specific characters or words are always missing 
>>>>>>>>>>>>>>> from the OCR result, then you can apply regular-expression 
>>>>>>>>>>>>>>> logic in your application. After OCR, those specific 
>>>>>>>>>>>>>>> characters or words are replaced by the correct characters or 
>>>>>>>>>>>>>>> words that you have defined. This can fix some major problems.
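For whole-word replacements, a regex with word boundaries is safer than a plain string replace, since it won't fire inside longer words; a small sketch with made-up word pairs (not from this thread):

```python
import re

def replace_words(text, corrections):
    # \b anchors keep each substitution from matching inside
    # longer words (e.g. 'tbe' inside 'lantbe' is left alone).
    for wrong, right in corrections.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text
```

In Python 3, `\b`/`\w` are Unicode-aware by default, so the same approach should also apply to Bengali words, though that is worth verifying on real output.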
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The characters are getting missed, even after fine-tuning. 
>>>>>>>>>>>>>>>> I never made any progress. I tried many different 
>>>>>>>>>>>>>>>> ways. Some  specific characters are always missing from the 
>>>>>>>>>>>>>>>> OCR result.  
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think EasyOCR is best for ID cards and similar images, 
>>>>>>>>>>>>>>>>> but for document images like books, Tesseract is better than 
>>>>>>>>>>>>>>>>> EasyOCR. I haven't actually used EasyOCR myself; you can try it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have added words to the dictionaries, but the result is the 
>>>>>>>>>>>>>>>>> same. 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What kind of problem did you face when fine-tuning with a few 
>>>>>>>>>>>>>>>>> new characters, as you said (*but, I failed in every 
>>>>>>>>>>>>>>>>> possible way to introduce a few new characters into the 
>>>>>>>>>>>>>>>>> database*)?
>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 
>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, we are new to this. I find the instructions (the 
>>>>>>>>>>>>>>>>>> manual) very hard to follow. The video you linked above was 
>>>>>>>>>>>>>>>>>> really helpful  
>>>>>>>>>>>>>>>>>> to get started. My plan at the beginning was to fine tune 
>>>>>>>>>>>>>>>>>> the existing 
>>>>>>>>>>>>>>>>>> .traineddata. But, I failed in every possible way to 
>>>>>>>>>>>>>>>>>> introduce a few new 
>>>>>>>>>>>>>>>>>> characters into the database. That is why I started from 
>>>>>>>>>>>>>>>>>> scratch. 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more 
>>>>>>>>>>>>>>>>>> iterations and see if I can improve. 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another area we need to explore is the usage of dictionaries. 
>>>>>>>>>>>>>>>>>> Maybe adding millions of words to the dictionary could help 
>>>>>>>>>>>>>>>>>> Tesseract. I don't have millions of words, but I am looking 
>>>>>>>>>>>>>>>>>> into some corpora to get more words into the dictionary. 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If this all fails, EasyOCR (and probably other similar 
>>>>>>>>>>>>>>>>>> open-source packages) is probably our next option to try. 
>>>>>>>>>>>>>>>>>> Sure, sharing our experiences will be helpful. I will let you 
>>>>>>>>>>>>>>>>>> know if I make good progress with any of these options. 
>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> How is your training going for Bengali? It was nearly 
>>>>>>>>>>>>>>>>>>> good, but I faced spacing problems between words: some 
>>>>>>>>>>>>>>>>>>> words have spaces, but most of them have no space. I think 
>>>>>>>>>>>>>>>>>>> the problem is in the dataset, but I used the default 
>>>>>>>>>>>>>>>>>>> Bengali (ben) training dataset from Tesseract, so I am 
>>>>>>>>>>>>>>>>>>> confused and have to explore more. By the way, you can try 
>>>>>>>>>>>>>>>>>>> what Lorenzo Blz said. Training from scratch is actually 
>>>>>>>>>>>>>>>>>>> harder than fine-tuning, so you can explore with different 
>>>>>>>>>>>>>>>>>>> datasets. If you succeed, please let me know how you did 
>>>>>>>>>>>>>>>>>>> the whole process. I'm also new to this field.
>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>>>>>>>>>>> I have been trying to train from scratch. I made about 
>>>>>>>>>>>>>>>>>>>> 64,000 lines of text (which produced about 255,000 files 
>>>>>>>>>>>>>>>>>>>> in the end) and ran the training for 150,000 iterations, 
>>>>>>>>>>>>>>>>>>>> getting a 0.51 training error rate. 
>>>>>>>>>>>>>>>>>>>> I was hoping to get reasonable accuracy. Unfortunately, 
>>>>>>>>>>>>>>>>>>>> when I run the OCR using the resulting .traineddata, the 
>>>>>>>>>>>>>>>>>>>> accuracy is absolutely terrible. Do you think I made some 
>>>>>>>>>>>>>>>>>>>> mistakes, or is that an expected result?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, he doesn't mention all fonts, only one font. 
>>>>>>>>>>>>>>>>>>>>> That is why he didn't use *MODEL_NAME* in a separate 
>>>>>>>>>>>>>>>>>>>>> script file, I think.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Actually, here we train on all the *tif, gt.txt, and .box 
>>>>>>>>>>>>>>>>>>>>> files* that are created under *MODEL_NAME* (I mean the 
>>>>>>>>>>>>>>>>>>>>> *eng, ben, or oro flag, i.e. the language code*), because 
>>>>>>>>>>>>>>>>>>>>> when we first create the *tif, gt.txt, and .box files*, 
>>>>>>>>>>>>>>>>>>>>> every file name starts with *MODEL_NAME*. This 
>>>>>>>>>>>>>>>>>>>>> *MODEL_NAME* is the one we select in the training script 
>>>>>>>>>>>>>>>>>>>>> for looping over each tif, gt.txt, and .box file created 
>>>>>>>>>>>>>>>>>>>>> under that *MODEL_NAME*.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the 
>>>>>>>>>>>>>>>>>>>>>> folder structure as you did. Indeed, I have tried a 
>>>>>>>>>>>>>>>>>>>>>> number of fine-tuning runs with a single font, following 
>>>>>>>>>>>>>>>>>>>>>> Garcia's video. But your script is much better because it 
>>>>>>>>>>>>>>>>>>>>>> supports multiple fonts. The whole improvement you made is 
>>>>>>>>>>>>>>>>>>>>>> brilliant and very useful. It is all working for me. 
>>>>>>>>>>>>>>>>>>>>>> The only part that I didn't understand is the trick 
>>>>>>>>>>>>>>>>>>>>>> you used in your tesseract_train.py script. You see, I 
>>>>>>>>>>>>>>>>>>>>>> have been doing exactly what you did, except for this script. 
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The script seems to have the trick of sending/teaching 
>>>>>>>>>>>>>>>>>>>>>> each of the fonts (iteratively) to the model. The script 
>>>>>>>>>>>>>>>>>>>>>> I have been using (which I got from Garcia) doesn't 
>>>>>>>>>>>>>>>>>>>>>> mention fonts at all. 
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training 
>>>>>>>>>>>>>>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata 
>>>>>>>>>>>>>>>>>>>>>> MAX_ITERATIONS=10000*
>>>>>>>>>>>>>>>>>>>>>> Does it mean that my model doesn't train on the fonts 
>>>>>>>>>>>>>>>>>>>>>> (even if the fonts have been included in the splitting 
>>>>>>>>>>>>>>>>>>>>>> process, in the other script)?
>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1. This is the training script, which I have named 'tesseract_training.py', inside the tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>> 2. The root directory means your main training folder, 
>>>>>>>>>>>>>>>>>>>>>>> which contains the langdata, tesseract, and tesstrain 
>>>>>>>>>>>>>>>>>>>>>>> folders. If you watch this tutorial 
>>>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=KE4xEzFGSU8 you will 
>>>>>>>>>>>>>>>>>>>>>>> understand the folder structure better. I only created 
>>>>>>>>>>>>>>>>>>>>>>> tesseract_training.py in the tesstrain folder for 
>>>>>>>>>>>>>>>>>>>>>>> training, and the FontList.py file is in the main path 
>>>>>>>>>>>>>>>>>>>>>>> alongside langdata, tesseract, tesstrain, and 
>>>>>>>>>>>>>>>>>>>>>>> split_training_text.py.
>>>>>>>>>>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your 
>>>>>>>>>>>>>>>>>>>>>>> Linux fonts folder (/usr/share/fonts/), then run 
>>>>>>>>>>>>>>>>>>>>>>> sudo apt update and then sudo fc-cache -fv.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> After that, you have to add the exact font names in the 
>>>>>>>>>>>>>>>>>>>>>>> FontList.py file, as I did.
>>>>>>>>>>>>>>>>>>>>>>> I have attached two pictures of my folder structure: the 
>>>>>>>>>>>>>>>>>>>>>>> first is the main structure and the second is the 
>>>>>>>>>>>>>>>>>>>>>>> collapsed tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: 
>>>>>>>>>>>>>>>>>>>>>>> Screenshot 2023-09-11 135014.png] 
>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thank you so much for putting out these brilliant 
>>>>>>>>>>>>>>>>>>>>>>>> scripts. They make the process  much more efficient.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I have one more question on the other script that 
>>>>>>>>>>>>>>>>>>>>>>>> you use to train. 
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in 
>>>>>>>>>>>>>>>>>>>>>>>> the same/root directory?
>>>>>>>>>>>>>>>>>>>>>>>> How do you set up the names of the fonts in that 
>>>>>>>>>>>>>>>>>>>>>>>> file, if you don't mind sharing?
>>>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> You can use the new script below; it's better than 
>>>>>>>>>>>>>>>>>>>>>>>>> the previous two scripts. You can create the *tif, 
>>>>>>>>>>>>>>>>>>>>>>>>> gt.txt, and .box files* with multiple fonts, and it 
>>>>>>>>>>>>>>>>>>>>>>>>> also keeps a checkpoint: if VS Code closes, or anything 
>>>>>>>>>>>>>>>>>>>>>>>>> interrupts the creation of the *tif, gt.txt, and .box 
>>>>>>>>>>>>>>>>>>>>>>>>> files*, you can use the checkpoint to resume from where 
>>>>>>>>>>>>>>>>>>>>>>>>> VS Code was closed.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, 
>>>>>>>>>>>>>>>>>>>>>>>>> font_list, output_directory, start_line=None, 
>>>>>>>>>>>>>>>>>>>>>>>>> end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as 
>>>>>>>>>>>>>>>>>>>>>>>>> input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>>>>>>>>>>         for line_index in range(start_line, 
>>>>>>>>>>>>>>>>>>>>>>>>> end_line + 1):
>>>>>>>>>>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Then create a file called "FontList.py" in the root 
>>>>>>>>>>>>>>>>>>>>>>>>> directory and paste this into it:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> It is then imported in the script above.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> *For the checkpoint command:*
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Change the checkpoint values --start 0 --end 11 as you need.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> *and training checkpoint as you know already.*
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi mhalidu, 
>>>>>>>>>>>>>>>>>>>>>>>>>> the script you posted here seems much more 
>>>>>>>>>>>>>>>>>>>>>>>>>> extensive than the one you posted before: 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I have been using your earlier script. It is 
>>>>>>>>>>>>>>>>>>>>>>>>>> magical. How is this one different from the earlier one?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. 
>>>>>>>>>>>>>>>>>>>>>>>>>> They have saved me countless hours by running multiple 
>>>>>>>>>>>>>>>>>>>>>>>>>> fonts in one sweep. I was not able to find any 
>>>>>>>>>>>>>>>>>>>>>>>>>> instructions on how to train for multiple fonts, and 
>>>>>>>>>>>>>>>>>>>>>>>>>> the official manual is also unclear. Your script 
>>>>>>>>>>>>>>>>>>>>>>>>>> helped me get started. 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>>>>>>>>>>>>>>>> One more thing: what role do the training_text 
>>>>>>>>>>>>>>>>>>>>>>>>>>> lines play? I have seen that Bengali text lines 
>>>>>>>>>>>>>>>>>>>>>>>>>>> contain many long words, so I want to know how many 
>>>>>>>>>>>>>>>>>>>>>>>>>>> words or characters per line would be the better 
>>>>>>>>>>>>>>>>>>>>>>>>>>> choice for training. And should '--xsize=3600', 
>>>>>>>>>>>>>>>>>>>>>>>>>>> '--ysize=350' be set according to the words per line?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 
>>>>>>>>>>>>>>>>>>>>>>>>>>> shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Include the default fonts also in your 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning list of fonts and see if that helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have trained some new fonts with the fine-tuning 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> method for the Bengali language in Tesseract 5, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the official training_text, tessdata_best, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the other required files. Everything is good, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but the problem is that the default fonts, which 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> were trained before, no longer convert text as well 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as they did, while my new fonts work well. I don't 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> understand why this is happening. I am sharing the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> code below so you can see what is going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_list, output_directory, start_line=None, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the starting line_count from the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Set the ending line_count
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - 1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through all the fonts in the font_list
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # utf-8 avoids the Windows encoding errors discussed earlier in the thread
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w', encoding='utf-8') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each line
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
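The script above calls `write_line_count()` (and presumably reads the counter back on startup) but those helpers aren't shown in this excerpt. A minimal sketch of what they might look like, assuming the counter is simply persisted to a small state file so the script can resume numbering between runs (the `line_count.txt` path is my guess, not from the original script):

```python
import os

COUNT_FILE = 'line_count.txt'  # assumed state-file path (hypothetical)

def read_line_count(default=0):
    """Return the last saved line count, or `default` if no state file exists."""
    if os.path.exists(COUNT_FILE):
        with open(COUNT_FILE, encoding='utf-8') as f:
            return int(f.read().strip())
    return default

def write_line_count(line_count):
    """Persist the current line count so the next run can resume numbering."""
    with open(COUNT_FILE, 'w', encoding='utf-8') as f:
        f.write(str(line_count))
```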
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *And the training code:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of model names
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any suggestions on how to identify the problem?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you are 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> subscribed to the Google Groups "tesseract-ocr" 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> receiving emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>