How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:
Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com wrote:
Not good results; that's why I have stopped training now. The default traineddata is better overall than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com wrote:
Hi Ali,
How is your training going? Are you getting good results with training from scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:
Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com wrote:
Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:
I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:
I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:
Have you faced this error: "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:
I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:
Are you training from Tesseract's default text data or your own collected text data?
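[For reference: the encoding fix mentioned above amounts to passing an explicit encoding when the training script opens its text files. A minimal sketch, assuming a helper of my own naming rather than anything from the scripts in this thread:]

```python
def read_training_lines(path):
    # Force UTF-8 regardless of the platform default. On Windows, open()
    # falls back to the locale codepage (e.g. cp1252), which raises
    # UnicodeDecodeError on Bengali text; encoding="utf-8" avoids that.
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```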
On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:
I am now at 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:
After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:
At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:
I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

See what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    "জ্া": "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    "য": "য",
    " া": "া",
    "আা": "আ",
    "ম্ি": "মি",
    "স্ু": "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    "র্্": "র",
    "চিন্ত ": "চিন্তা ",
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:
The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ or /w/; in other cases, it removes it completely. That is what is happening in my case: those characters are sometimes completely removed, and other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:
If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expressions method in your application.
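[A replacement map like the one shown above can be applied with plain string replacement rather than full regex. A minimal sketch, using two entries from the list as illustrative examples:]

```python
# Illustrative subset of the wrong -> correct map above; extend it with
# the full list of pairs.
REPLACEMENTS = {
    "চিন্ত ": "চিন্তা ",
    "আা": "আ",
}

def post_correct(text):
    # Apply each fix in order. Plain str.replace is enough when the
    # patterns are literal character sequences rather than regex classes.
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    return text
```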
After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:
The characters are getting missed even after fine-tuning. I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:
EasyOCR, I think, is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning on a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:
Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:
> How is your training going for Bengali?
It was nearly good, but I faced spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract, the one used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can use different datasets to explore.
If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:
How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:
Yes, he mentions only one font, not all of them. That's why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean eng, ben, oro — the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:
Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts have been included in the splitting process, in the other script)?
On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This is the command for training the data; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file goes in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, and then run sudo apt update followed by sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:
Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question, on the other script that you use to train:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?
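[One pitfall with the font-installation steps above is that the names in FontList.py must exactly match what fontconfig reports after fc-cache. A small sketch for checking that by parsing the output of `fc-list : family` (the helper function is my own, not from the thread's scripts):]

```python
def parse_font_families(fc_list_output):
    # `fc-list : family` prints one line per font file; a single line may
    # carry several comma-separated family aliases.
    families = set()
    for line in fc_list_output.splitlines():
        for name in line.split(","):
            if name.strip():
                families.add(name.strip())
    return families
```

Feeding it `subprocess.run(["fc-list", ":", "family"], capture_output=True, text=True).stdout` and checking that every FontList entry is present would catch misspelled font names before hours of image generation.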
On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:
You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports checkpoints: if VS Code closes (or anything else happens) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.

Script for the tif, gt.txt, and .box files:

    import os
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0
        if end_line is None:
            end_line = len(lines) - 1

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()

                training_text_file_name = pathlib.Path(training_text_file).stem
                line_serial = f"{line_index:d}"

                line_gt_text = os.path.join(
                    output_directory,
                    f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:

    class FontList:
        def __init__(self):
            self.fonts = [
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

Then import it in the script above.

For the checkpoint command:
    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:
Hi mdalihu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:
OK, I will try as you said.

One more thing: what should the training-text lines be like? I have seen that Bengali text has long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
Include the default fonts in your fine-tuning list of fonts as well, and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:
I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official trained_text, tessdata_best, and everything else. Everything is good, except that the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can understand what is going on.
Code for creating the tif, gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def read_line_count():
        if os.path.exists('line_count.txt'):
            with open('line_count.txt', 'r') as file:
                return int(file.read())
        return 0

    def write_line_count(line_count):
        with open('line_count.txt', 'w') as file:
            file.write(str(line_count))

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            for line in input_file.readlines():
                lines.append(line.strip())

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        random.shuffle(lines)

        if start_line is None:
            line_count = read_line_count()  # Set the starting line_count from the file
        else:
            line_count = start_line

        if end_line is None:
            end_line_count = len(lines) - 1  # Set the ending line_count
        else:
            end_line_count = min(end_line, len(lines) - 1)

        for font in font_list.fonts:  # Iterate through all the fonts in the font_list
            font_serial = 1
            for line in lines:
                training_text_file_name = pathlib.Path(training_text_file).stem

                # Generate a unique serial number for each line
                line_serial = f"{line_count:d}"

                # GT (ground truth) text filename
                line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Image filename
                file_base_name = f'ben_{line_serial}'  # Unique filename for each font
                subprocess.run([
                    'text2image',
                    f'--font={font}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=350',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/ben.unicharset',
                ])

                line_count += 1
                font_serial += 1

            # Reset font_serial for the next font iteration
            font_serial = 1

        write_line_count(line_count)  # Update the line_count in the file

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/ben.training_text'
        output_directory = 'tesstrain/data/ben-ground-truth'

        # Create an instance of the FontList class
        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
        subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
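Since the question above is how to pin down where the training data goes wrong, one quick first pass is to check whether every codepoint in the generated .gt.txt lines is covered by some entry in the unicharset passed to text2image. A minimal sketch — the helper name and the per-codepoint approach are my assumptions, not from the thread, and Bengali grapheme clusters can span several codepoints, so hits are only candidates for a genuinely missing character:

```python
# Rough coverage check: which codepoints appear in the ground-truth
# texts but in no unicharset entry?  A per-codepoint check is only a
# coarse filter for Indic scripts, where one unichar may be a
# multi-codepoint cluster.
def uncovered_codepoints(gt_texts, unicharset_entries):
    # Every codepoint occurring inside any unicharset entry counts as known.
    known = {ch for entry in unicharset_entries for ch in entry}
    return sorted({ch for text in gt_texts for ch in text
                   if not ch.isspace() and ch not in known})

# Tiny demo with stand-in data:
print(uncovered_codepoints(['abc x'], ['a', 'b', 'x']))  # prints ['c']
```

In practice gt_texts would come from tesstrain/data/ben-ground-truth/*.gt.txt and unicharset_entries from the first token of each line of langdata/ben.unicharset (skipping the leading count line), with both files opened with encoding='utf-8' — the same encoding pitfall mentioned earlier in the thread.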