600,000 lines of text, and the iterations were higher than 600,000. But sometimes I get better results in fine-tuning with less data, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 [email protected] wrote:
> How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:

Yah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 [email protected] wrote:

Not good results; that's why I have stopped training for now. The default traineddata is overall better than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 [email protected] wrote:

Hi Ali,
How is your training going?
Do you get good results with the training-from-scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

Yes, two months ago when I started to learn OCR I saw that. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 [email protected] wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 [email protected] wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 [email protected] wrote:

I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script (i.e. opening the file with an explicit UTF-8 encoding). I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Have you faced this error: "Can't encode transcription"? If so, how did you solve it?
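The Unicode fix mentioned above, opening the training text with an explicit UTF-8 encoding instead of the platform default, can be sketched as below. This is a minimal sketch; `read_training_lines` and the path handling are illustrative, not taken from the thread's scripts:

```python
def read_training_lines(path):
    # On Windows, open(path, 'r') decodes with the ANSI code page, which
    # breaks on non-Latin scripts such as Bengali and can surface later as
    # "Can't encode transcription". Forcing UTF-8 makes the read portable.
    with open(path, 'r', encoding='utf8') as f:
        return [line.strip() for line in f]
```

The same `encoding='utf8'` argument can be added to the `open()` calls in the splitting scripts later in this thread.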
On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from the Tesseract default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 [email protected] wrote:

I am now at 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 [email protected] wrote:

After Tesseract recognizes text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 [email protected] wrote:

At what stage are you doing the regex replacement?
My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 [email protected] wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are always not recognized by Tesseract.
That's why I use these regex replacements to fix those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
"জ্া": "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
"য": "য",
" া": "া",
"আা": "আ",
"ম্ি": "মি",
"স্ু": "সু",
"হূ ": "হূ",
" ণ": "ণ",
"র্্": "র",
"চিন্ত ": "চিন্তা ",
"ন্া": "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/; in other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed, and at other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected] wrote:

If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expressions method in your application. After OCR, these specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. This can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected] wrote:

The characters are getting missed, even after fine-tuning.
I never made any progress. I tried many different ways. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 [email protected] wrote:

EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected] wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow.
The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata. But I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 [email protected] wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I use the default training dataset from Tesseract that is used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 [email protected] wrote:

How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] wrote:

Yes, he doesn't mention all fonts but only one font.
That's why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files created under MODEL_NAME (I mean the eng, ben, or oro flag, i.e. the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script to loop over each of the tif, gt.txt, and .box files created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful; it is all working for me. The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] wrote:

import subprocess

# List of model names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000")
    subprocess.run(command, shell=True)

1. This is the training command script, which I have named 'tesseract_training.py', inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better.
I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file like I did.
I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train:

import subprocess

# List of model names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000")
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] wrote:

You can use the new script below. It's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports resuming: if VS Code closes or anything happens while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where you left off.
Command for creating the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0
    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem
            line_serial = f"{line_index:d}"
            file_base_name = (f'{training_text_file_name}_{line_serial}'
                              f'_{font_name.replace(" ", "_")}')

            line_gt_text = os.path.join(output_directory,
                                        f'{file_base_name}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int,
                        help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int,
                        help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory,
                         args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this (note the comma after every font name):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the code above.

To resume from a checkpoint, run:

sudo python3 split_training_text.py --start 0 --end 11
Adjust the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] wrote:

Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts; the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] wrote:

OK, I will try as you said. One more thing: what role do the trained_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official trained_text, tessdata_best, and the other required files. Everything is good, but the problem is that the default fonts, which were trained before, no longer convert text as well as they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code to help you understand what's going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *codes for creating tif, gt.txt, .box >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> files:* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def read_line_count(): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if os.path.exists('line_count.txt'): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open('line_count.txt', 'r') as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> return int(file.read()) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> return 0 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open('line_count.txt', 'w') as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file.write(str(line_count)) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_list, output_directory, start_line=None, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> end_line=None): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lines = [] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with open(training_text_file, 'r') as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input_file: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for line in input_file.readlines(): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lines.append(line.strip()) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if not os.path.exists(output_directory): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> os.mkdir(output_directory) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> random.shuffle(lines) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if start_line is None: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line_count = 
        read_line_count()  # Resume line_count from the checkpoint file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Default to the last line
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all fonts in font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (ground truth) text filename
            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename: unique for each line
            file_base_name = f'ben_{line_serial}'

            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count checkpoint file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list,
                         output_directory, args.start, args.end)

and for the training code:

import subprocess

# List of model names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata "
               f"MAX_ITERATIONS=10000 LANG_TYPE=Indic")
    subprocess.run(command, shell=True)

Any suggestions on how to identify the problem?
Thanks, everyone.
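[Editor's note] One likely source of the "Can't encode transcription" error discussed in this thread: the script above writes the ground-truth file with open(line_gt_text, 'w'), which uses the platform default codec. On Windows that is usually cp1252, which cannot encode Bengali text. A minimal sketch of the same write with an explicit UTF-8 encoding (the helper name and sample line here are illustrative, not from the original script):

```python
import os
import tempfile

def write_gt_line(output_directory, base_name, line):
    """Write one ground-truth line as UTF-8, regardless of the platform default."""
    gt_path = os.path.join(output_directory, f"{base_name}.gt.txt")
    # encoding='utf-8' avoids UnicodeEncodeError on Windows (default cp1252)
    with open(gt_path, "w", encoding="utf-8") as f:
        f.write(line)
    return gt_path

# Illustrative usage with a Bengali sample line
out_dir = tempfile.mkdtemp()
path = write_gt_line(out_dir, "ben_1", "আমার সোনার বাংলা")
with open(path, encoding="utf-8") as f:
    print(f.read())
```

Reading the training text the same way (encoding='utf-8' on every open() call) is the counterpart fix mentioned earlier in the thread.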
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.

