I haven't tried cutting the top layer of the network. Can you share what you did when you cut the top layer, or a link to a GitHub project?

On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6 desal...@gmail.com wrote:
That is massive data. Have you tried training by cutting the top layer of the network? I think that is the most promising approach. I was getting really good results with it, but the results do not carry over to scanned documents; I get the best results on the synthetic data. I am now experimenting with the settings in text2image to see whether it is possible to emulate scanned documents.

I also suspect that the '--char_spacing=1.0' setting in our setup is causing trouble. Scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.
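For reference, "cutting the top layer" is the replace-top-layer fine-tuning path described in the Tesseract LSTM training docs: extract the LSTM network from an existing traineddata, then continue training with --append_index so everything above that layer is rebuilt against your unicharset. A rough sketch in the same subprocess style as the scripts below; the paths, the ben starter model, the iteration count, and the O1c output size are placeholders to adapt:

import subprocess

# Pull the LSTM network out of a tessdata_best model
# (the integerized "fast" models cannot be used for training).
subprocess.run(['combine_tessdata', '-e', 'tessdata/best/ben.traineddata', 'ben.lstm'])

# Continue from the extracted network, cutting it at layer index 5 and
# appending a fresh LSTM plus output layer sized for the new unicharset.
subprocess.run([
    'lstmtraining',
    '--continue_from', 'ben.lstm',
    '--old_traineddata', 'tessdata/best/ben.traineddata',
    '--traineddata', 'data/ben/ben.traineddata',  # starter traineddata built with YOUR unicharset
    '--append_index', '5',
    '--net_spec', '[Lfx256 O1c111]',  # 111 is an example; use your unicharset size
    '--model_output', 'output/ben_cut',
    '--train_listfile', 'data/ben/all-lstmf.list',
    '--max_iterations', '10000',
])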
On Sunday, October 22, 2023 at 4:09:46 AM UTC+3 mdalihu...@gmail.com wrote:

600,000 lines of text, and the iterations higher than 600,000. But sometimes I got a better result with fewer iterations when fine-tuning, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 desal...@gmail.com wrote:

How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:

Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com wrote:

Not a good result; that's why I have stopped training for now. The default traineddata is overall better than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 desal...@gmail.com wrote:

Hi Ali,
How is your training going? Do you get good results with the training from scratch?

On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

Yes, two months ago when I started to learn OCR I saw that. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com wrote:

I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.
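In case it helps others hitting the same thing: the "r" to "utf8" change most likely means passing an explicit encoding when the split script opens the training text, since Windows otherwise uses a legacy code page and mangles non-ASCII characters. The one-line change would look like this in the scripts below:

with open(training_text_file, 'r', encoding='utf-8') as input_file:
    lines = input_file.readlines()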
On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:

Have you faced this error: "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:

Are you training from the Tesseract default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:

I have now got to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:

After Tesseract recognizes the text from the images, you can apply a regex to replace the wrong words with the correct words. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or something like that image process. but document images like books, here Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?
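As an aside on that pipeline: Tesseract can produce the searchable PDF directly from the cleaned TIFFs, while a plain-text output is what a regex pass would operate on. A minimal sketch (file and language names are placeholders):

import subprocess

# Searchable PDF straight from a cleaned page image.
subprocess.run(['tesseract', 'page_0001.tif', 'page_0001', '-l', 'ben', 'pdf'])

# Plain text for post-processing (regex fix-ups, dictionary checks).
subprocess.run(['tesseract', 'page_0001.tif', 'page_0001', '-l', 'ben', 'txt'])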
On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regex replacements to correct those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",
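A map like this can be applied in one pass over the OCR output before any other post-processing. A small sketch using plain string replacement; the two entries are taken from the list above, extend as needed:

# Post-OCR fix-up map: wrong sequence -> corrected sequence.
ocr_fixes = {
    'আা': 'আ',
    'সম ূর্ন': 'সম্পূর্ণ',
}

def clean_ocr_text(text):
    # Apply longer patterns first so they are not clobbered by shorter substrings.
    for wrong, right in sorted(ocr_fixes.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(wrong, right)
    return text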
On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/. In other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed; other times, they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expressions method in your application. After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can solve some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:

The characters are getting missed, even after fine-tuning. I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:

EasyOCR, I think, is best for ID cards and that kind of image; for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
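On the dictionary idea: for the LSTM models the word list is compiled into the .traineddata as a dawg, so extending it means unpacking the model, appending words, and repacking. A sketch with the stock training tools; the ben paths are placeholders, and only the word dawg is touched:

import subprocess

# Unpack the traineddata components into ben.* files.
subprocess.run(['combine_tessdata', '-u', 'ben.traineddata', 'ben.'])

# Dump the existing word dawg to a plain word list (one word per line).
subprocess.run(['dawg2wordlist', 'ben.lstm-unicharset', 'ben.lstm-word-dawg', 'words.txt'])

# ... append corpus words to words.txt here ...

# Rebuild the dawg and overwrite it inside the traineddata.
subprocess.run(['wordlist2dawg', 'words.txt', 'ben.lstm-word-dawg', 'ben.lstm-unicharset'])
subprocess.run(['combine_tessdata', '-o', 'ben.traineddata', 'ben.lstm-word-dawg'])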
On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

"How is your training going for Bengali?" It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have none. I think the problem is in the dataset, but I use the default training dataset from Tesseract, the one used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did this whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts but only one font, so I think he didn't use MODEL_NAME in a separate script file.

Actually, here we teach all the tif, gt.txt, and .box files which are created under MODEL_NAME (I mean the eng, ben, oro flag or language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under that MODEL_NAME.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This command is for training; I have named the script 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file is on the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.

After that, you have to add the exact font names in the FontList.py file like me.
I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.
[Screenshots: Screenshot 2023-09-11 134947.png, Screenshot 2023-09-11 135014.png]
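One gotcha with step 3: text2image only renders with a font name that exactly matches what Fontconfig reports, so it is worth checking the names before putting them into FontList.py. A quick check; the fonts_dir here is the default system path:

import subprocess

# Ask text2image which font names it can actually see.
subprocess.run(['text2image', '--list_available_fonts', '--fonts_dir=/usr/share/fonts'])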
On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

You can use the new script below. It's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also gives you a breakpoint: if VS Code closes, or anything else happens while creating the tif, gt.txt, and .box files, you can use the checkpoint to resume from where VS Code closed.

The script for the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
Then create a file called FontList.py in the root directory and paste this:

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the above code.

For the breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) according to your needs. And the training checkpoint works as you already know.
On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

Hi mdalihu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com.

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what is the role of the training_text lines? I have seen that Bengali texts are long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tune method for the Bengali language in Tesseract 5, and I used all the official training_text, tessdata_best, and the other pieces as well. Everything is good, but the problem is that the default fonts, which were trained before, no longer convert text like they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can understand what is going on.
The code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for pinpointing the problem?
Thanks, everyone
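For anyone trying to pin down a regression like this: one measurable check is to score the fine-tuned checkpoint and the base model on the same held-out ground truth with lstmeval, which prints character and word error rates. A sketch, assuming a tesstrain-style layout where data/ben/list.eval lists the evaluation .lstmf files:

import subprocess

# Compare the fine-tuned checkpoint and the base model on identical data.
for model in ['data/ben/checkpoints/ben_checkpoint', 'ben.lstm']:
    subprocess.run([
        'lstmeval',
        '--model', model,
        '--traineddata', 'data/ben/ben.traineddata',
        '--eval_listfile', 'data/ben/list.eval',
    ])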