Thanks. I will try this method as soon as possible.

On Sunday, 22 October, 2023 at 3:49:46 pm UTC+6 desal...@gmail.com wrote:

> here it is: 
> https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md
>
> On Sunday, October 22, 2023 at 12:45:40 PM UTC+3 Des Bw wrote:
>
>> This is the code I used to train from a layer:
>> make training MODEL_NAME=amh START_MODEL=amh APPEND_INDEX=5 NET_SPEC='[Lfx256 O1c105]' TESSDATA=../tesseract/tessdata EPOCHS=3 TARGET_ERROR_RATE=0.0001 training >> data/amh.log &
>> - I took it from Shreeshrii's tesstrain-JSTORArabic training:
>> https://github.com/Shreeshrii/tesstrain-JSTORArabic
>>
>> - The net_spec of ben might not be the same as amh's. Shreeshrii has posted a link to the netspecs of the languages in this forum.
>>
>> On Sunday, October 22, 2023 at 12:09:25 PM UTC+3 mdalihu...@gmail.com 
>> wrote:
>>
>>> You can test by changing '--char_spacing=1.0'. I think it may also be causing accuracy problems in the results.
>>> On Sunday, 22 October, 2023 at 3:07:16 pm UTC+6 Ali hussain wrote:
>>>
>>>> I haven't tried cutting the top layer of the network. Can you share what you did to cut the top layer, or a GitHub project link?
>>>> On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6 desal...@gmail.com 
>>>> wrote:
>>>>
>>>>> That is massive data. Have you tried to train by cutting the top layer of the network?
>>>>> I think that is the most promising approach. I was getting really good results with that, but the results do not carry over to scanned documents; I get the best results with the synthetic data. I am now experimenting with the text2image settings to see whether it is possible to emulate scanned documents.
>>>>> I also suspect that the '--char_spacing=1.0' setting in our setup is causing more trouble. Scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.
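A minimal sketch of the suggestion above (not the poster's actual script): filter the --char_spacing flag out of a text2image argument list so the generated images use the font's natural, near-zero spacing, closer to real scanned documents. The font and file paths are placeholders.

```python
def drop_char_spacing(args):
    # Remove any --char_spacing flag; text2image then falls back to
    # the font's natural spacing, closer to scanned documents.
    return [a for a in args if not a.startswith('--char_spacing')]

args = ['text2image', '--font=Sagar Medium', '--char_spacing=1.0', '--exposure=0']
args = drop_char_spacing(args)  # ['text2image', '--font=Sagar Medium', '--exposure=0']
```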
>>>>>
>>>>> On Sunday, October 22, 2023 at 4:09:46 AM UTC+3 mdalihu...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> 600,000 lines of text, and the iterations were higher than 600,000. But sometimes I got a better result with fewer iterations in fine-tuning, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.
>>>>>> On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 desal...@gmail.com 
>>>>>> wrote:
>>>>>>
>>>>>>> How many lines of text and iterations did you use?
>>>>>>>
>>>>>>> On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:
>>>>>>>
>>>>>>>> Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.
>>>>>>>>
>>>>>>>> On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 
>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> Not a good result. That's why I have stopped training for now. The default traineddata is better overall than training from scratch.
>>>>>>>>> On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 
>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ali, 
>>>>>>>>>> How is your training going?
>>>>>>>>>> Do you get good results with training from scratch?
>>>>>>>>>>
>>>>>>>>>> On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.
>>>>>>>>>>> On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 
>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Just saw this paper: https://osf.io/b8h7q
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 
>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I will try some changes. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 
>>>>>>>>>>>>> elvi...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by opening the file with encoding "utf8" instead of just "r" in the script.
>>>>>>>>>>>>>> I ended up installing Ubuntu because I was having too many errors on Windows.
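The change meant here is passing an explicit encoding when the script opens its text files. A minimal sketch, with a hypothetical function name rather than the actual training script:

```python
# Open text files with an explicit UTF-8 encoding instead of the
# platform default, which on Windows is often cp1252 and cannot
# decode Bengali text.
def read_training_lines(path):
    with open(path, 'r', encoding='utf8') as f:  # was: open(path, 'r')
        return f.readlines()
```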
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 14, 2023, 9:33 AM Ali hussain <
>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Did you face this error: "Can't encode transcription"? If so, how did you solve it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 
>>>>>>>>>>>>>>> elvi...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was using my own text
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 14, 2023, 6:58 AM Ali hussain <
>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are you training from Tesseract's default text data or your own collected text data?
>>>>>>>>>>>>>>>>> On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 
>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 
>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After Tesseract recognizes text from the images, you can apply regex to replace the wrong words with the correct ones.
>>>>>>>>>>>>>>>>>>> I'm not familiar with PaddleOCR or ScanTailor either.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 
>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> At what stage are you doing the regex replacement?
>>>>>>>>>>>>>>>>>>>> My process has been: Scan (tif)--> ScanTailor --> 
>>>>>>>>>>>>>>>>>>>> Tesseract --> pdf
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> >EasyOCR I think is best for ID cards or something like 
>>>>>>>>>>>>>>>>>>>> that image process. but document images like books, here 
>>>>>>>>>>>>>>>>>>>> Tesseract is 
>>>>>>>>>>>>>>>>>>>> better than EasyOCR.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> How about PaddleOCR? Are you familiar with it?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract. That is why I use these regex rules to replace those characters and words when they come out wrong.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> see what I have done: 
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>     " ী": "ী",
>>>>>>>>>>>>>>>>>>>>>     " ্": " ",
>>>>>>>>>>>>>>>>>>>>>     " ে": " ",
>>>>>>>>>>>>>>>>>>>>>     "জ্া": "জা",
>>>>>>>>>>>>>>>>>>>>>     "  ": " ",
>>>>>>>>>>>>>>>>>>>>>     "   ": " ",
>>>>>>>>>>>>>>>>>>>>>     "    ": " ",
>>>>>>>>>>>>>>>>>>>>>     "্প": " ",
>>>>>>>>>>>>>>>>>>>>>     " য": "র্য",
>>>>>>>>>>>>>>>>>>>>>     "য": "য",
>>>>>>>>>>>>>>>>>>>>>     " া": "া",
>>>>>>>>>>>>>>>>>>>>>     "আা": "আ",
>>>>>>>>>>>>>>>>>>>>>     "ম্ি": "মি",
>>>>>>>>>>>>>>>>>>>>>     "স্ু": "সু",
>>>>>>>>>>>>>>>>>>>>>     "হূ ": "হূ",
>>>>>>>>>>>>>>>>>>>>>     " ণ": "ণ",
>>>>>>>>>>>>>>>>>>>>>     "র্্": "র",
>>>>>>>>>>>>>>>>>>>>>     "চিন্ত ": "চিন্তা ",
>>>>>>>>>>>>>>>>>>>>>     "ন্া": "না",
>>>>>>>>>>>>>>>>>>>>>     "সম ূর্ন": "সম্পূর্ণ",
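A minimal sketch of how such a post-OCR replacement map can be applied; the entries below are a hypothetical subset of the full map above:

```python
# Hypothetical subset of the post-OCR replacement map.
replacements = {
    "সম ূর্ন": "সম্পূর্ণ",
    "আা": "আ",
    "  ": " ",
}

def clean_ocr_text(text):
    # Apply longer patterns first so multi-character fixes are not
    # pre-empted by shorter ones (e.g. the double-space rule).
    for wrong, right in sorted(replacements.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(wrong, right)
    return text
```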
>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The problem with regex is that Tesseract is not consistent in its replacements.
>>>>>>>>>>>>>>>>>>>>>> Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing?
>>>>>>>>>>>>>>>>>>>>>> In some cases, it replaces it with closely similar letters such as /v/ and /w/; in other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application. After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. That can fix some major problems.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The characters are getting missed even after fine-tuning.
>>>>>>>>>>>>>>>>>>>>>>>> I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM 
>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> EasyOCR, I think, is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't actually used EasyOCR; you can try it.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I have added dictionary words, but the result is the same.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> What kind of problems did you face fine-tuning the few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?
>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm 
>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM 
>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali? Mine was nearly good, but I faced spacing problems between words: some words get spaces, but most of them have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract that is used for ben, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can explore with different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm 
>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run the OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, he doesn't mention all the fonts but only one font. That's why he didn't use MODEL_NAME in a separate script file, I think.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean the eng, ben, or oro flag, i.e. the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.
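A small sketch of the file-naming convention being described; the language code, font, and line number here are illustrative:

```python
import pathlib

# Each generated ground-truth file name starts with the MODEL_NAME /
# language-code stem of the training text file.
training_text_file = 'langdata/ben.training_text'   # illustrative path
font_name = 'Sagar Medium'
line_index = 5

stem = pathlib.Path(training_text_file).stem        # 'ben'
file_base_name = f"{stem}_{line_index}_{font_name.replace(' ', '_')}"
# 'ben_5_Sagar_Medium', used for the .gt.txt, .tif, and .box files
```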
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful; it is all working for me.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does it mean that my model doesn't train the fonts (even if the fonts have been included in the splitting process, in the other script)?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. This is the training script, which I have named 'tesseract_training.py', inside the tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file is in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your Linux fonts folder (/usr/share/fonts/), then run: sudo apt update, then sudo fc-cache -fv
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> After that, you have to add the exact font names in the FontList.py file as I did.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have attached two screenshots of my folder structure: the first shows the main structure and the second the expanded tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you so much for putting out these 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> brilliant scripts. They make the process  much 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more efficient.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have one more question on the other 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> script that you use to train. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in the same/root directory?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> How do you set up the names of the fonts in that file, if you don't mind sharing it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and if VS Code closes or anything else happens while creating the tif, gt.txt, and .box files, you can use the --start/--end checkpoints to resume from where it stopped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Command for creating the tif, gt.txt, and .box files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then create a file called "FontList.py" in the root directory and paste the following.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It is then imported by the script above.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The breakpoint/resume command:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Change --start 0 --end 11 according to your checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And you already know about the training checkpoints.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1:22:34 am UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi mdalihu,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The script you posted here seems much more extensive than the one you posted before:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have been using your earlier script. It is magical. How is this one different from the earlier one?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 11:00:49 PM UTC+3 mdalihu...@gmail.com 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
OK, I will try as you said.
One more thing: what role does the length of the training_text lines play? I have seen that Bengali text has long lines of many words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

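On the --xsize question, here is a rough, illustrative way to sanity-check whether the longest training line will fit in a given --xsize. The 0.6 per-character width factor and 300 DPI below are assumptions of mine, not text2image defaults, and Bengali conjuncts vary widely in width, so measuring actual rendered output is more reliable:

```python
# Rough sanity check: will the longest training line fit within --xsize?
# The 0.6 average character-width factor is an assumption for illustration;
# Bengali conjuncts and matras vary, so verify against real rendered output.
def estimate_line_width_px(line, point_size=32, char_width_factor=0.6, dpi=300):
    width_pt = len(line) * point_size * char_width_factor  # width in points
    return int(width_pt * dpi / 72)  # convert points to pixels at the given DPI

# A 50-character line at these assumed settings would already
# overflow --xsize=3600.
print(estimate_line_width_px('a' * 50))  # → 4000
```

Lines whose estimate exceeds --xsize get clipped or wrapped by text2image, which corrupts the ground truth, so it is worth checking the longest lines of the training_text this way before a big batch.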
On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and the other required files. Everything is good, but the problem is that the default fonts trained before no longer convert text as well as they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code below so you can see what is going on.

*Code for creating the .tif, .gt.txt, and .box files:*

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Resume from the line count saved on disk
    else:
        line_count = start_line
    first_line = line_count

    if end_line is None:
        end_line_count = len(lines) - 1
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        # Render only the requested slice of lines for this batch
        for line in lines[first_line:end_line_count + 1]:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename (unique per font because line_count keeps growing)
            file_base_name = f'ben_{line_serial}'
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line count on disk

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list,
                         output_directory, args.start, args.end)

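After a batch like the above finishes, it can help to confirm that text2image actually produced a .tif for every .gt.txt the script wrote; a missing pair usually means the font failed to render that line. A small sketch (the helper name is mine, and the directory is whatever you pass as output_directory):

```python
import pathlib
import tempfile

def find_unpaired_gt(output_directory):
    """Return base names that have a .gt.txt but no matching .tif."""
    out = pathlib.Path(output_directory)
    gt_bases = {p.name[:-len('.gt.txt')] for p in out.glob('*.gt.txt')}
    tif_bases = {p.stem for p in out.glob('*.tif')}
    return sorted(gt_bases - tif_bases)

# Tiny demo on a throwaway directory: ben_0 is paired, ben_1 lacks its .tif.
with tempfile.TemporaryDirectory() as d:
    for name in ('ben_0.gt.txt', 'ben_0.tif', 'ben_1.gt.txt'):
        (pathlib.Path(d) / name).write_text('x')
    print(find_unpaired_gt(d))  # → ['ben_1']
```

Running this over the ground-truth directory before starting `make training` catches silent rendering failures early instead of partway through a long run.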
*And the training code:*

import subprocess

# List of model names to train (one run per entry)
font_names = ['ben']

for font in font_names:
    command = (
        f"TESSDATA_PREFIX=../tesseract/tessdata "
        f"make training MODEL_NAME={font} START_MODEL=ben "
        f"TESSDATA=../tesseract/tessdata "
        f"MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    )
    subprocess.run(command, shell=True)

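A variation on the loop above, offered only as a sketch: interpolating an f-string into shell=True works, but building the argument list explicitly and putting TESSDATA_PREFIX in the environment avoids shell-quoting surprises. The helper name is mine, not part of tesstrain:

```python
# Build the make invocation as an argument list; the Makefile variables are
# the same ones used in the shell=True version above.
def build_training_command(model_name, start_model='ben', max_iterations=10000):
    return [
        'make', 'training',
        f'MODEL_NAME={model_name}',
        f'START_MODEL={start_model}',
        'TESSDATA=../tesseract/tessdata',
        f'MAX_ITERATIONS={max_iterations}',
        'LANG_TYPE=Indic',
    ]

cmd = build_training_command('ben')
print(cmd[2])  # → MODEL_NAME=ben
# To run it, pass the environment separately instead of prefixing the string:
#   import os, subprocess
#   subprocess.run(cmd, env=dict(os.environ, TESSDATA_PREFIX='../tesseract/tessdata'))
```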
Any suggestions for identifying the source of the problem?
Thanks, everyone.

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.

