This is the code I used to train from a layer:
*make training MODEL_NAME=amh START_MODEL=amh APPEND_INDEX=5 
NET_SPEC='[Lfx256 O1c105]' TESSDATA=../tesseract/tessdata EPOCHS=3 
TARGET_ERROR_RATE=0.0001 training >> data/amh.log &*
- I took it from Shree's *tesstrain-JSTORArabic* training: 
https://github.com/Shreeshrii/tesstrain-JSTORArabic

- The net_spec of ben might not be the same as amh's. Shreeshrii has shared a 
link to the netspecs of various languages in this forum.

On Sunday, October 22, 2023 at 12:09:25 PM UTC+3 mdalihu...@gmail.com wrote:

> You can test by changing '--char_spacing=1.0'. I think it could also be 
> causing accuracy problems.
> On Sunday, 22 October, 2023 at 3:07:16 pm UTC+6 Ali hussain wrote:
>
>> I haven't tried cutting the top layer of the network. Can you share what 
>> you did when cutting the top layer of the network, or a GitHub project 
>> link?
>> On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6 desal...@gmail.com 
>> wrote:
>>
>>> That is massive data. Have you tried training by cutting the top layer of 
>>> the network?
>>> I think that is the most promising approach. I was getting really good 
>>> results with that, but the results are not translating to scanned 
>>> documents. I get the best results with the synthetic data. I am now 
>>> experimenting with the settings in text2image to see if it is possible to 
>>> emulate scanned documents.
>>> I also suspect that the setting '--char_spacing=1.0' in our setup is 
>>> causing more trouble. Scanned documents come with character spacing close 
>>> to zero. If you are planning to train more, try removing this parameter.
>>>
>>> On Sunday, October 22, 2023 at 4:09:46 AM UTC+3 mdalihu...@gmail.com 
>>> wrote:
>>>
>>>> 600000 lines of text, and the iterations were higher than 600000. But 
>>>> sometimes I got better results with fewer iterations when fine-tuning, 
>>>> like 100000 lines of text and only 5000 to 10000 iterations.
>>>> On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 desal...@gmail.com 
>>>> wrote:
>>>>
>>>>> How many lines of text and iterations did you use?
>>>>>
>>>>> On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:
>>>>>
>>>>>> Yeah, that is what I am getting as well. I was able to add the missing 
>>>>>> letter, but the overall accuracy became lower than the default model.
>>>>>>
>>>>>> On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 
>>>>>> mdalihu...@gmail.com wrote:
>>>>>>
>>>>>>> Not good results; that's why I have stopped training for now. The 
>>>>>>> default traineddata is better overall than training from scratch.
>>>>>>> On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 
>>>>>>> desal...@gmail.com wrote:
>>>>>>>
>>>>>>>> Hi Ali, 
>>>>>>>> How is your training going?
>>>>>>>> Do you get good results with the training-from-the-scratch?
>>>>>>>>
>>>>>>>> On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, I saw that two months ago when I started to learn OCR. It was 
>>>>>>>>> very helpful at the beginning.
>>>>>>>>> On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 
>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> Just saw this paper: https://osf.io/b8h7q
>>>>>>>>>>
>>>>>>>>>> On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 
>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> I will try some changes. thx
>>>>>>>>>>>
>>>>>>>>>>> On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 
>>>>>>>>>>> elvi...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I also faced that issue on Windows. Apparently, the issue is 
>>>>>>>>>>>> related to Unicode. You can try your luck by changing the plain 
>>>>>>>>>>>> "r" open mode to an explicit UTF-8 encoding in the script.
>>>>>>>>>>>> I ended up installing Ubuntu because I was having too many 
>>>>>>>>>>>> errors on Windows.
>>>>>>>>>>>>
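The encoding fix described above can be sketched as a minimal, self-contained example (the file name and sample text are illustrative, not from the original script):

```python
import os
import tempfile

# Write a sample training-text file containing non-ASCII characters,
# standing in for a real training text (name and content illustrative).
path = os.path.join(tempfile.mkdtemp(), "sample.training_text")
with open(path, "w", encoding="utf-8") as f:
    f.write("পরীক্ষা\n")

# A bare open(path, "r") uses the platform default encoding, which on
# Windows is often cp1252 and can raise UnicodeDecodeError on this text.
# Passing encoding="utf-8" explicitly reads it correctly everywhere.
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()
```

The same one-argument change applies to every `open()` call in the generation script.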
>>>>>>>>>>>> On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Have you faced this error: "Can't encode transcription"? If you 
>>>>>>>>>>>>> have, how did you solve it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 
>>>>>>>>>>>>> elvi...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was using my own text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 14, 2023, 6:58 AM Ali hussain <
>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you training from Tesseract's default text data or your 
>>>>>>>>>>>>>>> own collected text data?
>>>>>>>>>>>>>>> On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 
>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have now reached 200000 iterations, and the error rate is 
>>>>>>>>>>>>>>>> stuck at 0.46. The result is absolute trash: nowhere close 
>>>>>>>>>>>>>>>> to the default/Ray's training.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 
>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After Tesseract recognizes text from the images, you can 
>>>>>>>>>>>>>>>>> apply regex to replace the wrong words with the correct 
>>>>>>>>>>>>>>>>> ones. I'm not familiar with PaddleOCR or ScanTailor either.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 
>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> At what stage are you doing the regex replacement?
>>>>>>>>>>>>>>>>>> My process has been: Scan (tif)--> ScanTailor --> 
>>>>>>>>>>>>>>>>>> Tesseract --> pdf
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > I think EasyOCR is best for images like ID cards, but 
>>>>>>>>>>>>>>>>>> > for document images like books, Tesseract is better than 
>>>>>>>>>>>>>>>>>> > EasyOCR.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> How about PaddleOCR? Are you familiar with it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I know what you mean, but in some cases it helps me. I 
>>>>>>>>>>>>>>>>>>> have found that specific characters and words are never 
>>>>>>>>>>>>>>>>>>> recognized by Tesseract. That's why I use these regexes 
>>>>>>>>>>>>>>>>>>> to replace those characters and words when they come out 
>>>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> See what I have done:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     " ী": "ী",
>>>>>>>>>>>>>>>>>>>     " ্": " ",
>>>>>>>>>>>>>>>>>>>     " ে": " ",
>>>>>>>>>>>>>>>>>>>     "জ্া": "জা",
>>>>>>>>>>>>>>>>>>>     "  ": " ",
>>>>>>>>>>>>>>>>>>>     "   ": " ",
>>>>>>>>>>>>>>>>>>>     "    ": " ",
>>>>>>>>>>>>>>>>>>>     "্প": " ",
>>>>>>>>>>>>>>>>>>>     " য": "র্য",
>>>>>>>>>>>>>>>>>>>     "য": "য",
>>>>>>>>>>>>>>>>>>>     " া": "া",
>>>>>>>>>>>>>>>>>>>     "আা": "আ",
>>>>>>>>>>>>>>>>>>>     "ম্ি": "মি",
>>>>>>>>>>>>>>>>>>>     "স্ু": "সু",
>>>>>>>>>>>>>>>>>>>     "হূ ": "হূ",
>>>>>>>>>>>>>>>>>>>     " ণ": "ণ",
>>>>>>>>>>>>>>>>>>>     "র্্": "র",
>>>>>>>>>>>>>>>>>>>     "চিন্ত ": "চিন্তা ",
>>>>>>>>>>>>>>>>>>>     "ন্া": "না",
>>>>>>>>>>>>>>>>>>>     "সম ূর্ন": "সম্পূর্ণ",
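A minimal sketch of applying such a replacement table to OCR output (the entries below are a small illustrative subset, not the full table above):

```python
# A small illustrative subset of the replacement table above, applied
# to OCR output as a post-processing pass. Longer keys are replaced
# first so multi-character fixes are not clobbered by shorter ones.
REPLACEMENTS = {
    "সম ূর্ন": "সম্পূর্ণ",  # broken conjunct restored to the correct word
    "আা": "আ",            # spurious vowel sign dropped
}

def clean_ocr_text(text: str) -> str:
    for wrong in sorted(REPLACEMENTS, key=len, reverse=True):
        text = text.replace(wrong, REPLACEMENTS[wrong])
    # collapse any runs of spaces left behind by the replacements
    while "  " in text:
        text = text.replace("  ", " ")
    return text
```

Plain `str.replace` is enough here since the keys are literal strings; the `re` module is only needed if the patterns contain wildcards.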
>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 
>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The problem with regex is that Tesseract is not 
>>>>>>>>>>>>>>>>>>>> consistent in its substitutions.
>>>>>>>>>>>>>>>>>>>> Suppose the original English training data doesn't 
>>>>>>>>>>>>>>>>>>>> contain the letter /u/. What does Tesseract do when it 
>>>>>>>>>>>>>>>>>>>> encounters /u/ in actual processing?
>>>>>>>>>>>>>>>>>>>> In some cases, it replaces it with closely similar 
>>>>>>>>>>>>>>>>>>>> letters such as /v/ and /w/. In other cases, it removes 
>>>>>>>>>>>>>>>>>>>> it completely. That is what is happening in my case: 
>>>>>>>>>>>>>>>>>>>> those characters are sometimes completely removed; other 
>>>>>>>>>>>>>>>>>>>> times, they are replaced by closely resembling 
>>>>>>>>>>>>>>>>>>>> characters. Because of this inconsistency, applying 
>>>>>>>>>>>>>>>>>>>> regex is very difficult.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If some specific characters or words are always 
>>>>>>>>>>>>>>>>>>>>> missing from the OCR result, then you can apply 
>>>>>>>>>>>>>>>>>>>>> regular-expression logic in your application. After 
>>>>>>>>>>>>>>>>>>>>> OCR, those specific characters or words are replaced 
>>>>>>>>>>>>>>>>>>>>> by the correct characters or words you defined in your 
>>>>>>>>>>>>>>>>>>>>> application with regular expressions. It can work 
>>>>>>>>>>>>>>>>>>>>> around some major problems.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The characters are getting missed, even after 
>>>>>>>>>>>>>>>>>>>>>> fine-tuning. 
>>>>>>>>>>>>>>>>>>>>>> I never made any progress. I tried many different 
>>>>>>>>>>>>>>>>>>>>>> ways. Some  specific characters are always missing from 
>>>>>>>>>>>>>>>>>>>>>> the OCR result.  
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I think EasyOCR is best for images like ID cards, 
>>>>>>>>>>>>>>>>>>>>>>> but for document images like books, Tesseract is 
>>>>>>>>>>>>>>>>>>>>>>> better than EasyOCR. I haven't actually used EasyOCR 
>>>>>>>>>>>>>>>>>>>>>>> myself; you can try it.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I have added dictionary words, but the result is the 
>>>>>>>>>>>>>>>>>>>>>>> same.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> What kind of problem did you face when fine-tuning 
>>>>>>>>>>>>>>>>>>>>>>> with a few new characters, as you said (*but, I 
>>>>>>>>>>>>>>>>>>>>>>> failed in every possible way to introduce a few new 
>>>>>>>>>>>>>>>>>>>>>>> characters into the database*)?
>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yes, we are new to this. I find the instructions 
>>>>>>>>>>>>>>>>>>>>>>>> (the manual) very hard to follow. The video you 
>>>>>>>>>>>>>>>>>>>>>>>> linked above was really helpful for getting 
>>>>>>>>>>>>>>>>>>>>>>>> started. My plan at the beginning was to fine-tune 
>>>>>>>>>>>>>>>>>>>>>>>> the existing .traineddata, but I failed in every 
>>>>>>>>>>>>>>>>>>>>>>>> possible way to introduce a few new characters into 
>>>>>>>>>>>>>>>>>>>>>>>> the database. That is why I started from scratch.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will 
>>>>>>>>>>>>>>>>>>>>>>>> run more iterations and see if I can improve.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Another area we need to explore is the usage of 
>>>>>>>>>>>>>>>>>>>>>>>> dictionaries. Maybe adding millions of words to the 
>>>>>>>>>>>>>>>>>>>>>>>> dictionary could help Tesseract. I don't have 
>>>>>>>>>>>>>>>>>>>>>>>> millions of words, but I am looking into some 
>>>>>>>>>>>>>>>>>>>>>>>> corpora to get more words into the dictionary.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If this all fails, EasyOCR (and probably other 
>>>>>>>>>>>>>>>>>>>>>>>> similar open-source packages) is probably our next 
>>>>>>>>>>>>>>>>>>>>>>>> option to try. Sure, sharing our experiences will 
>>>>>>>>>>>>>>>>>>>>>>>> be helpful. I will let you know if I make good 
>>>>>>>>>>>>>>>>>>>>>>>> progress with any of these options.
>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM 
>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali? It was 
>>>>>>>>>>>>>>>>>>>>>>>>> nearly good, but I faced spacing problems between 
>>>>>>>>>>>>>>>>>>>>>>>>> words: some words have spaces, but most of them 
>>>>>>>>>>>>>>>>>>>>>>>>> have none. I think the problem is in the dataset, 
>>>>>>>>>>>>>>>>>>>>>>>>> but I used the default training dataset from 
>>>>>>>>>>>>>>>>>>>>>>>>> Tesseract that is used for ben, so I am confused 
>>>>>>>>>>>>>>>>>>>>>>>>> and have to explore more. By the way, you can try 
>>>>>>>>>>>>>>>>>>>>>>>>> what Lorenzo Blz said. Actually, training from 
>>>>>>>>>>>>>>>>>>>>>>>>> scratch is harder than fine-tuning, so you can 
>>>>>>>>>>>>>>>>>>>>>>>>> explore different datasets. If you succeed, please 
>>>>>>>>>>>>>>>>>>>>>>>>> let me know how you did the whole process. I'm 
>>>>>>>>>>>>>>>>>>>>>>>>> also new to this field.
>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm 
>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>>>>>>>>>>>>>>>>> I have been trying to train from scratch. I made 
>>>>>>>>>>>>>>>>>>>>>>>>>> about 64,000 lines of text (which produced about 
>>>>>>>>>>>>>>>>>>>>>>>>>> 255,000 files in the end) and ran the training 
>>>>>>>>>>>>>>>>>>>>>>>>>> for 150,000 iterations, getting a 0.51 training 
>>>>>>>>>>>>>>>>>>>>>>>>>> error rate. I was hoping to get reasonable 
>>>>>>>>>>>>>>>>>>>>>>>>>> accuracy. Unfortunately, when I run the OCR using 
>>>>>>>>>>>>>>>>>>>>>>>>>> the resulting .traineddata, the accuracy is 
>>>>>>>>>>>>>>>>>>>>>>>>>> absolutely terrible. Do you think I made some 
>>>>>>>>>>>>>>>>>>>>>>>>>> mistakes, or is that an expected result?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM 
>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, he doesn't mention all fonts, only one 
>>>>>>>>>>>>>>>>>>>>>>>>>>> font. That's why he didn't use *MODEL_NAME* in a 
>>>>>>>>>>>>>>>>>>>>>>>>>>> separate script file, I think.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, here we feed in all the *tif, .gt.txt, 
>>>>>>>>>>>>>>>>>>>>>>>>>>> and .box files* that were created with 
>>>>>>>>>>>>>>>>>>>>>>>>>>> *MODEL_NAME* (I mean the *eng, ben, oro* 
>>>>>>>>>>>>>>>>>>>>>>>>>>> language code), because when we first create the 
>>>>>>>>>>>>>>>>>>>>>>>>>>> *tif, .gt.txt, and .box files*, every filename 
>>>>>>>>>>>>>>>>>>>>>>>>>>> starts with *MODEL_NAME*. This *MODEL_NAME* is 
>>>>>>>>>>>>>>>>>>>>>>>>>>> selected in the training script to loop over 
>>>>>>>>>>>>>>>>>>>>>>>>>>> each tif, .gt.txt, and .box file created with 
>>>>>>>>>>>>>>>>>>>>>>>>>>> that *MODEL_NAME*.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
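The filename convention described above (every generated file starting with the MODEL_NAME language code) means a training script can pick out one language's ground-truth files with a simple glob. A self-contained sketch with illustrative file and font names:

```python
import pathlib
import tempfile

model_name = "ben"
root = pathlib.Path(tempfile.mkdtemp())

# Simulate files produced by the generation script: each page gets a
# .tif/.gt.txt/.box triple whose name starts with the language code.
for stem in (f"{model_name}_0_Kalpurush", f"{model_name}_1_Kalpurush", "eng_0_Arial"):
    for ext in (".tif", ".gt.txt", ".box"):
        (root / f"{stem}{ext}").touch()

# Select only this MODEL_NAME's pages, as the training loop would;
# files belonging to other language codes are skipped.
tifs = sorted(p.name for p in root.glob(f"{model_name}_*.tif"))
```

This is only an illustration of the naming scheme; the actual selection in tesstrain is done by its Makefile.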
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm 
>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I am familiar with the video and have set 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> up the folder structure as you did. Indeed, I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> have tried a number of fine-tunings with a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> single font following Garcia's video. But your 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> script is much better because it supports 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple fonts. The whole improvement you made 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> is brilliant and very useful. It is all working 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for me.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The only part that I didn't understand is the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> trick you used in your tesseract_train.py 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> script. You see, I have been doing exactly what 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> you did, except for this script.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The script seems to have the trick of feeding 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> each of the fonts (iteratively) into the model. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The script I have been using (which I got from 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Garcia) doesn't mention fonts at all.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> training MODEL_NAME=oro 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> TESSDATA=../tesseract/tessdata 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> MAX_ITERATIONS=10000*
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does it mean that my model doesn't train on the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> fonts (even if the fonts have been included in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the splitting process, in the other script)?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. This is the training command script; I have 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> named it 'tesseract_training.py' inside the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. The root directory means your main training 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> folder, which contains the langdata, tesseract, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and tesstrain folders. If you watch this 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tutorial 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=KE4xEzFGSU8 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you will understand the folder structure 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better. I only created tesseract_training.py 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the tesstrain folder for training; the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FontList.py file is on the main path alongside 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> langdata, tesseract, tesstrain, and 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> split_training_text.py.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in your Linux fonts folder /usr/share/fonts/, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> then run: sudo apt update, then sudo fc-cache 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -fv
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> After that, you have to add the exact font 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> names in the FontList.py file like I did.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have attached two pics of my folder 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> structure: the first is the main structure and 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the second is the expanded tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Screenshot 2023-09-11 135014.png] 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you so much for putting out these 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> brilliant scripts. They make the process  much 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more efficient.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have one more question on the other script 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that you use to train. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do you have the names of the fonts listed in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a file in the same/root directory?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> How do you set up the names of the fonts in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that file, if you don't mind sharing it?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You can use the new script below; it's 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better than the previous two scripts. You 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can create the *tif, .gt.txt, and .box 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> files* with multiple fonts, and it also 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> supports breakpoints: if VS Code closes or 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> anything happens while creating the files, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you can use the checkpoint to resume from 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> where you left off.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Command for the *tif, .gt.txt, and .box files*:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then create a file called "FontList.py" in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the root directory and paste this in:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It is then imported by the script above.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *Command to resume from a breakpoint:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Change the checkpoint range (--start 0 --end 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 11) according to where you stopped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *And you already know about the training 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> checkpoint.*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi mdalihu, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The script you posted here seems much more 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extensive than the one you posted before: 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have been using your earlier script. It 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is magical. How is this one different from 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the earlier one?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way. They have saved me countless hours by 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> running multiple fonts in one sweep. I was 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not able to find any instructions on how to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> train for multiple fonts, and the official 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manual is also unclear. Your script helped 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> me get started.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+3 mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more thing: what should the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> training_text lines be like? I have seen 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that Bengali texts have long lines of 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> words, so I want to know how many words or 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> characters per line would be the better 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> choice for training. And should 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--xsize=3600', '--ysize=350' be set 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> according to the number of words per line?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UTC+6 shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Include the default fonts also in your 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning list of fonts and see if that 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
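Shree's suggestion above amounts to extending the font list used for data generation so the stock model's fonts stay in the fine-tuning mix. A minimal sketch (all font names here are illustrative examples, not the fonts the official ben model was actually trained with):

```python
# Hedged sketch: include fonts the stock model already handles alongside
# the new fonts when generating fine-tuning data, so fine-tuning does
# not degrade recognition of the default fonts. Names are illustrative.
new_fonts = ["Gerlick", "Sagar Medium"]
default_fonts = ["Lohit Bengali", "Mukti Narrow"]  # stand-ins for stock fonts

# Append the defaults, skipping any already present in the new list.
fonts = new_fonts + [f for f in default_fonts if f not in new_fonts]
```

The combined list would then replace `self.fonts` in the FontList class used by the generation script.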
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have trained some new fonts with the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning method for the Bengali 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> language in Tesseract 5, using the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> official training_text, tessdata_best, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the other required files. Everything 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is good, but the problem is that the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> default fonts that were trained before 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> no longer convert text as well as they 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> used to, while my new fonts work well. I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don't understand why this is happening, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so I am sharing the code so you can see 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> what is going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *code for creating the .tif, .gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         lines = [line.strip() for line in input_file]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # resume the serial numbering from the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # default ending line_count
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # iterate over every font in the font_list
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in lines[:end_line_count + 1]:  # stop after the requested number of lines
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # serial number shared by the ground-truth and image files
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # GT (ground truth) text file
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.write(line)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # render the line image; line_count keeps incrementing across
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             # fonts, so every font gets its own range of file names
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # persist the counter for the next run
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     # create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *and the training code:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # list of model names to train
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> model_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for model_name in model_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         f"TESSDATA_PREFIX=../tesseract/tessdata "
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         f"make training MODEL_NAME={model_name} START_MODEL=ben "
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         f"TESSDATA=../tesseract/tessdata "
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         f"MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any suggestions for identifying where the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> problem comes from?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
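One place to start narrowing this down is the ground-truth directory itself: tesstrain pairs each line image with its transcript by base name, so a .tif without a matching .gt.txt (or vice versa) means that line silently drops out of, or breaks, the training set. A small diagnostic sketch (the helper function is hypothetical, not part of tesstrain; the directory path matches the script above):

```python
def unmatched_pairs(filenames):
    """Return (.tif bases missing a .gt.txt, .gt.txt bases missing a .tif)."""
    tifs = {f[:-len('.tif')] for f in filenames if f.endswith('.tif')}
    gts = {f[:-len('.gt.txt')] for f in filenames if f.endswith('.gt.txt')}
    return sorted(tifs - gts), sorted(gts - tifs)

# example: ben_1.tif has no transcript
print(unmatched_pairs(['ben_0.tif', 'ben_0.gt.txt', 'ben_1.tif']))
# -> (['ben_1'], [])
```

In practice one would run it over `os.listdir('tesstrain/data/ben-ground-truth')` and expect two empty lists back.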
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are subscribed to the Google Groups 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "tesseract-ocr" group.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> receiving emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
