[tesseract-ocr] Re: trainning question

Des Bw Mon, 30 Oct 2023 08:04:28 -0700

Hi Ali, 
Do you think starting and stopping at a specific line would also be 
possible for the actual training, just as you have done for text2image?


Today, I was very surprised to find that tesseract always restarts from the 
beginning every time the process is interrupted. 

https://github.com/tesseract-ocr/tesseract/issues/3954

This is very bad: it can degrade the accuracy of the training, 
especially for larger data sets, because the training is 
essentially running only on some of the lines (the later text lines 
are ignored). 
So, suppose you have 800,000 text lines and you run your training step by step: 
Round 1: 10,000 iterations
Round 2: 100,000 iterations
Round 3: 400,000 iterations
Round 4: 400,000 iterations
Because every round starts again from the first line, you have basically used 
only the first 400,000 text lines. The other 400,000 text lines are never used 
for training. They are wasted. 
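
A rough sketch of that arithmetic in Python, under the assumptions described 
above: every round restarts at the first text line, and one iteration consumes 
one line. The function name lines_seen is made up for illustration, and the 
second round is read as 100,000 iterations, which is also an assumption.

```python
def lines_seen(total_lines, rounds):
    """Count distinct lines visited if every round restarts at line 1.

    Assumption: one text line per iteration, and no position is carried
    over from one round to the next.
    """
    # Each round covers lines 1..min(iterations, total_lines), so the union
    # of all rounds is just the longest round, capped by the data set size.
    return min(max(rounds), total_lines)


total = 800_000
rounds = [10_000, 100_000, 400_000, 400_000]  # iterations per round

seen = lines_seen(total, rounds)
print(f"lines seen:       {seen:,}")
print(f"lines never seen: {total - seen:,}")
```

Under these assumptions the total number of iterations (910,000) is larger 
than the data set, yet half of the lines are never visited.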

So, it would be great if we could have a similar Python script that can stop 
and resume the training. 
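
As a starting point, here is a minimal sketch of the "resume" half only. It 
picks the most recently modified checkpoint and passes it to lstmtraining with 
--continue_from, using the same flags as the command quoted below. The helper 
names (latest_checkpoint, build_resume_command, resume_training) and the 
data/<name>/checkpoints layout are assumptions for illustration. It does not 
change which text line training resumes from, and it assumes an initial 
training run has already written at least one *.checkpoint file.

```python
import subprocess
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the most recently modified *.checkpoint file, or None."""
    candidates = Path(checkpoint_dir).glob("*.checkpoint")
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def build_resume_command(model_dir, name, max_iterations):
    """Build the lstmtraining argument list for continuing from a checkpoint."""
    model_dir = Path(model_dir)
    checkpoint_dir = model_dir / "checkpoints"
    checkpoint = latest_checkpoint(checkpoint_dir)
    if checkpoint is None:
        # Nothing to resume from: the first training run has to happen first.
        raise FileNotFoundError(f"no *.checkpoint file in {checkpoint_dir}")
    return [
        "lstmtraining",
        "--continue_from", str(checkpoint),
        "--traineddata", str(model_dir / f"{name}.traineddata"),
        "--model_output", str(checkpoint_dir / name),
        "--train_listfile", str(model_dir / "list.train"),
        "--eval_listfile", str(model_dir / "list.eval"),
        "--max_iterations", str(max_iterations),
    ]


def resume_training(model_dir, name, max_iterations):
    """Run lstmtraining from the latest checkpoint; raises if it fails."""
    subprocess.run(build_resume_command(model_dir, name, max_iterations),
                   check=True)
```

From inside the tesstrain folder this would be called as 
resume_training("data/ben", "ben", 1000). As far as I understand, 
--max_iterations is a total that includes the iterations already stored in 
the checkpoint, so it has to be raised on each resume; please verify that 
against your own runs.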

On Monday, September 4, 2023 at 2:34:04 PM UTC+3 Des Bw wrote:

> Thank you man. This is very useful. 
>
> On Tuesday, July 25, 2023 at 12:01:20 PM UTC+3 [email protected] wrote:
>
>> Make sure the training script is run from inside the tesstrain folder. 
>> Run the first command to start training; if you are continuing from a 
>> checkpoint, run the command from the second post instead. 
>> On Tuesday, 25 July, 2023 at 2:48:25 pm UTC+6 Ali hussain wrote:
>>
>>> import subprocess
>>>
>>> # List of font names
>>> font_names = ['ben']
>>>
>>> for font in font_names:
>>>     # Resume training from a saved checkpoint instead of starting over.
>>>     command = (
>>>         "lstmtraining "
>>>         "--continue_from data/ben/checkpoints/ben_19.535_298_300.checkpoint "
>>>         "--traineddata data/ben/ben.traineddata "
>>>         "--model_output data/ben/checkpoints/ben "
>>>         "--train_listfile data/ben/list.train "
>>>         "--eval_listfile data/ben/list.eval "
>>>         "--max_iterations 1000"
>>>     )
>>>     subprocess.run(command, shell=True)
>>> I fixed the problem; this code works for me after adding the checkpoint.
>>> On Thursday, 20 July, 2023 at 7:51:43 am UTC+6 Ali hussain wrote:
>>>
>>>> I'm new to Tesseract and am trying to train my own fonts with Tesseract 
>>>> 5.3.2. If the electricity is cut off, or I close VS Code, or the training 
>>>> process is interrupted in some other way, and I then run the training 
>>>> command again, does it start from the beginning or from the point where 
>>>> it was cut off?
>>>>
>>>> I have already tested it, and every time it starts from the beginning. 
>>>> Is there any method to handle this problem? Training takes a lot of time, 
>>>> and it should not be necessary to start from the beginning every time. 
>>>> Or is this normal?
>>>>
>>>>
>>>> I use this script to create the text2image .tif files for multiple fonts 
>>>> in Tesseract 5.3.2: 
>>>>
>>>> import os
>>>> import random
>>>> import pathlib
>>>> import subprocess
>>>>
>>>> training_text_file = 'langdata/ben.training_text'
>>>> font_list = ['FL Badhon Ansari Rh. Unicode',
>>>>              'F Khairuddin Barbarusa Rah. Uni',
>>>>              'F Mahfuj Art Unicode Italic',
>>>>              'F Mahfuj Art Unicode',
>>>>              'FL Niribili Plain Unicode',
>>>>              'FL Niribili Plain Unicode Itali Italic'
>>>>              ]  # Add more fonts as needed
>>>>
>>>> lines = []
>>>>
>>>> with open(training_text_file, 'r') as input_file:
>>>>     for line in input_file.readlines():
>>>>         lines.append(line.strip())
>>>>
>>>> output_directory = 'tesstrain/data/ben-ground-truth'
>>>>
>>>> if not os.path.exists(output_directory):
>>>>     os.mkdir(output_directory)
>>>>
>>>> random.shuffle(lines)
>>>>
>>>> count = 100
>>>>
>>>> lines = lines[:count]
>>>>
>>>> line_count = 0
>>>> for line in lines:
>>>>     for font in font_list:
>>>>         training_text_file_name = pathlib.Path(training_text_file).stem
>>>>         line_training_text = os.path.join(
>>>>             output_directory,
>>>>             f'{training_text_file_name}_{line_count}.gt.txt')
>>>>         with open(line_training_text, 'w') as output_file:
>>>>             output_file.writelines([line])
>>>>
>>>>         file_base_name = f'ben_{line_count}'
>>>>
>>>>         subprocess.run([
>>>>             'text2image',
>>>>             f'--font={font}',
>>>>             f'--text={line_training_text}',
>>>>             f'--outputbase={output_directory}/{file_base_name}',
>>>>             '--max_pages=1',
>>>>             '--strip_unrenderable_words',
>>>>             '--leading=32',
>>>>             '--xsize=3600',
>>>>             '--ysize=350',
>>>>             '--char_spacing=1.0',
>>>>             '--exposure=0',
>>>>             '--unicharset_file=langdata/ben.unicharset'
>>>>         ])
>>>>
>>>>         line_count += 1
>>>>
>>>>
>>>>
>>>> And this script is for training:
>>>>
>>>> import subprocess
>>>>
>>>> # List of font names
>>>> font_names = ['ben']
>>>>
>>>> for font in font_names:
>>>>     # Start training through the tesstrain Makefile.
>>>>     command = (
>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>         f"MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata "
>>>>         "MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>     )
>>>>     subprocess.run(command, shell=True)
>>>
>>>
