I know what you mean, but in some cases it helps me. I have found that
specific characters and words are consistently misrecognized by Tesseract,
so I use these regex replacements to fix those characters and words when
they come out wrong.
Here is what I have done:
" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",
On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected]
wrote:
> The problem for regex is that Tesseract is not consistent in its
> replacements.
> Suppose the original English training data doesn't contain the letter
> /u/. What does Tesseract do when it encounters /u/ during actual processing?
> In some cases, it replaces it with a closely similar letter such as /v/ or
> /w/. In other cases, it removes it completely. That is what is happening
> in my case: those characters are sometimes removed entirely; other
> times, they are replaced by closely resembling characters. Because of this
> inconsistency, applying regex is very difficult.
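> One workaround I can imagine (just a sketch; the lexicon and confusion
> pairs below are illustrative assumptions, not real data): gate each
> candidate repair on a word list, so a replacement only fires when it
> produces a known word:

```python
# Sketch: dictionary-gated repair for inconsistent OCR substitutions.
# LEXICON and CONFUSIONS are illustrative assumptions for this example.
LEXICON = {"full", "fun", "sun"}
CONFUSIONS = {"v": "u", "w": "u"}  # letters OCR might emit instead of 'u'

def repair_word(word, lexicon=LEXICON, confusions=CONFUSIONS):
    if word in lexicon:
        return word                # already a known word: leave it alone
    for wrong, right in confusions.items():
        candidate = word.replace(wrong, right)
        if candidate in lexicon:
            return candidate       # repair only when it yields a known word
    return word                    # otherwise, don't guess
```

> This doesn't help when the character was dropped entirely, but it avoids
> over-correcting words that were already right.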
>
> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected]
> wrote:
>
>> If some specific characters or words are always missing from the OCR
>> result, then you can apply a regular-expressions step in
>> your application. After OCR, those specific characters or words are
>> replaced by the correct characters or words that you define in your
>> application with regular expressions. This can resolve some major problems.
>>
>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected]
>> wrote:
>>
>>> The characters are getting missed, even after fine-tuning.
>>> I never made any progress. I tried many different ways. Some specific
>>> characters are always missing from the OCR result.
>>>
>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3
>>> [email protected] wrote:
>>>
>>>> I think EasyOCR is best for images like ID cards; for document
>>>> images like books, Tesseract is better than EasyOCR. I haven't used
>>>> EasyOCR myself, but you can try it.
>>>>
>>>> I have added dictionary words, but the result is the same.
>>>>
>>>> What kind of problem did you face when fine-tuning the few new
>>>> characters, as you said (*"but, I failed in every possible way to
>>>> introduce a few new characters into the database"*)?
>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected]
>>>> wrote:
>>>>
>>>>> Yes, we are new to this. I find the instructions (the manual) very
>>>>> hard to follow. The video you linked above was really helpful for getting
>>>>> started. My plan at the beginning was to fine-tune the existing
>>>>> .traineddata, but I failed in every possible way to introduce a few new
>>>>> characters into the database. That is why I started from scratch.
>>>>>
>>>>> Sure, I will follow Lorenzo's suggestion: I will run more
>>>>> iterations and see if I can improve.
>>>>>
>>>>> Another area we need to explore is the use of dictionaries.
>>>>> Maybe adding millions of words to the dictionary could help Tesseract. I
>>>>> don't have millions of words, but I am looking into some corpora to get
>>>>> more words into the dictionary.
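>>>>> As a rough sketch of what I mean (the frequency threshold is an
>>>>> arbitrary assumption), turning corpus lines into a one-word-per-line
>>>>> list for a Tesseract wordlist could look like:

```python
# Sketch: extract a word list from corpus lines, suitable for writing out
# one word per line as a Tesseract wordlist. min_count is an arbitrary
# filter to drop rare, possibly noisy tokens.
from collections import Counter

def build_wordlist(corpus_lines, min_count=2):
    counts = Counter(word for line in corpus_lines for word in line.split())
    return sorted(word for word, count in counts.items() if count >= min_count)
```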
>>>>>
>>>>> If this all fails, EasyOCR (and probably other similar open-source
>>>>> packages) is probably our next option to try. Sure, sharing
>>>>> our experiences will be helpful. I will let you know if I make good
>>>>> progress in any of these options.
>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3
>>>>> [email protected] wrote:
>>>>>
>>>>>> How is your training going for Bengali? Mine was nearly good, but I
>>>>>> faced spacing problems between words: some words get a space, but most
>>>>>> of them have no space. I think the problem is in the dataset, but I used
>>>>>> the default Bengali (ben) training dataset from Tesseract, so I am
>>>>>> confused and have to explore more. By the way, you can try what Lorenzo
>>>>>> Blz said. Training from scratch really is harder than
>>>>>> fine-tuning, so you can explore different datasets. If you succeed,
>>>>>> please let me know how you did the whole process. I'm also new to
>>>>>> this field.
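>>>>>> A trivial sketch for measuring that missing-space symptom against a
>>>>>> ground-truth line (the function name is illustrative):

```python
# Sketch: count how many inter-word spaces were lost between a ground-truth
# line and its OCR result. A positive value means words were merged.
def space_loss(gt_line, ocr_line):
    return len(gt_line.split()) - len(ocr_line.split())
```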
>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6
>>>>>> [email protected] wrote:
>>>>>>
>>>>>>> How is your training going for Bengali?
>>>>>>> I have been trying to train from scratch. I made about 64,000 lines
>>>>>>> of text (which produced about 255,000 files in the end) and ran the
>>>>>>> training for 150,000 iterations, getting a 0.51 training error rate. I
>>>>>>> was hoping to get reasonable accuracy. Unfortunately, when I run the OCR
>>>>>>> using the resulting .traineddata, the accuracy is absolutely terrible.
>>>>>>> Do you think I made some mistakes, or is that an expected result?
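>>>>>>> To put a number on that gap, a rough character-error-rate check
>>>>>>> against a ground-truth line might look like this (an approximation
>>>>>>> using only the standard library, not a proper CER tool):

```python
# Sketch: approximate character error rate (CER) via difflib's matching
# blocks. This is a rough proxy for CER, not a true edit-distance metric.
import difflib

def approx_cer(reference, hypothesis):
    matcher = difflib.SequenceMatcher(None, reference, hypothesis)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / max(len(reference), 1)
```

>>>>>>> Comparing this page-level number against the training error rate
>>>>>>> can show how far the model is from its training data.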
>>>>>>>
>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3
>>>>>>> [email protected] wrote:
>>>>>>>
>>>>>>>> Yes, he doesn't mention all fonts, only one font. I think that is
>>>>>>>> why he didn't use *MODEL_NAME* in a separate *script* file.
>>>>>>>>
>>>>>>>> Here we actually train on all the *tif, gt.txt, and .box files*
>>>>>>>> created under *MODEL_NAME* (I mean the *eng, ben, or oro* language
>>>>>>>> code), because when we first create the *tif, gt.txt, and .box
>>>>>>>> files*, every filename starts with *MODEL_NAME*. This *MODEL_NAME* is
>>>>>>>> what we select in the training script for looping over each tif,
>>>>>>>> gt.txt, and .box file created with that *MODEL_NAME*.
>>>>>>>>
>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6
>>>>>>>> [email protected] wrote:
>>>>>>>>
>>>>>>>>> Yes, I am familiar with the video and have set up the folder
>>>>>>>>> structure as you did. Indeed, I have tried a number of fine-tuning
>>>>>>>>> runs with a single font following Gracia's video. But your script is
>>>>>>>>> much better because it supports multiple fonts. The whole improvement
>>>>>>>>> you made is brilliant and very useful. It is all working for me.
>>>>>>>>> The only part that I didn't understand is the trick you used in
>>>>>>>>> your tesseract_train.py script. You see, I have been doing exactly as
>>>>>>>>> you did, except for this script.
>>>>>>>>>
>>>>>>>>> The script seems to have the trick of sending/teaching each of
>>>>>>>>> the fonts (iteratively) into the model. The script I have been using
>>>>>>>>> (which I got from Garcia) doesn't mention fonts at all:
>>>>>>>>>
>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training
>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000*
>>>>>>>>> Does it mean that my model doesn't train on the fonts (even if the
>>>>>>>>> fonts have been included in the splitting process, in the other
>>>>>>>>> script)?
>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3
>>>>>>>>> [email protected] wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> import subprocess
>>>>>>>>>>
>>>>>>>>>> # List of font names
>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>
>>>>>>>>>> for font in font_names:
>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>
>>>>>>>>>> 1. This is the training script, which I have named
>>>>>>>>>> '*tesseract_training.py*' and placed inside the tesstrain folder.
>>>>>>>>>> 2. The root directory means your main training folder, which
>>>>>>>>>> contains the langdata, tesseract, and tesstrain folders. If you watch
>>>>>>>>>> this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will
>>>>>>>>>> understand the folder structure better. I only
>>>>>>>>>> created tesseract_training.py in the tesstrain folder for training;
>>>>>>>>>> the FontList.py file sits in the main path, alongside langdata,
>>>>>>>>>> tesseract, tesstrain, and split_training_text.py.
>>>>>>>>>> 3. First of all, you have to put all fonts in your Linux fonts
>>>>>>>>>> folder, /usr/share/fonts/, then run: sudo apt update, then sudo
>>>>>>>>>> fc-cache -fv
>>>>>>>>>>
>>>>>>>>>> After that, you have to add the exact font names to the FontList.py
>>>>>>>>>> file, like I did.
>>>>>>>>>> I have attached two pictures of my folder structure: the first is the
>>>>>>>>>> main structure, and the second is the expanded tesstrain folder.
>>>>>>>>>>
>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot
>>>>>>>>>> 2023-09-11 135014.png]
>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6
>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you so much for putting out these brilliant scripts. They
>>>>>>>>>>> make the process much more efficient.
>>>>>>>>>>>
>>>>>>>>>>> I have one more question on the other script that you use to
>>>>>>>>>>> train.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> import subprocess
>>>>>>>>>>>
>>>>>>>>>>> # List of font names
>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>
>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>
>>>>>>>>>>> Do you have the names of the fonts listed in a file in the
>>>>>>>>>>> same/root directory?
>>>>>>>>>>> How do you set up the names of the fonts in that file, if you
>>>>>>>>>>> don't mind sharing it?
>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3
>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You can use the new script below; it's better than the previous
>>>>>>>>>>>> two scripts. You can create the *tif, gt.txt, and .box files* with
>>>>>>>>>>>> multiple fonts, and it also supports a checkpoint: if VS Code
>>>>>>>>>>>> closes (or anything else interrupts it) while creating the *tif,
>>>>>>>>>>>> gt.txt, and .box files*, you can use the checkpoint to resume from
>>>>>>>>>>>> where it stopped.
>>>>>>>>>>>>
>>>>>>>>>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> import os
>>>>>>>>>>>> import random
>>>>>>>>>>>> import pathlib
>>>>>>>>>>>> import subprocess
>>>>>>>>>>>> import argparse
>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>
>>>>>>>>>>>> def create_training_data(training_text_file, font_list,
>>>>>>>>>>>>                          output_directory, start_line=None,
>>>>>>>>>>>>                          end_line=None):
>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>
>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>
>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>
>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>
>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>
>>>>>>>>>>>>             training_text_file_name = pathlib.Path(
>>>>>>>>>>>>                 training_text_file).stem
>>>>>>>>>>>>
>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>
>>>>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>>>>                 output_directory,
>>>>>>>>>>>>                 f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>
>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>
>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>             ])
>>>>>>>>>>>>
>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>
>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>
>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>
>>>>>>>>>>>>     create_training_data(training_text_file, font_list,
>>>>>>>>>>>>                          output_directory, args.start, args.end)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Then create a file called "FontList.py" in the root directory and
>>>>>>>>>>>> paste in the following:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>         ]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This class is then imported by the script above.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Checkpoint command:*
>>>>>>>>>>>>
>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>
>>>>>>>>>>>> Change the checkpoint range (--start 0 --end 11) as needed.
>>>>>>>>>>>>
>>>>>>>>>>>> *The training checkpoint works as you already know.*
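>>>>>>>>>>>> If you lose track of the checkpoint, one option (a sketch; it
>>>>>>>>>>>> assumes the '<name>_<serial>_<font>.gt.txt' filename pattern from
>>>>>>>>>>>> the script above) is to infer the next --start value from the
>>>>>>>>>>>> files already on disk:

```python
# Sketch: infer the next --start value from ground-truth files already
# written, assuming filenames like 'eng_42_Some_Font.gt.txt' where the
# number between underscores is the line serial.
import re
from pathlib import Path

def next_start_line(output_directory):
    serials = []
    for path in Path(output_directory).glob("*.gt.txt"):
        match = re.search(r"_(\d+)_", path.name)
        if match:
            serials.append(int(match.group(1)))
    return max(serials) + 1 if serials else 0
```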
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6
>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi mhalidu,
>>>>>>>>>>>>> the script you posted here seems much more extensive than the one
>>>>>>>>>>>>> you posted before:
>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been using your earlier script. It is magical. How is
>>>>>>>>>>>>> this one different from the earlier one?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have saved
>>>>>>>>>>>>> me countless hours by running multiple fonts in one sweep. I was
>>>>>>>>>>>>> not able to find any instructions on how to train for multiple
>>>>>>>>>>>>> fonts, and the official manual is also unclear. Your script helped
>>>>>>>>>>>>> me get started.
>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3
>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, I will try what you said.
>>>>>>>>>>>>>> One more thing: what role does the length of the training_text
>>>>>>>>>>>>>> lines play? I have seen that Bengali texts have long lines of
>>>>>>>>>>>>>> words, so I want to know how many words or characters per line
>>>>>>>>>>>>>> is the better choice for training.
>>>>>>>>>>>>>> Also, should '--xsize=3600' and '--ysize=350' be set according
>>>>>>>>>>>>>> to the number of words per line?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Include the default fonts also in your fine-tuning list of
>>>>>>>>>>>>>>> fonts and see if that helps.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have trained some new fonts with the fine-tuning method for
>>>>>>>>>>>>>>>> the Bengali language in Tesseract 5, using the official
>>>>>>>>>>>>>>>> training_text, tessdata_best, and everything else as well.
>>>>>>>>>>>>>>>> Everything is good, but the problem is that the default font
>>>>>>>>>>>>>>>> that was trained before no longer converts text as it did
>>>>>>>>>>>>>>>> previously, while my new fonts work well. I don't understand
>>>>>>>>>>>>>>>> why this is happening. I am sharing the code below to help
>>>>>>>>>>>>>>>> show what is going on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list,
>>>>>>>>>>>>>>>>                          output_directory, start_line=None,
>>>>>>>>>>>>>>>>                          end_line=None):
>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>         # Set the starting line_count from the file
>>>>>>>>>>>>>>>>         line_count = read_line_count()
>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>         # Set the ending line_count
>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1
>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>>     for font in font_list.fonts:
>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(
>>>>>>>>>>>>>>>>                 training_text_file).stem
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>>>>>>>>                 output_directory,
>>>>>>>>>>>>>>>>                 f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             # Image filename (unique for each font)
>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     # Update the line_count in the file
>>>>>>>>>>>>>>>>     write_line_count(line_count)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list,
>>>>>>>>>>>>>>>>                          output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *and the training code:*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>