I now get to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.
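For context, the post-OCR substitution approach discussed further down this thread can be sketched in a few lines of Python. This is only a minimal sketch: the two mapping entries are taken from the correction table posted later in the thread, and a real table would be much larger.

```python
# Minimal sketch of the post-OCR replacement approach discussed in this
# thread: map misrecognized sequences to their corrections and apply
# them to Tesseract's output. The two entries below come from the
# mapping posted later in the thread; the rest is project-specific.
corrections = {
    "চিন্ত ": "চিন্তা ",
    "সম ূর্ন": "সম্পূর্ণ",
}

def fix_ocr_text(text: str) -> str:
    # Longest keys first, so a short substring never clobbers a longer match.
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text
```

As the discussion below notes, this only works when Tesseract's mistakes are consistent; when the same character is sometimes dropped and sometimes substituted, a fixed table cannot catch every case.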
On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote:

After Tesseract recognizes text from images, you can then apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently misrecognized by Tesseract, so I use these regex rules to replace those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacement. Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it meets /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case. Those characters are sometimes completely removed; other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application: after OCR, those specific characters or words are replaced by the correct characters or words you defined. That can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com wrote:

The characters are getting missed even after fine-tuning. I never made any progress, and I tried many different ways. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote:

EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning the few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata.
But I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries, actually. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have no space. I think the problem is in the dataset, but I used the default Tesseract training dataset for Bengali (ben), so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can explore different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 mdalihu...@gmail.com wrote:

Yes, he doesn't mention all the fonts, only one font; that is why I think he didn't use MODEL_NAME in a separate script file.

Actually, here we teach all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. That MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.
The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names in the FontList.py file, like me.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[Screenshots: Screenshot 2023-09-11 134947.png, Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while creating the tif, gt.txt, and .box files, you can use the checkpoint to resume from where you stopped.

Script for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this in (note that every font name needs a trailing comma):

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the script above.

Breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed.

And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 desal...@gmail.com wrote:

Hi mdalihu,
the script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.
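One practical pitfall with the FontList.py approach above: if a listed font name doesn't exactly match what fontconfig reports, text2image may fall back to a different face without an obvious error. A small helper can cross-check the list. This is a sketch with assumptions: text2image is on PATH, and its --list_available_fonts output is one "index: font name" line per font (the exact format may vary across Tesseract versions).

```python
# Sketch: cross-check FontList.py entries against the fonts that
# text2image can actually see. Assumptions: text2image is on PATH and
# --list_available_fonts prints one "index: font name" line per font.
import subprocess

def parse_font_list(output: str) -> list:
    """Extract font names from "index: name" lines."""
    names = []
    for line in output.splitlines():
        if ":" in line:
            names.append(line.split(":", 1)[1].strip())
    return names

def missing_fonts(wanted, fonts_dir="/usr/share/fonts"):
    """Return the names in `wanted` that text2image does not report."""
    result = subprocess.run(
        ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"],
        capture_output=True, text=True,
    )
    available = set(parse_font_list(result.stdout))
    return [name for name in wanted if name not in available]
```

Running missing_fonts(FontList().fonts) before a long generation run would flag any name that needs correcting.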
On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line is the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and the other pieces as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as well as they did, while my new fonts work well. I don't understand why this is happening. I am sharing the code to help figure out what is going on.
Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Resume from the saved line count
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        for line in lines[:end_line_count + 1]:  # honor the --end bound
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (ground truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename, unique for each line/font pair
            file_base_name = f'ben_{line_serial}'
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1

    write_line_count(line_count)  # Save the line_count for resuming later

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
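The complaints in this thread ("absolutely terrible", "does not convert text like before") are hard to compare without a number. A rough character error rate between a ground-truth line and the OCR output can be computed with only the standard library; this is a minimal sketch, not part of the thread's scripts, and a proper evaluation would use tesstrain's lstmeval on a held-out eval set.

```python
# Minimal sketch: rough character error rate (CER) between a ground-truth
# string and OCR output, using difflib from the standard library.
import difflib

def char_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Return 1 - (matched characters / ground-truth length)."""
    matcher = difflib.SequenceMatcher(None, ground_truth, ocr_output)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(ground_truth), 1)
```

Comparing the CER of the base ben.traineddata against a fine-tuned checkpoint on the same page images would make the regression described in the original post measurable.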