Are you training from Tesseract's default text data or from text data you collected yourself?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 desal...@gmail.com wrote:
> I have now reached 200,000 iterations, and the error rate is stuck at 0.46.
> The result is absolutely trash: nowhere close to the default/Ray's training.
>
> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com
> wrote:
>
>> After Tesseract recognizes the text in the images, you can apply a regex
>> to replace each wrong word with the correct word.
>> I'm not familiar with PaddleOCR or ScanTailor either.
>>
>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 desal...@gmail.com
>> wrote:
>>
>>> At what stage are you doing the regex replacement?
>>> My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf
>>>
>>> > I think EasyOCR is best for ID cards and similar images, but for
>>> > document images such as books Tesseract is better than EasyOCR.
>>>
>>> How about PaddleOCR? Are you familiar with it?
>>>
>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3
>>> mdalihu...@gmail.com wrote:
>>>
>>>> I know what you mean, but in some cases it helps me. I have found that
>>>> specific characters and words are never recognized correctly by
>>>> Tesseract, so I use these regex replacements to fix those characters
>>>> and words when they come out wrong.
>>>>
>>>> See what I have done:
>>>>
>>>> " ী": "ী",
>>>> " ্": " ",
>>>> " ে": " ",
>>>> জ্া: "জা",
>>>> " ": " ",
>>>> " ": " ",
>>>> " ": " ",
>>>> "্প": " ",
>>>> " য": "র্য",
>>>> য: "য",
>>>> " া": "া",
>>>> আা: "আ",
>>>> ম্ি: "মি",
>>>> স্ু: "সু",
>>>> "হূ ": "হূ",
>>>> " ণ": "ণ",
>>>> র্্: "র",
>>>> "চিন্ত ": "চিন্তা ",
>>>> ন্া: "না",
>>>> "সম ূর্ন": "সম্পূর্ণ",
>>>>
>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com
>>>> wrote:
>>>>
>>>>> The problem with regex is that Tesseract is not consistent in its
>>>>> substitutions.
>>>>> Suppose the original English training data did not contain the
>>>>> letter /u/. What does Tesseract do when it encounters /u/ in actual
>>>>> processing?
>>>>> In some cases, it replaces it with closely similar letters such as /v/
>>>>> and /w/. In other cases, it removes it completely. That is what is
>>>>> happening in my case. Those characters are sometimes completely removed;
>>>>> other times, they are replaced by closely resembling characters. Because
>>>>> of this inconsistency, applying regex is very difficult.
>>>>>
>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3
>>>>> mdalihu...@gmail.com wrote:
>>>>>
>>>>>> If some specific characters or words are always missing from the OCR
>>>>>> result, you can apply regular-expression logic in your application.
>>>>>> After OCR, those specific characters or words are replaced by the
>>>>>> correct characters or words that you defined in your application.
>>>>>> It can solve some of the major problems.
>>>>>>
>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6
>>>>>> desal...@gmail.com wrote:
>>>>>>
>>>>>>> The characters are getting missed, even after fine-tuning.
>>>>>>> I never made any progress. I tried many different ways. Some
>>>>>>> specific characters are always missing from the OCR result.
>>>>>>>
>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3
>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>
>>>>>>>> I think EasyOCR is best for ID cards and similar images, but for
>>>>>>>> document images such as books Tesseract is better than EasyOCR.
>>>>>>>> I haven't actually used EasyOCR myself; you can try it.
>>>>>>>>
>>>>>>>> I have added dictionary words, but the result is the same.
>>>>>>>>
>>>>>>>> What kind of problem did you face when fine-tuning for a few new
>>>>>>>> characters, as you said (*But, I failed in every possible way to
>>>>>>>> introduce a few new characters into the database.*)?
>>>>>>>>
>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6
>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> Yes, we are new to this. I find the instructions (the manual) very
>>>>>>>>> hard to follow. The video you linked above was really helpful for
>>>>>>>>> getting started. My plan at the beginning was to fine-tune the
>>>>>>>>> existing .traineddata. But I failed in every possible way to
>>>>>>>>> introduce a few new characters into the database. That is why I
>>>>>>>>> started from scratch.
>>>>>>>>>
>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more
>>>>>>>>> iterations and see if I can improve.
>>>>>>>>>
>>>>>>>>> Another area we need to explore is the use of dictionaries.
>>>>>>>>> Maybe adding millions of words to the dictionary could help
>>>>>>>>> Tesseract. I don't have millions of words, but I am looking into
>>>>>>>>> some corpora to get more words into the dictionary.
>>>>>>>>>
>>>>>>>>> If all this fails, EasyOCR (and probably other similar open-source
>>>>>>>>> packages) is probably our next option to try. Sure, sharing our
>>>>>>>>> experiences will be helpful. I will let you know if I make good
>>>>>>>>> progress with any of these options.
>>>>>>>>>
>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3
>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>> It was nearly good, but I faced a spacing problem between words:
>>>>>>>>>> some words are separated by spaces, but most of them are not.
>>>>>>>>>> I think the problem is in the dataset, but I use the default
>>>>>>>>>> training dataset from Tesseract that is used for ben, so I am
>>>>>>>>>> confused and have to explore more. By the way, you can try what
>>>>>>>>>> Lorenzo Blz said. Training from scratch is actually harder than
>>>>>>>>>> fine-tuning, so you can explore with different datasets. If you
>>>>>>>>>> succeed, please let me know how you did the whole process.
>>>>>>>>>> I'm also new to this field.
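A minimal sketch of the post-OCR clean-up discussed in this thread: a table of recurring misrecognitions applied to Tesseract's output in a single pass. The helper names `build_pattern` and `apply_corrections` are hypothetical, and the two table entries are copied from the replacement table quoted in the thread. As pointed out in the thread, this only helps when the errors are consistent; a character that is sometimes dropped and sometimes replaced cannot be repaired this way.

```python
import re

# Correction table: recurring misrecognition -> intended text.
# These two entries come from the table quoted in the thread; a real
# table would be built from your own OCR output.
CORRECTIONS = {
    "আা": "আ",
    "ম্ি": "মি",
}


def build_pattern(corrections):
    """Compile one alternation over all keys (hypothetical helper)."""
    # Longest keys first, so a long key wins over a shorter key it contains.
    keys = sorted(corrections, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace every occurrence of a known wrong sequence in one pass."""
    if not corrections:
        return text
    pattern = build_pattern(corrections)
    return pattern.sub(lambda match: corrections[match.group(0)], text)
```

The function would run on the text Tesseract returns, before the text is written to the final file. Escaping each key with `re.escape` keeps the entries literal, so a key cannot be misread as a regex metacharacter.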
>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6
>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>> I have been trying to train from scratch. I made about 64,000
>>>>>>>>>>> lines of text (which produced about 255,000 files in the end)
>>>>>>>>>>> and ran the training for 150,000 iterations, getting a training
>>>>>>>>>>> error rate of 0.51. I was hoping to get reasonable accuracy.
>>>>>>>>>>> Unfortunately, when I run the OCR using the resulting
>>>>>>>>>>> .traineddata, the accuracy is absolutely terrible. Do you think
>>>>>>>>>>> I made some mistakes, or is that an expected result?
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3
>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, he doesn't mention all fonts, only one font. That is why,
>>>>>>>>>>>> I think, he didn't use *MODEL_NAME* in a separate script file.
>>>>>>>>>>>>
>>>>>>>>>>>> Actually, here we train on all the *tif, gt.txt, and .box files*
>>>>>>>>>>>> that were created under *MODEL_NAME* (I mean the eng, ben, or
>>>>>>>>>>>> oro flag or language code), because when we first create the
>>>>>>>>>>>> *tif, gt.txt, and .box files*, every file name starts with
>>>>>>>>>>>> *MODEL_NAME*. This is the *MODEL_NAME* we select in the training
>>>>>>>>>>>> script to loop over each of the tif, gt.txt, and .box files
>>>>>>>>>>>> created under that *MODEL_NAME*.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6
>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the folder
>>>>>>>>>>>>> structure as you did. Indeed, I have tried fine-tuning a number
>>>>>>>>>>>>> of times with a single font, following Garcia's video. But your
>>>>>>>>>>>>> script is much better because it supports multiple fonts. The
>>>>>>>>>>>>> whole improvement you made is brilliant and very useful. It is
>>>>>>>>>>>>> all working for me.
>>>>>>>>>>>>> The only part that I didn't understand is the trick you used
>>>>>>>>>>>>> in your tesseract_train.py script. You see, I have been doing
>>>>>>>>>>>>> exactly what you did, except for this script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The script seems to have the trick of sending/teaching each
>>>>>>>>>>>>> of the fonts (iteratively) to the model. The script I have been
>>>>>>>>>>>>> using (which I got from Garcia) doesn't mention fonts at all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training
>>>>>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata
>>>>>>>>>>>>> MAX_ITERATIONS=10000*
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does it mean that my model doesn't train on the fonts (even if
>>>>>>>>>>>>> the fonts were included in the splitting process, in the other
>>>>>>>>>>>>> script)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3
>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. This is the training command; I saved it as
>>>>>>>>>>>>>> 'tesseract_training.py' inside the tesstrain folder.
>>>>>>>>>>>>>> 2. The root directory means your main training folder, which
>>>>>>>>>>>>>> contains the langdata, tesseract, and tesstrain folders. If you
>>>>>>>>>>>>>> watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8
>>>>>>>>>>>>>> you will understand the folder structure better. The only
>>>>>>>>>>>>>> things I added are tesseract_training.py in the tesstrain
>>>>>>>>>>>>>> folder, for training, and the FontList.py file in the main
>>>>>>>>>>>>>> path, alongside langdata, tesseract, tesstrain, and
>>>>>>>>>>>>>> split_training_text.py.
>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your Linux
>>>>>>>>>>>>>> fonts folder, /usr/share/fonts/, then run: sudo apt update
>>>>>>>>>>>>>> then sudo fc-cache -fv
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After that, you have to add the exact font names to the
>>>>>>>>>>>>>> FontList.py file, as I did.
>>>>>>>>>>>>>> I have attached two pictures of my folder structure: the first
>>>>>>>>>>>>>> is the main structure and the second is the collapsed
>>>>>>>>>>>>>> tesstrain folder.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot
>>>>>>>>>>>>>> 2023-09-11 135014.png]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6
>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you so much for putting out these brilliant scripts.
>>>>>>>>>>>>>>> They make the process much more efficient.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have one more question about the other script, the one you
>>>>>>>>>>>>>>> use to train.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in the
>>>>>>>>>>>>>>> same/root directory?
>>>>>>>>>>>>>>> How do you set up the names of the fonts in the file, if you
>>>>>>>>>>>>>>> don't mind sharing it?
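A sketch of the training call discussed above, written with an argument list instead of a shell string. It assumes the tesstrain Makefile variables used in the thread (MODEL_NAME, START_MODEL, TESSDATA, MAX_ITERATIONS); the function names `build_training_command` and `run_training` are hypothetical. Note that the loop variable called `font` in the thread is passed as MODEL_NAME, so it selects a model name (and, going by the paths in the thread, its data/MODEL_NAME-ground-truth folder) rather than a font. The fonts only matter when the tif and box files are rendered.

```python
import os
import subprocess


def build_training_command(model_name, start_model=None,
                           tessdata="../tesseract/tessdata",
                           max_iterations=10000):
    """Return the `make training` call as an argument list (no shell needed)."""
    command = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    if start_model:
        # Fine-tune from an existing model; omit to train from scratch.
        command.append(f"START_MODEL={start_model}")
    return command


def run_training(model_name, start_model=None,
                 tessdata="../tesseract/tessdata", max_iterations=10000):
    """Run the training; meant to be called from inside the tesstrain folder."""
    # TESSDATA_PREFIX is passed through the environment, as in the thread.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    command = build_training_command(model_name, start_model,
                                     tessdata, max_iterations)
    return subprocess.run(command, env=env, check=True)
```

Under these assumptions, `run_training("ben", start_model="ben")` corresponds to the fine-tuning command in the thread, and `run_training("oro")` to the from-scratch one.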
>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3
>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can use the new script below; it's better than the
>>>>>>>>>>>>>>>> previous two scripts. You can create the *tif, gt.txt, and
>>>>>>>>>>>>>>>> .box files* with multiple fonts, and it also supports
>>>>>>>>>>>>>>>> breakpoints: if VS Code closes or anything else interrupts
>>>>>>>>>>>>>>>> the creation of the *tif, gt.txt, and .box files*, you can
>>>>>>>>>>>>>>>> use a checkpoint to resume from where you stopped.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Code for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list,
>>>>>>>>>>>>>>>>                          output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     # Default to the whole file when no range is given
>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     # Render every line in the range once per font
>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             # Ground-truth text file, named by text, line, and font
>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             # Same base name for the rendered image and box file
>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list,
>>>>>>>>>>>>>>>>                          output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then create a file called "FontList.py" in the root directory
>>>>>>>>>>>>>>>> and paste this into it:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then import it in the code above.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Command for a breakpoint:*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Change the --start 0 --end 11 checkpoint values to suit you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *And the training checkpoint you already know.*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6
>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi mhalidu,
>>>>>>>>>>>>>>>>> The script you posted here seems much more extensive than
>>>>>>>>>>>>>>>>> the one you posted before:
>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have been using your earlier script. It is magical. How
>>>>>>>>>>>>>>>>> is this one different from the earlier one?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have
>>>>>>>>>>>>>>>>> saved me countless hours by running multiple fonts in one
>>>>>>>>>>>>>>>>> sweep. I was not able to find any instructions on how to
>>>>>>>>>>>>>>>>> train for multiple fonts. The official manual is also
>>>>>>>>>>>>>>>>> unclear. Your script helped me get started.
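A small sketch of the file naming and the --start/--end range used by the newer script above, without the text2image call. It shows why including the font in the name keeps outputs from different fonts apart. The function names `file_base_name` and `planned_outputs` are hypothetical and only mirror the naming scheme in the thread.

```python
def file_base_name(text_stem, line_index, font_name):
    """Base name shared by the .gt.txt, .tif and .box files of one line."""
    # Spaces in the font name are replaced so the file name has none.
    return f"{text_stem}_{line_index}_{font_name.replace(' ', '_')}"


def planned_outputs(text_stem, fonts, start_line, end_line):
    """Yield (font, line_index, base_name) for an inclusive line range."""
    for font in fonts:
        for line_index in range(start_line, end_line + 1):
            yield font, line_index, file_base_name(text_stem, line_index, font)


if __name__ == "__main__":
    # Two fonts, lines 0..1: four distinct base names.
    for font, index, name in planned_outputs(
            "ben", ["Sagar Medium", "Gerlick"], 0, 1):
        print(font, index, name)
```

Because both bounds are inclusive, a run with `--start 0 --end 11` would be followed by `--start 12 --end 23` to continue without rendering any line twice.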
>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3
>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>>>>>>> One more thing: what role does the length of the
>>>>>>>>>>>>>>>>>> training_text lines play? I have seen that Bengali text
>>>>>>>>>>>>>>>>>> has long lines of words, so I want to know how many words
>>>>>>>>>>>>>>>>>> or characters per line would be the better choice for
>>>>>>>>>>>>>>>>>> training. And should '--xsize=3600', '--ysize=350' be set
>>>>>>>>>>>>>>>>>> according to the number of words per line?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Include the default fonts also in your fine-tuning list
>>>>>>>>>>>>>>>>>>> of fonts and see if that helps.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain
>>>>>>>>>>>>>>>>>>> <mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have trained some new fonts for the Bengali language
>>>>>>>>>>>>>>>>>>>> in Tesseract 5 using the fine-tuning method, and I used
>>>>>>>>>>>>>>>>>>>> the official training_text, tessdata_best, and
>>>>>>>>>>>>>>>>>>>> everything else. Everything is good, but the problem is
>>>>>>>>>>>>>>>>>>>> that the default font, which was already trained, no
>>>>>>>>>>>>>>>>>>>> longer converts text as well as before, while my new
>>>>>>>>>>>>>>>>>>>> fonts work well. I don't understand why this is
>>>>>>>>>>>>>>>>>>>> happening. I am sharing my code to help show what is
>>>>>>>>>>>>>>>>>>>> going on.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list,
>>>>>>>>>>>>>>>>>>>>                          output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list,
>>>>>>>>>>>>>>>>>>>>                          output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *And the training code:*
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>     command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> You received this message because you are subscribed to
>>>>>>>>>>>>>>>>>>>> the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving
>>>>>>>>>>>>>>>>>>>> emails from it, send an email to
>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>> .