I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihussain...@gmail.com> wrote:
Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September 2023 at 12:19:53 am UTC+6, desal...@gmail.com wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, mdalihu...@gmail.com wrote:

After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September 2023 at 5:06:12 pm UTC+6, desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently misrecognized by Tesseract, so I use regex to replace those characters and words when they come out wrong.

Here is what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    "জ্া": "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    "য": "য",
    " া": "া",
    "আা": "আ",
    "ম্ি": "মি",
    "স্ু": "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    "র্্": "র",
    "চিন্ত ": "চিন্তা ",
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it meets /u/ in actual processing? In some cases it replaces it with a closely similar letter such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application. After OCR, those specific characters or words are replaced by the correct characters or words that you define with regular expressions. It can solve some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, desal...@gmail.com wrote:

The characters are still getting missed, even after fine-tuning. I never made any progress, and I tried many different ways. Some specific characters are always missing from the OCR result.
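A post-OCR replacement pass like the one described above can be sketched in a few lines of Python. The pairs below are hypothetical Latin-script placeholders; in practice the map would hold the Bengali wrong-to-correct pairs listed in the thread.

```python
# Sketch of the post-OCR replacement step discussed above.
# The pairs here are hypothetical placeholders; in practice the map
# would contain the Bengali wrong -> correct pairs from the thread.
REPLACEMENTS = {
    "teh": "the",
    "recieve": "receive",
}

def fix_ocr_text(text, replacements):
    """Apply each wrong -> correct substitution to the OCR output."""
    for wrong, correct in replacements.items():
        text = text.replace(wrong, correct)
    return text

print(fix_ocr_text("teh dog will recieve mail", REPLACEMENTS))
# -> the dog will receive mail
```

Because the thread notes that Tesseract's mistakes are inconsistent, a fixed map like this only catches the stable errors; anything that varies by context would need real regular expressions (the `re` module) with some anchoring around the pattern.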
On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself, though; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning the few new characters, as you said ("But, I failed in every possible way to introduce a few new characters into the database.")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow; the video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably the next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.
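The 0.46 and 0.51 figures quoted in this thread are training character error rates; a quick way to measure real accuracy on a page is to compare the OCR output against a hand-corrected transcript. A minimal character-error-rate sketch (my own helper, not part of Tesseract or tesstrain):

```python
# Minimal character error rate (CER) check for OCR output.
# CER = edit_distance(reference, hypothesis) / len(reference).
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against `reference`."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars -> 0.5
```

Running this on a few real page lines gives a more honest number than the training error, since a model can reach a low training error and still generalize poorly, as reported above.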
On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get a space, but most of them get none. I think the problem is in the dataset, but I used the default Tesseract training dataset for Bengali, which confuses me, so I have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch really is harder than fine-tuning, so you can experiment with different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, reaching a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts, only one font. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (i.e. the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. The MODEL_NAME we select in the training script is then used to loop over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Garcia's video, but your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me. The only part I didn't understand is the trick you used in your tesseract_train.py script; you see, I have been doing exactly what you did except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

    import subprocess

    # List of model names to train
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file lives in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question, about the other script, the one you use to train:

    import subprocess

    # List of model names to train
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
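The `make training` invocation that recurs throughout this thread can be wrapped in a small helper so that MODEL_NAME, START_MODEL, and MAX_ITERATIONS stay in one place. This is a sketch of my own, not code from the thread; the variable names follow the tesstrain Makefile as used above, and `run_training` is a hypothetical wrapper.

```python
import subprocess

def build_training_command(model_name,
                           start_model=None,
                           max_iterations=10000,
                           tessdata="../tesseract/tessdata"):
    """Assemble the tesstrain `make training` command line used in this thread."""
    parts = [
        f"TESSDATA_PREFIX={tessdata}",
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    if start_model is not None:  # omitted when training from scratch
        parts.append(f"START_MODEL={start_model}")
    return " ".join(parts)

def run_training(model_name, **kwargs):
    """Hypothetical wrapper that actually launches the training run."""
    subprocess.run(build_training_command(model_name, **kwargs),
                   shell=True, check=True)

print(build_training_command("ben", start_model="ben"))
```

Building the command string in one place also makes it easy to add extra make variables, such as the LANG_TYPE=Indic flag used later in this thread for Bengali, without editing several scripts.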
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where you stopped.

Script for creating the tif, gt.txt, and .box files:

    import os
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory,
                             start_line=None, end_line=None):
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0
        if end_line is None:
            end_line = len(lines) - 1

        training_text_file_name = pathlib.Path(training_text_file).stem

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()
                line_serial = f"{line_index:d}"
                file_base_name = (f'{training_text_file_name}_{line_serial}_'
                                  f'{font_name.replace(" ", "_")}')

                # Ground-truth text file for this line
                line_gt_text = os.path.join(output_directory,
                                            f'{file_base_name}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Render the matching tif/box pair with text2image
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int,
                            help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int,
                            help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()
        create_training_data(training_text_file, font_list, output_directory,
                             args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste the following into it.
    class FontList:
        def __init__(self):
            self.fonts = [
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

Then import it in the script above.

Command for resuming from a breakpoint:

    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint values (--start 0 --end 11) as needed. The training checkpoint works as you already know.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mdalihu,

The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try as you said. One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line is the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts in your fine-tuning list of fonts as well, and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. It all works, but the problem is that the default fonts, which were trained before, no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.
Code for creating the tif, gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def read_line_count():
        # Resume point saved by a previous run
        if os.path.exists('line_count.txt'):
            with open('line_count.txt', 'r') as file:
                return int(file.read())
        return 0

    def write_line_count(line_count):
        with open('line_count.txt', 'w') as file:
            file.write(str(line_count))

    def create_training_data(training_text_file, font_list, output_directory,
                             start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            for line in input_file.readlines():
                lines.append(line.strip())

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        random.shuffle(lines)

        if start_line is None:
            line_count = read_line_count()  # resume from the saved line count
        else:
            line_count = start_line

        if end_line is None:
            end_line_count = len(lines) - 1
        else:
            end_line_count = min(end_line, len(lines) - 1)

        training_text_file_name = pathlib.Path(training_text_file).stem

        for font in font_list.fonts:  # iterate through all the fonts in the list
            for line in lines[:end_line_count + 1]:
                line_serial = f"{line_count:d}"  # unique serial for each line

                # GT (ground truth) text file
                line_gt_text = os.path.join(
                    output_directory,
                    f'{training_text_file_name}_{line_serial}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Image file (the serial keeps each line/font pair unique)
                file_base_name = f'ben_{line_serial}'
                subprocess.run([
                    'text2image',
                    f'--font={font}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=350',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/ben.unicharset',
                ])

                line_count += 1

        write_line_count(line_count)  # save the count so the next run can resume

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int,
                            help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int,
                            help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/ben.training_text'
        output_directory = 'tesstrain/data/ben-ground-truth'

        # Create an instance of the FontList class
        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory,
                             args.start, args.end)

And the training code:

    import subprocess

    # List of model names to train (passed to MODEL_NAME)
    font_names = ['ben']

    for font in font_names:
        command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
                   f"MODEL_NAME={font} START_MODEL=ben "
                   f"TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
                   f"LANG_TYPE=Indic")
        subprocess.run(command, shell=True)

Any suggestions for identifying the problem? Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com