Have you faced the "Can't encode transcription" error? If you have, how did you solve it?
On Thursday, 14 September 2023 at 10:51:52 am UTC+6, elvi...@gmail.com wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM, Ali hussain <mdalihu...@gmail.com> wrote:

Are you training from Tesseract's default text data or from your own collected text data?

On Thursday, 14 September 2023 at 12:19:53 am UTC+6, desal...@gmail.com wrote:

I am now at 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, mdalihu...@gmail.com wrote:

After Tesseract recognizes text from the images, you can apply a regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September 2023 at 5:06:12 pm UTC+6, desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are never recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

See what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    জ্া: "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    য: "য",
    " া": "া",
    আা: "আ",
    ম্ি: "মি",
    স্ু: "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    র্্: "র",
    "চিন্ত ": "চিন্তা ",
    ন্া: "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original English training data didn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes completely removed, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, then you can apply regular-expression logic in your application. After OCR, those specific characters or words will be replaced by the correct characters or words that you defined with regular expressions. That can fix some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, desal...@gmail.com wrote:

The characters are getting missed, even after fine-tuning. I never made any progress. I tried many different ways.
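A correction table like the one Ali posted above can be applied to raw OCR output with plain string replacement (no regex engine is needed for fixed strings). A minimal sketch, with a few entries taken from the table above; the function name is illustrative:

```python
# Post-OCR correction pass: fixed-string replacements applied to the raw
# OCR text. A few entries are taken from the table above; extend as needed.
CORRECTIONS = {
    "আা": "আ",
    "চিন্ত ": "চিন্তা ",
    "সম ূর্ন": "সম্পূর্ণ",
}

def post_correct(text: str, corrections: dict) -> str:
    # Apply the longest keys first, so longer broken sequences are fixed
    # before any shorter key that overlaps them.
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text
```

As desalegn notes below, this only helps when Tesseract's mistakes are consistent; a replacement table cannot recover characters that are sometimes dropped and sometimes substituted.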
Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

EasyOCR, I think, is best for ID cards and similar images, but for document images like books Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning on a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have none. I think the problem is in the dataset, but I used Tesseract's default Bengali training dataset, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts but only one font.
So I think he didn't use MODEL_NAME in a separate script file.

Actually, here we train on all the tif, gt.txt, and .box files that were created under a given MODEL_NAME (I mean the language code, such as eng, ben, or oro), because when we first create the tif, gt.txt, and .box files, every filename starts with the MODEL_NAME. The MODEL_NAME we select in the training script is used to loop over each tif, gt.txt, and .box file created under that MODEL_NAME.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Garcia's video, but your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me. The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all.
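Ali's point above — that the generated tif/gt.txt/.box files are grouped by the MODEL_NAME filename prefix, which the training loop then uses to pick them up — can be sketched as follows (the directory layout and helper name are illustrative):

```python
import pathlib

def files_for_model(ground_truth_dir: str, model_name: str):
    """Collect the .gt.txt files whose names start with the MODEL_NAME
    prefix (the language code: eng, ben, oro, ...), mirroring how the
    generated training files are grouped per model."""
    root = pathlib.Path(ground_truth_dir)
    return sorted(root.glob(f"{model_name}_*.gt.txt"))
```

Each returned .gt.txt has a matching .tif and .box with the same base name, which is why every file must start with the same MODEL_NAME prefix.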
    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This is the training command; I have saved it as 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file sits in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder.
That is /usr/share/fonts/. Then run sudo apt update, and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script, the one you use to train:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files for multiple fonts, and you can also use a breakpoint: if VS Code closes (or anything else happens) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to get back to where it stopped.

The script for the tif, gt.txt, and .box files:

    import os
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0

        if end_line is None:
            end_line = len(lines) - 1

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()

                training_text_file_name = pathlib.Path(training_text_file).stem

                line_serial = f"{line_index:d}"

                line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList" in the root directory and paste this in:
    class FontList:
        def __init__(self):
            self.fonts = [
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

Then import it in the code above.

The breakpoint command:

    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) to suit you.

And the training checkpoint works as you already know.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mdalihu,

The script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instruction on how to train for multiple fonts.
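The --start/--end checkpointing described above can also be planned ahead of time by splitting the training text into fixed-size line ranges, one `split_training_text.py --start S --end E` run per range; a minimal sketch (the chunk size is arbitrary):

```python
def line_chunks(total_lines: int, chunk_size: int):
    """Yield inclusive (start, end) line ranges suitable for repeated
    `split_training_text.py --start S --end E` runs, so an interrupted
    run can be resumed from the next unprocessed chunk."""
    start = 0
    while start < total_lines:
        end = min(start + chunk_size - 1, total_lines - 1)
        yield start, end
        start = end + 1
```

For example, with a chunk size of 12, the first run would be `--start 0 --end 11`, matching the command above.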
The official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try what you said.

One more thing: what role do the training_text lines play? I have seen that Bengali text comes in long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts in your fine-tuning list of fonts as well, and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM, Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with fine-tuning methods for the Bengali language in Tesseract 5, using all the official training_text, tessdata_best, and the rest. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code to help understand what is going on.
The code for creating the tif, gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def read_line_count():
        if os.path.exists('line_count.txt'):
            with open('line_count.txt', 'r') as file:
                return int(file.read())
        return 0

    def write_line_count(line_count):
        with open('line_count.txt', 'w') as file:
            file.write(str(line_count))

    def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            for line in input_file.readlines():
                lines.append(line.strip())

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        random.shuffle(lines)

        if start_line is None:
            line_count = read_line_count()  # Set the starting line_count from the file
        else:
            line_count = start_line

        if end_line is None:
            end_line_count = len(lines) - 1  # Set the ending line_count
        else:
            end_line_count = min(end_line, len(lines) - 1)

        for font in font_list.fonts:  # Iterate through all the fonts in the font_list
            font_serial = 1
            for line in lines:
                training_text_file_name = pathlib.Path(training_text_file).stem

                # Generate a unique serial number for each line
                line_serial = f"{line_count:d}"

                # GT (ground truth) text filename
                line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Image filename
                file_base_name = f'ben_{line_serial}'  # Unique filename for each font
                subprocess.run([
                    'text2image',
                    f'--font={font}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=350',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/ben.unicharset',
                ])

                line_count += 1
                font_serial += 1

            # Reset font_serial for the next font iteration
            font_serial = 1

        write_line_count(line_count)  # Update the line_count in the file

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/ben.training_text'
        output_directory = 'tesstrain/data/ben-ground-truth'

        # Create an instance of the FontList class
        font_list = FontList()

        create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
        subprocess.run(command, shell=True)

Any suggestions for pinpointing the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
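Several messages in this thread judge models by their training error rate (0.46, 0.51). The same comparison can be made on held-out pages by computing a character error rate (CER) between the OCR output and the ground truth; a minimal sketch using the standard Levenshtein distance:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the ground-truth text
    and the OCR output, divided by the ground-truth length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)
```

Scoring the same test page with the default ben.traineddata and with a fine-tuned model makes regressions on the default fonts (like the one described above) measurable instead of anecdotal.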