Re: [tesseract-ocr] Facing issues with unicharset when trying to automate model training

Zdenko Podobny Mon, 28 Apr 2025 00:59:26 -0700

> ❌ Training failed: Command '['make', 'unicharset', 'lists', 'proto-model',
> 'tesseract-langdata', 'training', 'MODEL_NAME=jpn1', 'START_MODEL=jpn',
> 'TESSDATA=C:/Users/Chan Jian Sen/Documents/TesseractFineT...



Why are you showing us so much Python code if your shell command fails? Or
does it work in the terminal?
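
One thing worth checking when reproducing this in a terminal: the quoted log
below ends with `Failed to read C:/Users/Chan`, which suggests the `TESSDATA`
path was split at its first space once it reached the `combine_tessdata`
command line. A minimal sketch of a pre-flight check, assuming that whitespace
in the path values is the culprit (the helper name `args_with_whitespace` is
hypothetical, not part of tesstrain):

```python
def args_with_whitespace(make_args):
    """Return the NAME=value make arguments whose value contains whitespace."""
    suspicious = []
    for arg in make_args:
        name, sep, value = arg.partition("=")
        # Plain targets such as "unicharset" have no "=" and are skipped.
        if sep and any(ch.isspace() for ch in value):
            suspicious.append(arg)
    return suspicious


command = [
    "make", "unicharset", "training",
    "MODEL_NAME=jpn1",
    "START_MODEL=jpn",
    "TESSDATA=C:/Users/Chan Jian Sen/Documents/TesseractFineTuningJpn5/tessdata",
]

for arg in args_with_whitespace(command):
    print(f"Path contains spaces and may be split by make: {arg}")
```

If anything is reported, moving the tesstrain and tessdata folders to a path
without spaces is probably the simplest thing to try first.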

BTW:

>     for fname in os.listdir(tessdata_path):
>         if fname.startswith(model_prefix) and fname.endswith(".traineddata"):
>             suffix = fname[len(model_prefix):-len(".traineddata")]


What is your Python version? 2.x? Have you heard about `pathlib` (or
`glob`)?
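
For what it's worth, a `pathlib` version of that loop could look roughly like
this. It is only a sketch: the function name `list_models` is made up here,
and it assumes the same naming scheme as the quoted code (`jpn.traineddata`,
`jpn1.traineddata`, and so on).

```python
from pathlib import Path


def list_models(tessdata_path, model_prefix="jpn"):
    """Return sorted (number, model_name) pairs found in tessdata_path."""
    models = []
    # Path.glob does the prefix/extension filtering that the
    # startswith/endswith pair did by hand.
    for path in Path(tessdata_path).glob(f"{model_prefix}*.traineddata"):
        # path.stem is the file name without the ".traineddata" extension.
        suffix = path.stem[len(model_prefix):]
        if suffix == "":
            models.append((0, model_prefix))
        elif suffix.isdigit():
            models.append((int(suffix), path.stem))
    return sorted(models)
```

Files such as `jpn_vert.traineddata` are skipped, because their suffix is
neither empty nor purely numeric.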

Zdenko


On Mon, 28 Apr 2025 at 9:43, Jiansen Chan <jianse...@gmail.com> wrote:

> My goal is to automate model training in Tesseract OCR for Japanese text.
> The user should only have to paste ground-truth files and image files into
> a particular folder, and that data is then used to train a new model. This
> process should be repeatable: every time data is added to the folder, I
> expect training to run automatically.
>
> However, I run into the error below when I run the automated Tesseract
> training from VSCode. I have a script that uses watchdog to detect newly
> added .tif/.png files, together with their corresponding .gt.txt files, in
> a particular folder (which the model is supposed to treat as its training
> data). The watcher file looks something like this:
>
> (watcher.py)
>
> import time
> import os
> from watchdog.observers import Observer
> from watchdog.events import FileSystemEventHandler
> from pathlib import Path
> from training.tesseract_training import run_tesseract_training
> from training.training_model_utils import get_latest_and_next_model
> WATCHED_FOLDER = r"C:\Users\Chan Jian Sen\Documents\ocr-japanese\INPUT_TRAINING_DATA"  # ground truth put here
> tesstrain_dir = r"C:\Users\Chan Jian Sen\Documents\TesseractFineTuningJpn5\tesstrain"
>
> class TrainingInputHandler(FileSystemEventHandler):
>
>     def on_modified(self, event):
>         self.check_and_trigger_training()
>
>     def on_created(self, event):
>         self.check_and_trigger_training()
>
>     def check_and_trigger_training(self):
>         files = os.listdir(WATCHED_FOLDER)
>         pngs = {Path(f).stem for f in files if f.endswith('.png')}
>         gts = {Path(f).stem for f in files if f.endswith('.gt.txt')}
>         common = pngs & gts
>
>         if len(common) == 0:
>             print("⏳ Waiting for matching .png and .gt.txt pairs...")
>
>
>         tessdata_path = r"C:\Users\Chan Jian Sen\Documents\TesseractFineTuningJpn5\tessdata"
>         start_model, new_model = get_latest_and_next_model(tessdata_path)
>
>         # Problem here: the old model is seen as jpn and the new model as jpn1
>         print(f"🔁 Using {start_model} as base, training new model: {new_model}")
>
>         # The first parameter MUST be your tesstrain folder
>         run_tesseract_training(tesstrain_dir, new_model, start_model)
>         observer.stop()
>
> if __name__ == "__main__":
>     print(f"👀 Watching training data folder: {WATCHED_FOLDER}")
>     event_handler = TrainingInputHandler()
>     observer = Observer()
>     observer.schedule(event_handler, WATCHED_FOLDER, recursive=False)
>     observer.start()
>
>     try:
>         while observer.is_alive():
>             time.sleep(1)
>     except KeyboardInterrupt:
>         observer.stop()
>     observer.join()
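
A side note on the pairing logic quoted above: `Path("page1.gt.txt").stem` is
`"page1.gt"`, not `"page1"`, so the `pngs & gts` intersection presumably never
finds a pair. There is also no `return` after the "Waiting" message, so
training starts regardless, which matches the quoted log, where "Waiting" is
immediately followed by a training run. A minimal sketch of one way to pair
the files, assuming tesstrain-style names such as `page1.png` and
`page1.gt.txt` (the helper `matching_pairs` is hypothetical):

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".tif"}
GT_SUFFIX = ".gt.txt"


def matching_pairs(filenames):
    """Return the sorted base names that have both an image and a .gt.txt file."""
    images = {Path(f).stem for f in filenames
              if Path(f).suffix.lower() in IMAGE_SUFFIXES}
    # Strip the whole ".gt.txt" ending; Path.stem would only drop ".txt".
    truths = {f[:-len(GT_SUFFIX)] for f in filenames if f.endswith(GT_SUFFIX)}
    return sorted(images & truths)


files = ["page1.png", "page1.gt.txt", "page2.tif", "page3.gt.txt"]
print(matching_pairs(files))  # prints ['page1']
```

In the handler, an empty result would be the place to `return` early instead
of falling through to training.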
>
>
>
>
> To generate a new model name (since I want to automate model training), I
> also have these functions:
> (training_model_utils.py)
> import os
>
> def get_model_names(tessdata_path, model_prefix="jpn"):
>     models = []
>     for fname in os.listdir(tessdata_path):
>         if fname.startswith(model_prefix) and fname.endswith(".traineddata"):
>             suffix = fname[len(model_prefix):-len(".traineddata")]
>             if suffix == "":
>                 models.append((0, "jpn"))
>             elif suffix.isdigit():
>                 models.append((int(suffix), f"{model_prefix}{suffix}"))
>     models.sort()
>     return models
>
> def get_latest_and_next_model(tessdata_path, model_prefix="jpn"):
>     models = get_model_names(tessdata_path, model_prefix)
>     if not models:
>         return model_prefix, f"{model_prefix}2"
>     latest = models[-1][1]
>     next_num = models[-1][0] + 1
>     next_model = f"{model_prefix}{next_num}" if next_num > 0 else f"{model_prefix}2"
>     return latest, next_model
>
> I also wrapped the make training procedure in a Python script that I run
> from VSCode. The code snippet below is meant to run the Tesseract
> training.
> (tesseract_training.py)
> import subprocess
> import os
>
> def run_tesseract_training(training_dir, model_name, start_model,
>                            max_iterations=4000):  # previously start model is jpn
>     """
>     Run the full Tesseract tesstrain workflow including unicharset and
>     langdata.
>     """
>     tessdata_path = r"C:\Users\Chan Jian Sen\Documents\TesseractFineTuningJpn5\tessdata"
>     # Important: replace backslashes with forward slashes
>     tessdata_path = tessdata_path.replace("\\", "/")
>     command = [
>         "make",
>         "unicharset", "lists", "proto-model", "tesseract-langdata", "training",
>         f"MODEL_NAME={model_name}",
>         f"START_MODEL={start_model}",
>         # Adjust path depending on where your .traineddata are
>         f"TESSDATA={tessdata_path}",
>         f"GROUND_TRUTH_DIR={training_dir}",
>         f"MAX_ITERATIONS={max_iterations}",
>         "LEARNING_RATE=0.001"
>     ]
>
>     print("🚀 Running full Tesseract training pipeline...")
>     try:
>         subprocess.run(
>             command,
>             cwd=r"C:\Users\Chan Jian Sen\Documents\TesseractFineTuningJpn5\tesstrain",
>             shell=True, check=True)
>         print(f"✅ Training complete: {model_name}.traineddata generated.")
>     except subprocess.CalledProcessError as e:
>         print(f"❌ Training failed: {e}")
>
> However, when I run the code, an error appears, and I'm not sure how to
> deal with it:
>
>
> PS C:\Users\Chan Jian Sen\Documents\ocr-japanese>  c:; cd 'c:\Users\Chan
> Jian Sen\Documents\ocr-japanese'; & 'c:\Users\Chan J
> Sen\.vscode\extensions\ms-python.debugpy-2025.6.0-win32-x64\bundled\libs\debugpy\launcher'
> '58725' '--' 'C:\Users\Chan Jian S
> 👀 Watching training data folder: C:\Users\Chan Jian
> Sen\Documents\ocr-japanese\INPUT_TRAINING_DATA
> ⏳ Waiting for matching .png and .gt.txt pairs...
> 🔁 Using jpn as base, training new model: jpn1
> 🚀 Running full Tesseract training pipeline...
> You are using make version: 4.4.1
> Makefile:438: *** mixed implicit and normal rules: deprecated syntax
> combine_tessdata -u C:/Users/Chan Jian
> Sen/Documents/TesseractFineTuningJpn5/tessdata/jpn.traineddata data/jpn/jpn1
> 👀 Watching training data folder: C:\Users\Chan Jian
> Sen\Documents\ocr-japanese\INPUT_TRAINING_DATA
> ⏳ Waiting for matching .png and .gt.txt pairs...
> 🔁 Using jpn as base, training new model: jpn1
> 🚀 Running full Tesseract training pipeline...
> You are using make version: 4.4.1
> Makefile:438: *** mixed implicit and normal rules: deprecated syntax
> combine_tessdata -u C:/Users/Chan Jian
> Sen/Documents/TesseractFineTuningJpn5/tessdata/jpn.traineddata data/jpn/jpn1
> Failed to read C:/Users/Chan
> make: *** [Makefile:207: data/jpn/jpn1.lstm-unicharset] Error 1
> ❌ Training failed: Command '['make', 'unicharset', 'lists', 'proto-model',
> 'tesseract-langdata', 'training', 'MODEL_NAME=jpn1', 'START_MODEL=jpn',
> 'TESSDATA=C:/Users/Chan Jian Sen/Documents/TesseractFineTuningJpn5/tessdata',
> 'GROUND_TRUTH_DIR=C:\\Users\\Chan Jian
> Sen\\Documents\\TesseractFineTuningJpn5\\tesstrain', 'MAX_ITERATIONS=4000',
> 'LEARNING_RATE=0.001']' returned non-zero exit status 2.
>
> (The yellow parts are the error.) I would greatly appreciate any help!
> Sorry if it looks complicated, haha.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/1d1b27e3-fd8d-43c5-a801-50cfcaa196efn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/1d1b27e3-fd8d-43c5-a801-50cfcaa196efn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

