[tesseract-ocr] Re: Train Tesseract with my own Data

Yaofu Zhou Tue, 21 May 2024 02:15:46 -0700

Hi. Your lstmtraining command is missing several required inputs. Please 
take a look at Tesstrain <https://github.com/tesseract-ocr/tesstrain>, and 
particularly its Makefile, to see what the training process involves. I 
would go over the official Tesstrain documentation and run "make help" to 
see the inputs it needs. One of the items, among many, that you have not 
specified is the CNN-LSTM network spec (the --net_spec option of 
lstmtraining). The error you quote ends with "Failed to create network from 
spec:" followed by nothing, which suggests the spec was empty. You can ask 
GPT/Claude to explain the spec syntax to you.
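
To make the missing piece concrete, here is a minimal Python sketch of an 
lstmtraining command line that carries a network spec. The options used 
(--model_output, --traineddata, --train_listfile, --max_iterations, 
--net_spec) are lstmtraining options; the helper name, the paths, and the 
spec string are assumptions for illustration. The spec follows the style of 
the example in the Tesseract training documentation, and the number after 
"O1c" would have to match the size of your own unicharset.

```python
import shlex

# Example spec in the style of the Tesseract training docs (an assumption,
# not a recommendation). The trailing number in "O1c111" is the number of
# output classes and must match the unicharset of your traineddata.
EXAMPLE_NET_SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]"


def build_lstmtraining_command(model_output, traineddata, train_listfile,
                               max_iterations, net_spec):
    """Return the lstmtraining argument list for training from scratch."""
    if not net_spec.startswith("[") or not net_spec.endswith("]"):
        raise ValueError("net_spec should be a bracketed layer list")
    return [
        "lstmtraining",
        "--model_output", model_output,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
        "--net_spec", net_spec,
    ]


command = build_lstmtraining_command(
    model_output="./output/font_name",          # hypothetical paths
    traineddata="./tessdata/deu.traineddata",
    train_listfile="./trainingsliste.txt",
    max_iterations=200,
    net_spec=EXAMPLE_NET_SPEC,
)
# shlex.join quotes the spec so a shell does not split it on spaces.
print(shlex.join(command))
```

If you fine-tune an existing model instead of training from scratch, 
lstmtraining takes --continue_from with a checkpoint extracted from the 
traineddata, and the network spec comes from that checkpoint.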

Furthermore, you can use GPT or Claude to digest the Makefile for you so 
that you know what binaries are invoked during different steps of the 
training process. Once you find the binaries involved, you can do something 
like "lstmtraining --help" for each binary and check for the complete list 
of options, some of which are not specified in the Tesstrain Makefile.
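
As a rough sketch of that step, the following Python snippet asks each 
binary for its help text. The list of binary names is my assumption about 
what the Tesstrain Makefile calls, so check it against the Makefile; the 
function name is hypothetical.

```python
import shutil
import subprocess

# Binaries that the Tesstrain Makefile is assumed to call; check the
# Makefile itself for the authoritative list.
TRAINING_BINARIES = [
    "tesseract",
    "unicharset_extractor",
    "combine_lang_model",
    "combine_tessdata",
    "lstmtraining",
    "lstmeval",
]


def collect_help(binaries):
    """Map each binary name to its --help text, or None if not on PATH."""
    help_texts = {}
    for name in binaries:
        path = shutil.which(name)
        if path is None:
            help_texts[name] = None
            continue
        done = subprocess.run(
            [path, "--help"],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
        )
        # Some tools print their usage on stderr, so keep both streams.
        help_texts[name] = done.stdout + done.stderr
    return help_texts


for name, text in collect_help(TRAINING_BINARIES).items():
    if text is None:
        print(f"{name}: not found on PATH")
    else:
        print(f"{name}: {len(text.splitlines())} lines of help text")
```

Saving each help text to a file makes it easy to search for options that 
the Makefile never sets.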

Once you digest the Makefile of Tesstrain, it will become clear to you 
that, as messy as it may be, it is just an ugly wrapper to run various 
Tesseract binaries in sequence, which is similar to what you were trying to 
achieve. Then, you can (use GPT/Claude to) tailor the Makefile for you and 
even turn it into an equivalent Python script for easier modifications. 
This is almost certainly necessary if your training set is very large.
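
A Python equivalent can start as small as the sketch below: a list of 
named steps run in order, stopping at the first failure. The stage names 
and the placeholder commands are assumptions so that the sketch runs 
anywhere; in a real script each argv would be the command line the Makefile 
would have executed for that stage.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order and stop at the first failure."""
    completed = []
    for name, argv in steps:
        print(f"[{name}] {' '.join(argv)}")
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"step '{name}' failed with exit code {result.returncode}:\n"
                f"{result.stdout}{result.stderr}"
            )
        completed.append(name)
    return completed


# Placeholder steps so the sketch runs anywhere; replace each argv with the
# real command line that the Makefile would have executed for that stage.
demo_steps = [
    ("make line data", [sys.executable, "-c", "print('stage 1')"]),
    ("build starter traineddata", [sys.executable, "-c", "print('stage 2')"]),
    ("train", [sys.executable, "-c", "print('stage 3')"]),
]
print(run_steps(demo_steps))
```

Passing argument lists to subprocess.run, instead of one shell string to 
os.system, avoids quoting problems with arguments that contain spaces, 
such as the network spec.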

On Monday, April 22, 2024 at 2:08:09 PM UTC-4 testc...@gmail.com wrote:

> Hi,
> I am trying to train a Tesseract model with my own data. This is my code:
>
> import os
>
> # Configure paths
> TRAIN_DATA_DIR = "./data1"
> TRAIN_LISTFILE = "./trainingsliste.txt"
> OUTPUT_DIR = "./output"
> TRAINEDDATA = "./tesseract-4.1/tessdata/deu.traineddata"
>
> # Check required paths
> if (not os.path.exists(TRAIN_DATA_DIR)
>         or not os.path.exists(TRAIN_LISTFILE)
>         or not os.path.exists(TRAINEDDATA)):
>     raise FileNotFoundError(
>         "One or more required directories/files are missing.")
>
> # Create the output directory if it does not exist
> if not os.path.exists(OUTPUT_DIR):
>     os.makedirs(OUTPUT_DIR)
>
> # Training configuration
> MAX_ITERATIONS = 200
> os.environ['OMP_THREAD_LIMIT'] = '16'
>
> # Training command
> command = (f'lstmtraining --model_output {OUTPUT_DIR}/font_name '
>            f'--traineddata {TRAINEDDATA} --train_listfile {TRAIN_LISTFILE} '
>            f'--max_iterations {MAX_ITERATIONS}')
> result = os.system(command + " > train_output.txt 2>&1")
> print("Executed command:", command)
>
> if result != 0:
>     with open('train_output.txt', 'r') as file:
>         output = file.read()
>     print("Error during training:", output)
>     raise Exception("Failed to start the training process.")
>
> And this is the error:
>
> Must specify an input layer as the first layer, not !!
> Failed to create network from spec: 
>
