[tesseract-ocr] Tesseract training with Custom Dataset

Ishak DÖLEK Fri, 18 Apr 2025 01:38:24 -0700

Hello,

I am writing to ask whether it is possible to train a Tesseract model
using my custom dataset. This dataset consists of line images of Arabic
script paired with corresponding Latin-based text lines.


Specifically, I have the following questions:

Is it possible to train Tesseract with a dataset in which the images contain
right-to-left (RTL) Arabic script while the corresponding text lines are
left-to-right (LTR) Latin-based text? I have attached an example.

If training with such a dataset is possible, are there any specific
documents or tutorials available that outline the process? Any guidance on
how to structure the training data and the training commands would be
greatly appreciated.

Thank you for your time and assistance. I look forward to your guidance on
this matter.



make LANG_TYPE=RTL MODEL_NAME=ara GROUND_TRUTH_DIR=data/ara-ground-truth \
    PSM=13 TESSDATA=/tessdata EPOCHS=20 training
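
To make the question about data structure concrete, here is a minimal
sketch of how the ground-truth directory named in the command above could
be checked before training. It assumes the tesstrain convention that each
line image (.tif, .png, .bin.png or .nrm.png) sits next to a single-line
transcription with the same base name and the extension .gt.txt. The
function names are hypothetical, and the sketch only checks the file
layout; it says nothing about whether pairing Arabic images with Latin
text will train well.

```python
from pathlib import Path
from typing import List

# Image extensions assumed from the tesstrain ground-truth convention.
# Longer suffixes come first so "x.bin.png" maps to "x.gt.txt".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_path(image: Path) -> Path:
    """Return the .gt.txt path expected next to a line image."""
    for suffix in IMAGE_SUFFIXES:
        if image.name.endswith(suffix):
            return image.with_name(image.name[: -len(suffix)] + ".gt.txt")
    raise ValueError(f"not a supported line image: {image}")


def find_problems(ground_truth_dir: Path) -> List[str]:
    """List images whose transcription is missing, empty, or multi-line."""
    problems = []
    for image in sorted(ground_truth_dir.iterdir()):
        if not image.name.endswith(IMAGE_SUFFIXES):
            continue  # skip .gt.txt files and anything else
        gt = transcription_path(image)
        if not gt.is_file():
            problems.append(f"{image.name}: missing {gt.name}")
            continue
        text = gt.read_text(encoding="utf-8").strip()
        if not text:
            problems.append(f"{gt.name}: empty transcription")
        elif "\n" in text:
            problems.append(f"{gt.name}: more than one line")
    return problems


if __name__ == "__main__":
    # Directory taken from GROUND_TRUTH_DIR in the make command.
    root = Path("data/ara-ground-truth")
    if root.is_dir():
        for problem in find_problems(root):
            print(problem)
```

With this layout, a line image such as line_0001.png would be paired with
line_0001.gt.txt holding the Latin-based text for that line (the file
names are only illustrative).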


Sincerely,
Ishak Dölek

-- 
Dr. İshak Dölek
Mina AR-GE, Kurucu Ortak
ishakdole...@gmail.com <atakanh...@gmail.com>
is...@osmanlica.com
ishakdo...@subu.edu.tr <atakan.k...@istanbul.edu.tr>
https://ishakdolek.github.io


Allahu akbaru kabiran wa‑l‑hamdu li‑llahi hamdan kaṯiran fa‑subhana llah 
wa‑bi‑hamdi bukratan wa‑asilan lam yattahiḏ sahibatan wa‑la waladan
