Hello, I am writing to inquire about the possibility of training a Tesseract model using my custom dataset. This dataset consists of Arabic image lines paired with corresponding Latin-based text lines.
Specifically, I have the following questions:

Is it possible to train Tesseract with a dataset where the images contain right-to-left (RTL) Arabic script and the corresponding text lines are left-to-right (LTR) Latin-based text? I am sharing the attached example.

If training with such a dataset is possible, are there any specific documents or tutorials available that outline the process? Any guidance on how to structure the training data and the training commands would be greatly appreciated.

Thank you for your time and assistance. I look forward to your guidance on this matter.

make LANG_TYPE=RTL MODEL_NAME=ara GROUND_TRUTH_DIR=data/ara-ground-truth PSM=13 TESSDATA=/tessdata EPOCHS=20 training

Sincerely,
Ishak Dölek

--
Dr. İshak Dölek
Mina AR-GE, Co-Founder
ishakdole...@gmail.com <atakanh...@gmail.com>
is...@osmanlica.com
ishakdo...@subu.edu.tr <atakan.k...@istanbul.edu.tr>
https://ishakdolek.github.io
Allahu akbaru kabiran wa‑l‑hamdu li‑llahi hamdan kaṯiran fa‑subhana llah wa‑bi‑hamdi bukratan wa‑asilan lam yattahiḏ sahibatan wa‑la waladan
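
In case it helps make the question concrete, here is a minimal sketch of a sanity check for the directory passed as GROUND_TRUTH_DIR. It assumes the tesstrain convention that each line image sits next to a transcription with the same base name and a `.gt.txt` extension. The file names (`line_0001.png` and so on), the accepted image extensions, and the helper names are hypothetical and only for illustration. It also flags transcriptions that still contain Arabic-script characters, since in this dataset the text side is meant to be Latin-based.

```python
import unicodedata
from pathlib import Path

# Assumption: line images are stored as .png or .tif files.
IMAGE_SUFFIXES = (".png", ".tif")


def find_unpaired(ground_truth_dir):
    """Return names of line images that have no matching .gt.txt file."""
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # line_0001.png is expected to pair with line_0001.gt.txt
            if not image.with_suffix(".gt.txt").exists():
                missing.append(image.name)
    return missing


def contains_arabic(text):
    """True if any character's Unicode name marks it as Arabic script."""
    return any("ARABIC" in unicodedata.name(ch, "") for ch in text)


def find_arabic_transcriptions(ground_truth_dir):
    """Return names of .gt.txt files that contain Arabic-script characters."""
    root = Path(ground_truth_dir)
    return [
        gt.name
        for gt in sorted(root.glob("*.gt.txt"))
        if contains_arabic(gt.read_text(encoding="utf-8"))
    ]


if __name__ == "__main__":
    gt_dir = "data/ara-ground-truth"  # same path as in the make command
    print("images without transcription:", find_unpaired(gt_dir))
    print("transcriptions with Arabic characters:", find_arabic_transcriptions(gt_dir))
```

This only checks file pairing and character script. It does not say anything about whether Tesseract can learn the image-to-text mapping itself, which is the open question above.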