*Dear all,*

I'm currently trying to use the Python wrapper for Tesseract (pytesseract) to correct the rotation, in multiples of 90 degrees, of scanned images of Tamil newspapers. Specifically, I want to use pytesseract.image_to_osd(binary, config = '--oem 0 -l tam --psm 0') to get the orientation (OSD) data for each image so that I can correct it. I tried --oem 0, 1, 2 and 3, and none of them worked, even when the legacy engine was requested.
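For context, the correction step that the OSD data is meant to feed could look roughly like the sketch below. It only parses the plain-text report in the format quoted further down for the fra and eng packs. The helper names (parse_osd, rotation_needed, correct_rotation) are hypothetical, the image is assumed to be a Pillow Image, and the "Rotate" value is assumed to be the clockwise angle needed to make the page upright.

```python
def parse_osd(osd_text):
    """Parse the plain-text OSD report into a dict of field name -> value (strings)."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "key: value"
        fields[key.strip()] = value.strip()
    return fields


def rotation_needed(osd_text):
    """Return the 'Rotate' field as an integer angle in [0, 360)."""
    return int(parse_osd(osd_text)["Rotate"]) % 360


def correct_rotation(image, osd_text):
    """Rotate a Pillow image by the angle reported in the OSD text.

    Assumption: 'Rotate' is a clockwise correction. Pillow's Image.rotate()
    turns counter-clockwise, so the angle is negated here.
    """
    angle = rotation_needed(osd_text)
    if angle == 0:
        return image
    return image.rotate(-angle, expand=True)
```

With the sample report from the fra/eng runs, rotation_needed() returns 90, so correct_rotation() would turn that page by 90 degrees under the clockwise assumption stated above.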
Error for --oem 0 and 2:

    File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
        raise TesseractError(proc.returncode, get_errors(error_string))
    pytesseract.pytesseract.TesseractError: (1, "Warning, detects only orientation with -l tam Error: Tesseract (legacy) engine requested, but components are not present in C:\\Program Files\\Tesseract-OCR\\tessdata/tam.traineddata!! Failed loading language 'tam' Tesseract couldn't load any languages! Could not initialize tesseract.")

Error for --oem 1 and 3:

    File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
        raise TesseractError(proc.returncode, get_errors(error_string))
    pytesseract.pytesseract.TesseractError: (1, 'Warning, detects only orientation with -l tam Error, OSD requires a model for the legacy engine')

So a legacy model for Tamil is needed for this task. I used the tam.traineddata from the legacy+LSTM repository at <https://github.com/tesseract-ocr/tessdata>. However, as stated at the bottom of that page, "The legacy tesseract models (--oem 0) have been removed for Indic and Arabic script language files."

The legacy fra and eng packs work perfectly when I run:

    pytesseract.image_to_osd(binary, config = '--oem 0 -l fra --psm 0')
    pytesseract.image_to_osd(binary, config = '--oem 2 -l fra --psm 0')

and

    pytesseract.image_to_osd(binary, config = '--oem 0 -l eng --psm 0')
    pytesseract.image_to_osd(binary, config = '--oem 2 -l eng --psm 0')

The output looks like this:

    Page number: 0
    Orientation in degrees: 270
    Rotate: 90
    Orientation confidence: 0.89
    Script: Latin
    Script confidence: 8.38

I guess the legacy Tamil pack was removed because the Tamil legacy engine performed poorly.
However, since I'm only trying to get the orientation of text in binarized images, would it be possible to give me access to the legacy Tamil model? If that is not possible, do you have any other suggestions for my case?

Thank you for taking the time to read this email, and have a great day!

*Sincerely,*
*Siyou*