[tesseract-ocr] Need Tamil Language Legacy model for OSD

Captain Odyssey Tue, 05 Nov 2024 10:56:42 -0800

*Dear all,*

I'm currently trying to use the Python wrapper for Tesseract (pytesseract) 
to correct the rotation, in multiples of 90 degrees, of images of Tamil 
newspapers. Specifically, I want to use 
pytesseract.image_to_osd(binary, config = '--oem 0 -l tam --psm 0') to get 
the orientation (OSD) data of the individual images so that I can correct 
them. I tried --oem 0, 1, 2 and 3, and none of them worked, even when 
requesting the legacy engine.


Error for --oem 0 and 2:
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Warning, detects only 
orientation with -l tam Error: Tesseract (legacy) engine requested, but 
components are not present in C:\\Program 
Files\\Tesseract-OCR\\tessdata/tam.traineddata!! Failed loading language 
'tam' Tesseract couldn't load any languages! Could not initialize 
tesseract.")

Error for --oem 1 and 3:
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Warning, detects only 
orientation with -l tam Error, OSD requires a model for the legacy engine')

Indeed, a legacy model for Tamil is needed for this task, and I used the 
tam.traineddata from the legacy+LSTM repository 
<https://github.com/tesseract-ocr/tessdata>. However, as stated at the 
bottom of that page, "The legacy tesseract models (--oem 0) have been 
removed for Indic and Arabic script language files."

The legacy fra and eng packs work perfectly when I run
pytesseract.image_to_osd(binary, config = '--oem 0 -l fra --psm 0')
pytesseract.image_to_osd(binary, config = '--oem 2 -l fra --psm 0')
and
pytesseract.image_to_osd(binary, config = '--oem 0 -l eng --psm 0')
pytesseract.image_to_osd(binary, config = '--oem 2 -l eng --psm 0')

The output looks like this:
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38
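
To show how this output could drive the rotation fix, here is a minimal sketch that parses the text above into a dictionary. The helper name parse_osd is hypothetical (it is not part of pytesseract), and the sketch assumes the output always uses the "key: value" layout shown above.

```python
def parse_osd(osd_text: str) -> dict:
    """Parse "key: value" lines of OSD text into a dict (hypothetical helper)."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not in "key: value" form
        key, value = key.strip(), value.strip()
        # Try int first, then float, otherwise keep the raw string.
        try:
            fields[key] = int(value)
        except ValueError:
            try:
                fields[key] = float(value)
            except ValueError:
                fields[key] = value
    return fields


# Sample text copied from the output shown above.
sample = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38"""

osd = parse_osd(sample)
print(osd["Rotate"], osd["Script"])
```

The "Rotate" value is the field I would then use to decide how far to turn each image.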

I guess the legacy Tamil pack was removed because the Tamil legacy engine 
worked poorly. However, since I'm only trying to get the orientation of 
text in binarized images, would it be possible for you to give me access 
to its legacy model? If this is not possible, do you have any other 
suggestions for my case?

Thank you for taking the time to read this email, and have a great day!

*Sincerely,*
*Siyou*
