[tesseract-ocr] Re: tesseract whitelist not working

Tom Morris Mon, 07 Apr 2025 11:21:45 -0700

That looks like it's probably a character encoding issue with how 
pytesseract constructs/uses its command line. You might try putting the 
whitelist in a config file and passing that instead to work around the 
issue.
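
A rough sketch of that workaround, untested against the image in this 
thread. It assumes a Tesseract config file holds one "name value" pair per 
line and that a config file path can be passed through pytesseract's 
`config` string. The file name `greek_whitelist.cfg` and the helper names 
are made up for illustration, and the path must not contain spaces.

```python
import os
import tempfile


def write_whitelist_config(chars, path):
    """Write a Tesseract config file that sets tessedit_char_whitelist."""
    # Drop line breaks that a mail client or editor may have inserted;
    # the whole whitelist has to sit on a single config line.
    cleaned = "".join(ch for ch in chars if ch not in "\r\n")
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("tessedit_char_whitelist " + cleaned + "\n")
    return path


def build_config(config_path):
    # The config file path goes at the end of the extra arguments,
    # so the whitelist itself never appears on the command line.
    return "--oem 3 --psm 6 {}".format(config_path)


def ocr_with_whitelist(image, chars, lang="eng+ell"):
    # Imported lazily so the helpers above work without pytesseract.
    import pytesseract

    cfg_path = os.path.join(tempfile.mkdtemp(), "greek_whitelist.cfg")
    write_whitelist_config(chars, cfg_path)
    return pytesseract.image_to_string(
        image, lang=lang, config=build_config(cfg_path)
    )
```

Calling `ocr_with_whitelist(img1, whitelist)` would then replace the 
`-c tessedit_char_whitelist=...` argument from the original code.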


You don't mention what language model(s) you are using. If you are using 
eng+grc, you might try script/Latin+script/Greek to see if it improves 
things.
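
In pytesseract terms that would look roughly like the sketch below. It 
assumes the script models are installed under `tessdata/script/`; the 
helper names are illustrative and not part of pytesseract.

```python
def join_models(*models):
    """Join Tesseract model names with '+', as the -l option expects."""
    return "+".join(m for m in models if m)


def ocr_with_scripts(image, models=("script/Latin", "script/Greek")):
    # Imported lazily so join_models() works without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(
        image, lang=join_models(*models), config="--oem 3 --psm 6"
    )
```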

Tom

On Tuesday, April 1, 2025 at 9:22:12 PM UTC-4 kylefo...@gmail.com wrote:

> I'm using Tesseract with Python because it's too difficult to OCR when the 
> languages are mixed between the Greek alphabet and the Latin alphabet.  I 
> was hoping that the whitelist feature would solve that problem.  But this 
> is not the case.  When I input the following whitelist, 
>
>
> αςερτυθιοπλκξηγφδσζχψωβνμΣΕΡΤΥΘΙΟΠΛΚΞΗΓΦΔΣΑΖΧΨΩΒΝΜΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=|1234567890
>
>
> I get a reasonably good output for the Latin characters, but the Greek 
> text is only roughly 75% accurate.  For example, here is an output:
>
>
>
> Contracted nouns and adjectives in -ους from -οος 63
> Adjectives of material in -ots from -εος 64
> Nouns in ts, -εως and -υς/-υ, -εως 65
>
> But the correct output should be -οῦς, not -ots.
>
>
> However, even if the accuracy were 100%, that whitelist will not solve my 
> problem because it does not use the diacritics.  So when I use a whitelist 
> with diacritics, such as
>
>
> "ΆᾺΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏ
> ὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨῬΡΣΤΘὝὛὙΎΥὟΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐε
> ἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηῃήὴἤἢᾒᾓγϊῖιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧω
> ὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόὸώὼᾣᾥπφψῤῥρςστθὖϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=|1234567890
>  
> "
>
> I get the output:
>
> ΝΕΗΟΓΑΑΠΚ
> Α
> ΑΟΗΠΓΠΟΠ
> ΑΟΕΠΓ
> ΑΕΠΓΟ
> ΑΠ
> ἸΑΓΝΠΑΟΕΕ
> ΡΟΡΟΠ
> ΑΙΟΓΠΊ
> ΠΟΙΠΕΟΓΠΓΕΠΟΏΡΒΡ
> ΑΓ Ι
> ΙΠΠΠΠΊΒΠ
>
> I've tried locating the characters that are messing things up but there 
> are too many.  But it is certainly not any of these characters: 
> \/?<>{}[]()*&,;.:-+=|
>
> The image I'm trying to scan is uploaded.  Here is the exact Python code 
> I'm using:
>
> ```
> import pytesseract
> custom_oem_psm_config = '--oem 3 --psm 6 -c tessedit_char_whitelist="{}"'.format(
> "ΆᾺΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏ
> ὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨῬΡΣΤΘὝὛὙΎΥὟΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐε
> ἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηῃήὴἤἢᾒᾓγϊῖιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧω
> ὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόὸώὼᾣᾥπφψῤῥρςστθὖϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=|
>  
> "
> )
> str4 = pytesseract.image_to_string(img1, config=custom_oem_psm_config, lang='eng+ell')
> print(str4)
> ```
>
> I'm using pytesseract 0.3.13 and I have tesseract 5.3.8 installed.

