[tesseract-ocr] Re: use of unicharambigs

'Isidore Paris' via tesseract-ocr Sun, 26 Mar 2023 14:35:11 -0700

Ciao,

Thanks for sharing!
I have the same problem with script/Fraktur.traineddata, which is far 
better than plain frk.traineddata, but I found that the word list and the 
unicharset contain all the European accented characters (French, Italian and 
Spanish: âêîôû, æ, œ, àèìòù, áéíóúñ, ¡ ¿, and their capitals) plus other 
useless characters (€ Þ), which are absolutely unknown in old German.
Could it be that, for Tesseract, "Fraktur" is not only for the German language?


I solved my problem with ">" and "<" by modifying the unicharset file and 
replacing these characters, *in the first column only*, with "ck" and "ch" (I 
also tried modifying the two fields after the # ["# ck [63 6b"], but it made 
no difference).
I tried the same modification on "ô" and "ó" to get "o", but it doesn't 
work, even with a modified word list from which I removed all words 
containing these letters.
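
The first-column edit described above can be scripted rather than done by hand. A minimal Python sketch, assuming the unicharset has already been extracted to a text file, that its first line holds the entry count, and that every later line starts with the character followed by a space; the function name and the sample rows are hypothetical, and "PROPS" stands in for the remaining fields:

```python
def remap_first_column(lines, mapping):
    """Return unicharset lines with the first field replaced per `mapping`."""
    out = []
    for index, line in enumerate(lines):
        # Assumption: line 0 is the entry count, so it is left untouched.
        if index == 0:
            out.append(line)
            continue
        # Split off the first space-separated field only; the rest of the
        # line (properties, comment after '#') is copied unchanged.
        head, sep, rest = line.partition(" ")
        out.append(mapping.get(head, head) + sep + rest)
    return out


# Hypothetical, shortened rows; real unicharset rows carry more fields.
sample = ["3", "NULL PROPS", "> PROPS", "< PROPS"]
for row in remap_first_column(sample, {">": "ck", "<": "ch"}):
    print(row)
```

This only rewrites text lines; putting the edited file back into the traineddata is a separate step that the sketch does not cover.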

I also noticed that the word list seems to have absolutely no effect: 
swapping the list (replacing the "best" list with the "lstm" list) doesn't 
change anything in the result…

Best regards,
Isidore.

On Monday, 20 March 2023 at 19:53:01 UTC+1, andrea....@gmail.com wrote:

> Hi,
>
> no, unicharambigs is not used by LSTM files. It was used in the legacy 
> mode.
>
> I'm having similar problems with the Ancient Greek best traineddata: 
> unfortunately it has been trained with some non standard characters (ά έ ή 
> ί ό ύ ώ, instead of  ά έ ή ί ό ύ ώ). I tried fine tuning the 
> grc.traineddata, but without much success, so, for the time being, I'm 
> producing hOCR files, post-processing them, and then using hocr-pdf to 
> create a searchable PDF.
>
>
> best,
> andrea
> On Monday, March 13, 2023 at 5:13:33 PM UTC+1 Isidore Paris wrote:
>
>> Hi,
>> I'm doing some frk text recognition, and in my results, I have a great 
>> number of " > ". Each one should be replaced by " ck ".
>> I updated my frk.traineddata file (from tessdata_best repository) with a 
>> frk.unicharambigs file (I tried both formats v1 and v2) but absolutely 
>> nothing changed.
>> I also tried the parameter " -c use_ambigs_for_adaption=1 " to see if 
>> maybe it was needed, but still nothing changed, not a single character (> 
>> and = and / are all still there).
>>
>> Here is the content of my v2 frk.unicharambigs file:
>> v2
>> > ck 1
>> = - 1
>> / - 1
>>
>> Does unicharambigs not work with LSTM files? Or did I miss some 
>> particular or special step?
>>
>
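
Since unicharambigs is ignored by LSTM models, the substitutions from the v2 file quoted above could be applied to the recognized text afterwards instead. A minimal Python sketch, assuming plain-text output (not hOCR markup, where ">" also closes tags) and assuming these characters never occur legitimately in the scanned text; the names SUBSTITUTIONS and fix_ocr_text are hypothetical:

```python
# Same pairs as in the v2 unicharambigs file: wrong character -> replacement.
SUBSTITUTIONS = {">": "ck", "=": "-", "/": "-"}

# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(SUBSTITUTIONS)


def fix_ocr_text(text):
    """Replace every misrecognized character in `text` using SUBSTITUTIONS."""
    return text.translate(_TABLE)


print(fix_ocr_text("Zu>er"))
```

For hOCR output, the same table would have to be applied to the word text only, after parsing the markup, rather than to the raw file.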
