Hi, thanks for sharing! I have the same problem with script/Fraktur.traineddata, which is far better than the plain frk.traineddata. However, I found that both its word list and its unicharset contain all the accented characters of other European languages (French, Italian and Spanish: âêîôû, æ, œ, àèìòù, áéíóúñ, ¡ ¿, and their capitals) as well as other useless characters (€ Þ), none of which exist in old German. Could it be that, for Tesseract, "Fraktur" is not meant only for the German language?
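A minimal sketch of how one could list such unexpected entries, assuming the unicharset has already been extracted from the traineddata and that, as described in this thread, the character sits in the first column of each line. The function name `unexpected_unichars`, the set `GERMAN_EXTRA` and the sample text are hypothetical and only for illustration; real unicharset lines carry more fields than the simplified sample shows.

```python
# Assumption: non-ASCII letters that are legitimate in old German Fraktur text.
# This set is illustrative and not taken from Tesseract.
GERMAN_EXTRA = set("äöüÄÖÜßſ")


def unexpected_unichars(unicharset_text):
    """Return first-column entries that contain a non-ASCII character
    outside GERMAN_EXTRA."""
    found = []
    for line in unicharset_text.splitlines():
        fields = line.split()
        # Skip blank lines and the leading count line (a single field).
        if len(fields) < 2:
            continue
        unichar = fields[0]
        if any(ord(ch) > 127 and ch not in GERMAN_EXTRA for ch in unichar):
            found.append(unichar)
    return found


# Simplified, made-up sample: a count line, then "<char> <other fields>".
sample = "5\nNULL 0\na 3\nô 3\n€ 0\nü 3\n"
print(unexpected_unichars(sample))
```

Running the same function on the real file would show which entries come from other languages before deciding whether editing or retraining is worth the effort.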
I solved my problem with ">" and "<" by modifying the unicharset file and replacing these characters, *in the first column only*, with "ck" and "ch". (I also tried modifying the two fields after the "#", e.g. "# ck [63 6b", but it made no difference.) I tried the same modification on "ô" and "ó" to get "o", but it doesn't work, even with a modified word list from which I removed all words containing these letters.

I also noticed that the word list seems to have absolutely no effect: changing the list (replacing the "best" list with the "lstm" list) doesn't change anything in the result…

Best regards,
Isidore

On Monday, 20 March 2023 at 19:53:01 UTC+1, andrea....@gmail.com wrote:
> Hi,
>
> no, unicharambigs is not used by LSTM files. It was used in the legacy
> mode.
>
> I'm having similar problems with the ancient greek best traineddata:
> unfortunately it has been trained with some non standard characters (ά έ ή
> ί ό ύ ώ, instead of ά έ ή ί ό ύ ώ). I tried fine tuning the
> grc.traineddata, but without very much success, so, for the time being, I'm
> producing hocr files, post-process them and then use hocr-pdf to create a
> searchable pdf.
>
> best,
> andrea
>
> On Monday, March 13, 2023 at 5:13:33 PM UTC+1 Isidore Paris wrote:
>
>> Hi,
>> I'm doing some frk text recognition, and in my results, I have a great
>> number of " > ". Each one should be replaced by " ck ".
>> I updated my frk.traineddata file (from tessdata_best repository) with a
>> frk.unicharambigs file (I tried both formats v1 and v2) but absolutely
>> nothing changed.
>> I also tried the parameter " -c use_ambigs_for_adaption=1 " to see if
>> maybe it was needed, but still nothing changed, not a single character (>
>> and = and / are all still there).
>>
>> Here is the content of my v2 frk.unicharambigs file:
>> v2
>> > ck 1
>> = - 1
>> / - 1
>>
>> Does unicharambigs not work with LSTM files? Or did I miss some
>> particular or special step?
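The post-processing workaround mentioned in the quoted reply can be sketched for plain-text output as follows. This is only an illustration: the function name `fix_ocr_text` and the table `REPLACEMENTS` are hypothetical, and the mappings are simply the ones discussed in this thread. It should not be applied to raw hOCR markup, where "<" and ">" delimit tags; there, only the text content would have to be rewritten.

```python
# Replacements discussed in this thread; extend as needed.
REPLACEMENTS = {">": "ck", "<": "ch", "ô": "o", "ó": "o"}


def fix_ocr_text(text, replacements=REPLACEMENTS):
    """Replace each single-character key with its replacement string."""
    # str.maketrans accepts a dict whose keys are single characters
    # and whose values may be strings of any length.
    table = str.maketrans(replacements)
    return text.translate(table)


print(fix_ocr_text("Glü> und Re<t"))
```

Unlike editing the unicharset, this leaves the traineddata untouched, so it keeps working after a model update, at the cost of also rewriting any genuine ">" or "<" in the recognized text.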