On Tue, Mar 13, 2012, kobi zamir wrote about "Re: Unicode in C":
> imho because hspell only use hebrew, it can internally continue to use
> hebrew only charset without nikud iso-8859-8 (or with nikud win-1255).
I agree, and this has been my feeling all along. By using iso-8859-8
internally (and for the basic word lookup, an even more optimized 5-bit
encoding) instead of utf-8, Hspell's memory usage is at least halved,
because every Hebrew letter takes two bytes in utf-8 but only one in
iso-8859-8.

> it will be helpful if hspell will give the user convenience functions. this
> functions will that take utf-8 and return utf-8. the functions will convert
> the utf-8 to the hebrew only coding that hspell will use internally.

So I guess that you're also in the UTF-8 camp. That's also the direction
I'm leaning. But the question is - the day after Hspell gets a UTF-8 API,
will people start complaining that it doesn't have a UTF-16, UTF-32, or
some other sort of API? And don't answer "if they want UTF-16, let them use
iconv to convert UTF-16 to UTF-8 and back" - after all they can do this now
with ISO-8859-8 (and like you said, Enchant is doing exactly that) and
still people complain ;-)

> p.s.
> i will be happy if hspell will give easy to use functions for using the
> library lingual info. in current version of hspell using lingual info is
> very hard. see:
> http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala

I agree that the linginfo (aka morphological analyzer) C API needs an
overhaul. Out of embarrassment, it's not even documented in hspell(3) :-)
It could also have been implemented more efficiently (memory-wise) than it
is. But following the maxim "If it ain't broken, don't fix it", we haven't
touched this code in years :(

P.S. Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gui
has a bug: it claims that החתול might mean ה+חתול with the second word
being in construct form (סמיכות). But this isn't a valid split - the
construct form cannot be preceded by the definite article (ה) - and Hspell
knows this (try running hspell -al or going to the demo at
http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi to check).
Similarly, הירוק only has one legal meaning ("the green") and the two other
meanings listed in the png on your site are *wrong*. So it appears
something is wrong with your word splitting code? This is surprising if
you're using libhspell... I didn't look at your code to see where it went
wrong.

Nadav.

-- 
Nadav Har'El                        | Tuesday, Mar 13 2012,
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |And now for some feedback:
http://nadav.harel.org.il           |EEEEEEEEEEEEEEEEEEEEEEEEEEE
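
For concreteness, the UTF-8 convenience layer discussed above could look
roughly like the sketch below. The function names utf8_to_iso8859_8 and
iso8859_8_to_utf8 are hypothetical and are not part of libhspell. The sketch
assumes input made only of ASCII and the letters alef through tav
(U+05D0..U+05EA, which ISO-8859-8 places at 0xE0..0xFA), and it rejects
everything else, including nikud.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of libhspell: convert a NUL-terminated
 * UTF-8 string to ISO-8859-8. Returns the number of bytes written
 * (excluding the NUL), or -1 if a character cannot be represented or
 * the output buffer is too small. */
long utf8_to_iso8859_8(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (outsize == 0)
        return -1;
    while (*p != '\0') {
        unsigned char c;
        if (*p < 0x80) {
            c = *p;                     /* ASCII maps to itself */
            p += 1;
        } else if (p[0] == 0xD7 && p[1] >= 0x90 && p[1] <= 0xAA) {
            /* U+05D0..U+05EA is encoded as D7 90..D7 AA in UTF-8 */
            c = (unsigned char)(0xE0 + (p[1] - 0x90));
            p += 2;
        } else {
            return -1;                  /* nikud, other scripts, bad input */
        }
        if (n + 1 >= outsize)
            return -1;                  /* no room for this byte plus NUL */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return (long)n;
}

/* Hypothetical inverse of the helper above: ISO-8859-8 back to UTF-8. */
long iso8859_8_to_utf8(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (outsize == 0)
        return -1;
    for (; *p != '\0'; p++) {
        size_t need = (*p < 0x80) ? 1 : 2;
        if (*p >= 0x80 && (*p < 0xE0 || *p > 0xFA))
            return -1;                  /* not a Hebrew letter */
        if (n + need >= outsize)
            return -1;                  /* no room for output plus NUL */
        if (need == 1) {
            out[n++] = (char)*p;
        } else {
            out[n++] = (char)0xD7;
            out[n++] = (char)(0x90 + (*p - 0xE0));
        }
    }
    out[n] = '\0';
    return (long)n;
}
```

A caller would convert the UTF-8 word with the first function, pass the
result to the existing ISO-8859-8 interface, and convert any returned
suggestions back with the second. The word שלום, for example, is 8 bytes in
UTF-8 and 4 bytes after conversion, which is where the halving comes from.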