On 02/10/2015 01:01 PM, Aleix Pol wrote:
> I like the idea. Have you checked whether ICU provides something like this? They might...
To approach this more broadly: the basic problem here is that not every code point in Unicode stands for a single phoneme; in the examples I mentioned, a character can be a syllable or more. This makes a check on the number of characters a poor check for information content, since fewer than three characters can easily pack enough phonetic information to make distinct words. (Consider also tonal languages, which encode meaning in dimensions of sound that English does not use.)

I figured we're not the first ones to hit this problem, so I did some basic research on whether Unicode's character database has enough metadata for a scoring algorithm and whether there's a well-established scoring algorithm around (including looking at ICU). So far I haven't found anything beyond the basic idea of exploiting the character classification to assign average phoneme counts. The good news is that as soon as we centralize this into one implementation, we're free to improve it later on (which I would expect to do as I hang out more in this problem space in the future).

This is also why I want to keep the API very minimal for now, either this:

    bool isMinimalSearchableLength(QString)

or at most:

    bool isMinimalSearchableLength(QString, int approximateLength = 3)

Naming subject to improvements of course, but approximateLength here would be like "think of it like a phoneme count": it is the target the implementation approximates a score against, and the parameter lets the dev override it. I'm not sure I even want to expose that second parameter though, since it constrains the behavior of the implementation.

This is also something I'd really like feedback on: as a dev, how would you want to use it?
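To make the scoring idea a bit more concrete, here is a rough sketch of what "assign average phoneme counts by character classification" could look like. Everything in it is an assumption for illustration: it uses std::u32string as a stand-in for QString so it compiles without Qt, the helper approximatePhonemeWeight is hypothetical, and the code point ranges and weights are guesses rather than values taken from ICU or the Unicode data files.

```cpp
#include <string>

// Hypothetical weight: roughly how many phonemes a single code point
// carries on average. Ranges and weights are illustrative guesses only.
int approximatePhonemeWeight(char32_t c)
{
    if (c >= 0x4E00 && c <= 0x9FFF) return 3; // CJK Unified Ideographs
    if (c >= 0xAC00 && c <= 0xD7A3) return 3; // Hangul syllables
    if (c >= 0x3040 && c <= 0x30FF) return 2; // Hiragana and Katakana
    return 1;                                 // everything else: one unit
}

// Sketch of the proposed check: sum the per-character weights and stop
// as soon as the running score reaches the target.
bool isMinimalSearchableLength(const std::u32string &text,
                               int approximateLength = 3)
{
    int score = 0;
    for (char32_t c : text) {
        score += approximatePhonemeWeight(c);
        if (score >= approximateLength) {
            return true;
        }
    }
    return false;
}
```

With these made-up weights, a two-letter Latin string is still rejected at the default target, while a single CJK ideograph or two kana pass. A real implementation would presumably derive the classification from proper script data instead of hard-coded ranges.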
> Aleix
Cheers,
Eike
_______________________________________________
Kde-frameworks-devel mailing list
Kde-frameworks-devel@kde.org
https://mail.kde.org/mailman/listinfo/kde-frameworks-devel