Hi. I've attached a patch to contrib/unaccent as outlined in my review the other day. I'm familiar with multiple languages in which modifiers are separate characters (but not Arabic), so I decided to try a quick test because I was curious.
I added a line containing only U+0940 (DEVANAGARI VOWEL SIGN II) to unaccent.rules, and tried the following (the argument to unaccent is U+0915 U+0940, and the result is U+0915):

    ams=# select unaccent('unaccent','की ');
     unaccent
    ----------
     क
    (1 row)

So the patch works fine: it correctly removes the modifier.

To add a test, however, it would be necessary to add this modifier to unaccent.rules. But if we're adding one modifier to unaccent.rules, we really should add them all. I have nowhere near the motivation needed to add all the Devanagari modifiers, let alone any of the other languages I know, and even if I did, it still wouldn't address Mohammad's use case.

(As a separate matter, it's not clear to me if stripping these modifiers using unaccent is something everyone will want to do.)

So, though I'm not fond of saying it, perhaps the right thing to do is to forget my earlier objection (that the patch didn't have tests), and just commit as-is. It's a pretty straightforward patch, and it works.

I'm setting this as ready for committer.

-- अभजत "unaccented in three languages" മനന-সন
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index a337df6..c485a41 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -105,15 +105,16 @@ initTrie(char *filename)
             while ((line = tsearch_readline(&trst)) != NULL)
             {
                 /*
-                 * The format of each line must be "src trg" where src and trg
-                 * are sequences of one or more non-whitespace characters,
-                 * separated by whitespace.  Whitespace at start or end of
-                 * line is ignored.
+                 * The format of each line must be "src" or "src trg",
+                 * where src and trg are sequences of one or more
+                 * non-whitespace characters, separated by whitespace.
+                 * Whitespace at start or end of line is ignored. If trg
+                 * is omitted, an empty string is used as a replacement.
                  */
                 int         state;
                 char       *ptr;
                 char       *src = NULL;
-                char       *trg = NULL;
+                char       *trg = "";
                 int         ptrlen;
                 int         srclen = 0;
                 int         trglen = 0;
@@ -160,6 +161,10 @@ initTrie(char *filename)
                     }
                 }
 
+                /* It's OK to have a valid src and empty trg. */
+                if (state > 0 && trglen == 0)
+                    state = 5;
+
                 if (state >= 3)
                     rootTrie = placeChar(rootTrie,
                                          (unsigned char *) src, srclen,
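
For anyone who wants to see the new line format in isolation, here is a minimal standalone sketch of the rule the patch's comment describes: a line is either "src" or "src trg", and a missing trg means "replace with the empty string". This is not the state machine from unaccent.c. The names RuleLine, is_space and parse_rule_line are hypothetical, the whitespace test is plain ASCII rather than whatever the tsearch code uses, and rejecting lines with more than two fields is my assumption.

```c
#include <stddef.h>

/* Hypothetical result type for illustration; not part of unaccent.c. */
typedef struct
{
    const char *src;     /* start of src within the line, or NULL */
    size_t      srclen;  /* length of src in bytes */
    const char *trg;     /* start of trg within the line, or "" if omitted */
    size_t      trglen;  /* length of trg in bytes; 0 if omitted */
} RuleLine;

/* ASCII-only whitespace test (an assumption made for this sketch). */
static int
is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * Parse a line of the form "src" or "src trg".
 * Returns 1 on success, 0 if the line is blank or has extra fields.
 */
static int
parse_rule_line(const char *line, RuleLine *out)
{
    const char *p = line;

    out->src = NULL;
    out->srclen = 0;
    out->trg = "";       /* default replacement: the empty string */
    out->trglen = 0;

    /* Skip leading whitespace; a blank line is not a rule. */
    while (is_space(*p))
        p++;
    if (*p == '\0')
        return 0;

    /* First field: src. */
    out->src = p;
    while (*p != '\0' && !is_space(*p))
        p++;
    out->srclen = (size_t) (p - out->src);

    /* If nothing but whitespace follows, trg was omitted. */
    while (is_space(*p))
        p++;
    if (*p == '\0')
        return 1;

    /* Second field: trg. */
    out->trg = p;
    while (*p != '\0' && !is_space(*p))
        p++;
    out->trglen = (size_t) (p - out->trg);

    /* Only trailing whitespace may follow trg. */
    while (is_space(*p))
        p++;
    return *p == '\0';
}
```

With this sketch, a line holding only the UTF-8 bytes of U+0940 parses to a three-byte src and a zero-length trg, which corresponds to the rule I added to unaccent.rules for the test above.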
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers