On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote: > Actually, that gets optimized to a constant in the planner:
Oops, thank you (and Tom). > your patch seems to be about 2x-3x as slow as unpatched master. So this > needs some optimization. A couple of ideas: I didn't see anywhere near that kind of regression. On unpatched master, with your test case, I saw it stabilize to about 680ms. With similar-escape-1, I saw about 775ms (15% regression). Are those at all close to your numbers? Is there a chance you used an unoptimized build for one of them, or left asserts enabled? > 1. If the escape string is in fact a single-byte character, you can > proceed with the loop just as it is today, without the pg_mblen calls. > > 2. Since pg_mblen() will always return an integer between 1-6, it would > probably be faster to replace the memcpy() and memcmp() calls with > simple for-loops iterating byte-by-byte. > > In very brief testing, with the 1. change above, the performance with > this patch is back to what it's without the patch. See attached. The particular patch has a mistake: the first branch is always taken because pg_mblen() won't return 0. It's also fairly ugly to set mblen in the test for the branch that doesn't use it. Attached a patch implementing the same idea though: only use the multibyte path if *both* the escape char and the current character from the pattern are multibyte. I also changed the comment to more clearly state the behavior upon which we're relying. I hope what I said is accurate. Regards, Jeff Davis
*** a/src/backend/utils/adt/regexp.c --- b/src/backend/utils/adt/regexp.c *************** *** 688,698 **** similar_escape(PG_FUNCTION_ARGS) elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else if (elen != 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg("invalid escape string"), ! errhint("Escape string must be empty or one character."))); } /*---------- --- 688,703 ---- elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else ! { ! int escape_mblen = pg_mbstrlen_with_len(e, elen); ! ! if (escape_mblen > 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg("invalid escape string"), ! errhint("Escape string must be empty or one character."))); ! } } /*---------- *************** *** 724,729 **** similar_escape(PG_FUNCTION_ARGS) --- 729,782 ---- { char pchar = *p; + /* + * If both the escape character and the current character from the + * pattern are multi-byte, we need to take the slow path. + * + * But if one of them is single-byte, we can process the the pattern + * one byte at a time, ignoring multi-byte characters. (This works + * because all server-encodings have the property that a valid + * multi-byte character representation cannot contain the + * representation of a valid single-byte character.) + */ + + if (elen > 1) + { + int mblen = pg_mblen(p); + if (mblen > 1) + { + /* slow, multi-byte path */ + if (afterescape) + { + *r++ = '\\'; + memcpy(r, p, mblen); + r += mblen; + afterescape = false; + } + else if (e && elen == mblen && memcmp(e, p, mblen) == 0) + { + /* SQL99 escape character; do not send to output */ + afterescape = true; + } + else + { + /* + * We know it's a multi-byte character, so we don't need + * to do all the comparisons to single-byte characters + * that we do below. + */ + memcpy(r, p, mblen); + r += mblen; + } + + p += mblen; + plen -= mblen; + + continue; + } + } + + /* fast path */ if (afterescape) { if (pchar == '"' && !incharclass) /* for SUBSTRING patterns */
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers