I wrote:
>> I think most likely we should report this to Snowball upstream
>> and see what they think is an appropriate fix.

> Done at [1], and I pushed the other fixes.  Thanks again for the report!

The upstream recommendation, which seems pretty sane to me, is to
simply reject any string exceeding some threshold length as not
possibly being a word.  Apparently it's common to use thresholds
as small as 64 bytes, but in the attached I used 1000 bytes.

                        regards, tom lane

diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..aaf4ff72b6 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -272,11 +272,25 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
 	DictSnowball *d = (DictSnowball *) PG_GETARG_POINTER(0);
 	char	   *in = (char *) PG_GETARG_POINTER(1);
 	int32		len = PG_GETARG_INT32(2);
-	char	   *txt = lowerstr_with_len(in, len);
 	TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);
+	char	   *txt;
 
+	/*
+	 * Reject strings exceeding 1000 bytes, as they're surely not words in any
+	 * human language.  This restriction avoids wasting cycles on stuff like
+	 * base64-encoded data, and it protects us against possible inefficiency
+	 * or misbehavior in the stemmers (for example, the Turkish stemmer has an
+	 * indefinite recursion so it can crash on long-enough strings).
+	 */
+	if (len <= 0 || len > 1000)
+		PG_RETURN_POINTER(res);
+
+	txt = lowerstr_with_len(in, len);
+
+	/* txt is probably not zero-length now, but we'll check anyway */
 	if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
 	{
+		/* empty or stopword, so reject */
 		pfree(txt);
 	}
 	else
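
For reference, a minimal SQL-level check of the new behaviour, assuming the
patch is applied and using the stock english_stem snowball dictionary, might
look like this (illustrative only, not part of the patch):

    -- an ordinary word still gets stemmed as before
    SELECT ts_lexize('english_stem', 'stemming');
    -- expected: {stem}

    -- a 2000-byte token exceeds the 1000-byte cap, so with the patch the
    -- dictionary should return no lexemes (same as a stopword) instead of
    -- handing the string to the stemmer
    SELECT ts_lexize('english_stem', repeat('x', 2000));
    -- expected: {}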
