On Wed, Aug 31, 2022 at 6:57 AM Tom Lane <t...@sss.pgh.pa.us> wrote:

> I wrote:
> > The upstream recommendation, which seems pretty sane to me, is to
> > simply reject any string exceeding some threshold length as not
> > possibly being a word.  Apparently it's common to use thresholds
> > as small as 64 bytes, but in the attached I used 1000 bytes.
>
> On further thought: that coding treats anything longer than 1000
> bytes as a stopword, but maybe we should just accept it unmodified.
> The manual says "A Snowball dictionary recognizes everything, whether
> or not it is able to simplify the word".  While "recognizes" formally
> includes the case of "recognizes as a stopword", people might find
> this behavior surprising.  We could alternatively do it as attached,
> which accepts overlength words but does nothing to them except
> case-fold.  This is closer to the pre-patch behavior, but gives up
> the opportunity to avoid useless downstream processing of long words.


This patch looks good to me. It keeps overly long words (> 1000 bytes)
from going through the stemmer, so the stack overflow in the Turkish
stemmer should no longer occur.
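
For illustration, here is a rough sketch of the behavior described above:
case-fold every word, but hand only words within the length limit to the
stemmer. This is not the patch code. normalize_word, toy_stem, and
MAX_STEM_LEN are hypothetical names, the "stemmer" is a toy stand-in that
strips a trailing "s", and plain tolower() stands in for the server's
locale-aware case folding.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumed threshold, taken from the 1000-byte figure in the thread. */
#define MAX_STEM_LEN 1000

/* Hypothetical stand-in for a real stemmer: strips one trailing 's'. */
static void
toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 1 && word[len - 1] == 's')
        word[len - 1] = '\0';
}

/*
 * Return a malloc'd, case-folded copy of word (NULL on allocation failure).
 * Words longer than MAX_STEM_LEN bytes are accepted but not stemmed, so
 * the stemmer never sees overlength input.
 */
char *
normalize_word(const char *word)
{
    assert(word != NULL);

    size_t len = strlen(word);
    char  *out = malloc(len + 1);

    if (out == NULL)
        return NULL;

    /* ASCII-only case folding; an assumption made for this sketch. */
    for (size_t i = 0; i < len; i++)
        out[i] = (char) tolower((unsigned char) word[i]);
    out[len] = '\0';

    /* Only words within the limit go through the stemmer. */
    if (len <= MAX_STEM_LEN)
        toy_stem(out);

    return out;
}
```

With this shape, a 1001-byte input comes back lowercased and otherwise
untouched, while a short word such as "Words" is folded and then stemmed.
The caller frees the result.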

Thanks
Richard
