On Wed, Dec 11, 2019 at 01:58:03PM +0100, Matus UHLAR - fantomas wrote:
> My question was whether there's a bug in the bayes code causing it to eat
> too much memory. Both ~750B per token with file-based Bayes and ~600B per
> token with Redis-based Bayes look like too much to me.
Not so much a bug, but we should probably add some internal limit on the number of parsed tokens (10000?) - a normal message would not contain more tokens than that. At those counts the per-token memory usage is irrelevant (though we could look at optimizing it too). We just need to be careful not to create a loophole for spammers (e.g. filling a few 50k parts with random short tokens so the last part won't be tokenized at all?).

Created a bug so it won't be forgotten:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7776
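To make the idea concrete, here is a rough sketch in Python (not the actual Perl tokenizer; the function names, the 10000 cap and the per-part budgeting are just assumptions for illustration) of how a cap could be spread across MIME parts so junk tokens in an early part can't starve the later ones:

    # Hypothetical sketch, not SpamAssassin code.
    MAX_TOKENS_PER_MSG = 10_000  # proposed overall cap from the discussion

    def tokenize_message(parts, max_tokens=MAX_TOKENS_PER_MSG):
        """parts: list of already-rendered text bodies, one per MIME part."""
        tokens = []
        budget = max_tokens
        for i, body in enumerate(parts):
            if budget <= 0:
                break
            # Spread the remaining budget over the remaining parts so a single
            # junk-filled part cannot consume the whole limit by itself.
            share = max(1, budget // (len(parts) - i))
            part_tokens = body.split()[:share]  # stand-in for the real tokenizer
            tokens.extend(part_tokens)
            budget -= len(part_tokens)
        return tokens

Something along those lines would keep the total token count bounded while still guaranteeing every part gets a share of the budget, instead of simply stopping after the first N tokens in the message.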