>On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
>>On 11.12.19 11:43, Henrik K wrote:
>>>Wow 6 million tokens.. :-)
>>>
>>>I assume the big uuencoded blob content-type is text/* since it's tokenized?

>>yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
>>
>>grep -c '^M' spamassassin-memory-error-<...>
>>329312
>>
>>One of former mails mentioned that 20M mail should use ~700M of RAM. 6M
>>tokens eating about 4G of RAM means ~750B per token, is that fine?

On 11.12.19 12:07, Henrik K wrote:
>I'm pretty sure the Bayes code does many dumb things with the tokens
>that result in much memory usage for abnormal cases like this.

On Wed, Dec 11, 2019 at 01:12:46PM +0100, Matus UHLAR - fantomas wrote:
>but apparently nobody notices...

On 11.12.19 14:22, Henrik K wrote:
>How many people even scan 20MB mails?  Pretty much nobody.  It's not safe
>to do until SA 3.4.3, as you can see.  Before that, I know at least
>Amavisd-new could be configured to truncate large messages before feeding
>them to SA, which was somewhat safe to do.
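
For reference, the amavisd-new knob involved here is presumably
$sa_mail_body_size_limit; as far as I know, newer amavisd-new versions feed
SA a truncated copy of an oversized message, while older ones simply skip
the SA check above that size.  Something along these lines in amavisd.conf,
but check the documentation for your version:

# assumption: exact behaviour depends on the amavisd-new version --
# older releases skip SpamAssassin entirely for messages above this size,
# newer ones pass a truncated copy instead
$sa_mail_body_size_limit = 400*1024;  # roughly 400 kB handed to SpamAssassin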

I raised the limits years ago to see how it would go.  Since then, I have
received multiple many-MB spams; most of them hit BAYES_99, and without it
they would have become FNs.

This is about the second time it has caused problems - the first time it
happened on a very slow machine, where scanning took too much time.

My question was whether there's a bug in the Bayes code causing it to eat
too much memory.  Both ~750B per token with the file-based Bayes backend
and ~600B per token with the Redis-based backend look like too much to me.
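
For what it's worth, plain Perl data structures alone account for a good
chunk of that.  A rough illustration - not the actual Bayes code, just a
made-up token hash measured with Devel::Size - of what a million in-memory
token entries cost:

#!/usr/bin/perl
# rough illustration only - NOT the SpamAssassin Bayes code;
# the token format and per-token data below are invented for the estimate
use strict;
use warnings;
use Devel::Size qw(total_size);

my %tokens;
for my $i (1 .. 1_000_000) {
    # fake token: short hex string => [spam count, ham count, atime]
    $tokens{ sprintf("%010x", $i) } = [ 1, 0, time() ];
}

my $bytes = total_size(\%tokens);
printf "total: %d bytes, per token: %.0f bytes\n",
    $bytes, $bytes / 1_000_000;

On a 64-bit perl a hash like that already runs to a couple of hundred bytes
per entry before the Bayes code adds its own bookkeeping, so 600-750B per
token may be as much Perl overhead as bug - but it would be nice if someone
who knows the code confirmed that.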


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
  One OS to rule them all, One OS to find them,
One OS to bring them all and into darkness bind them
