On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
On 11.12.19 11:43, Henrik K wrote:
>Wow 6 million tokens.. :-)
>
>I assume the big uuencoded blob content-type is text/* since it's tokenized?

yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.

grep -c '^M' spamassassin-memory-error-<...>
329312
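
As a rough cross-check of the sizes quoted above: a full uuencode line starts with 'M' and carries 45 raw bytes in 60 encoded characters, so the line count can be turned into approximate sizes. This is only a sketch; `estimate_sizes` is a made-up helper name, and `grep -c '^M'` may also count a few unrelated lines that happen to start with 'M'.

```python
# Rough size check for a uuencoded attachment, based on the line count above.
# A full uuencode line is 'M' (length character for 45 raw bytes) followed by
# 60 encoded characters and a newline.

FULL_LINE_RAW_BYTES = 45       # raw bytes carried by one full line
FULL_LINE_ENCODED_BYTES = 62   # 'M' + 60 characters + '\n'


def estimate_sizes(full_lines):
    """Return (decoded_bytes, encoded_bytes) for a number of full lines."""
    return (full_lines * FULL_LINE_RAW_BYTES,
            full_lines * FULL_LINE_ENCODED_BYTES)


decoded, encoded = estimate_sizes(329312)
print(f"decoded ~{decoded / 1e6:.1f} MB, encoded ~{encoded / 1e6:.1f} MB")
```

That agrees with the ~15M file and ~20M mail figures.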

One of the earlier mails mentioned that a 20M mail should use ~700M of RAM. 6M
tokens eating about 4G of RAM means ~750B per token; is that fine?
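
The per-token estimate is just RAM divided by token count. A minimal sketch, assuming the exact body token count reported in the debug log (6158242) and treating "4G" and the 4.8G peak as GiB; `bytes_per_token` is a hypothetical helper:

```python
def bytes_per_token(ram_bytes, tokens):
    """Average memory per token, ignoring everything else the process holds."""
    return ram_bytes / tokens


TOKENS = 6158242  # "tokenized body" count from the debug log

for label, ram in [("4G", 4 * 1024**3), ("4.8G", 4.8 * 1024**3)]:
    print(f"{label}: ~{bytes_per_token(ram, TOKENS):.0f} B per token")
```

Both values land in the same ballpark as the ~750B figure. The real per-token cost is probably somewhat lower, since this attributes all of the process memory to tokens.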

On 11.12.19 12:07, Henrik K wrote:
I'm pretty sure the Bayes code does many dumb things with the tokens
that result in much memory usage for abnormal cases like this.

but apparently nobody notices...

>This will be mitigated in 3.4.3, since it will only use max 50k of the body
>text (body_part_scan_size).

will it prefer text parts and try to avoid uuencoded or base64 parts?
(or maybe decode them?)

There is no change in how parts are processed.  As before, "body" is the
concatenated result of all textual parts.  But in 3.4.3 at least each part is
truncated to 50k.  If there are several parts, then it's 50k+50k, etc.
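
To illustrate the effect described here (not SpamAssassin's actual code, which is Perl and may cut at different boundaries), per-part truncation followed by concatenation could be sketched like this. `build_body` and `PART_SCAN_SIZE` are names invented for the example; the 50k value mirrors the body_part_scan_size limit mentioned above.

```python
PART_SCAN_SIZE = 50_000  # assumed per-part limit, mirroring the "50k" above


def build_body(text_parts, limit=PART_SCAN_SIZE):
    """Truncate each textual part to `limit` characters, then concatenate."""
    return "".join(part[:limit] for part in text_parts)


# One huge part, one small part, one moderately large part.
parts = ["a" * 200_000, "b" * 1_000, "c" * 80_000]
body = build_body(parts)
print(len(body))  # 50_000 + 1_000 + 50_000
```

So a single huge uuencoded text part contributes at most 50k to the body, but a message with many textual parts can still add up to more than that.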

I understand that such a change apparently should not be made in a minor version.

Well, I tried on a currently unused machine with 16G of RAM and moved the bayes DB
there (scanning on an account without bayes was fast even on the original machine,
with lower memory usage, maybe the ~700M mentioned earlier).

Scanning took 17 minutes, peaking at 4.8G of memory.

When I tried the same with redis (bayes DB copied there), scanning
peaked at 3.8G but took 29 minutes (???), even on a repeated test.

I understand I am probably pushing too far, but you never know in advance.

I also understand redis is great with parallel scanning.



I include logs from scanning with the filesystem bayes DB, including the places where
the biggest differences are:


Dec 11 10:45:42.261 [12972] dbg: logger: adding facilities: all
...
Dec 11 10:45:43.969 [12972] dbg: message: ---- MIME PARSER END ----
Dec 11 10:45:44.038 [12972] dbg: message: no encoding detected
Dec 11 10:46:10.379 [12972] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x5617d8cf6c48) implements 'parsed_metadata', priority 0
Dec 11 10:46:23.131 [12972] dbg: uridnsbl: more than 20 URIs, picking a subset
...
Dec 11 10:46:23.272 [12972] dbg: async: starting: DNSBL-A, dns:A:70.175.80.195.iadb.isipp.com (timeout 15.0s, min 3.0s)
Dec 11 10:48:22.828 [12972] dbg: check: check_main, time limit in 1639.598 s
...
Dec 11 10:48:23.005 [12972] dbg: bayes: corpus size: nspam = 89264, nham = 17109
Dec 11 10:49:30.445 [12972] dbg: bayes: tokenized body: 6158242 tokens
Dec 11 10:49:35.335 [12972] dbg: bayes: tokenized uri: 10881 tokens
Dec 11 10:49:35.351 [12972] dbg: bayes: tokenized invisible: 0 tokens
Dec 11 10:49:35.354 [12972] dbg: bayes: tokenized header: 208 tokens
Dec 11 10:50:54.200 [12972] dbg: bayes: score = 0.5
...
Dec 11 10:50:54.202 [12972] dbg: check: tagrun - tag TOKENSUMMARY is now ready, value: CODE(0x5617de4969e8)
Dec 11 10:50:58.537 [12972] dbg: async: select found no responses ready (t.o.=0.0)
Dec 11 10:50:58.537 [12972] dbg: async: queries completed: 0, started: 0
Dec 11 10:50:58.537 [12972] dbg: async: queries active: DNSBL-A=4 DNSBL-TXT=2 URI-A=9 URI-DNSBL=20 URI-NS=10, all expired at Wed Dec 11 10:50:58 2019
Dec 11 10:51:01.653 [12972] dbg: rules: running rawbody tests; score so far=-0.699
...
Dec 11 10:51:02.711 [12972] dbg: rules: compiled body tests
Dec 11 10:51:08.066 [12972] dbg: rules: ran body rule __hk_bigmoney ======> got hit: "$NK7M"
Dec 11 10:52:00.372 [12972] dbg: rules: ran body rule __DRUGS_MUSCLE1 ======> got hit: "@S"'<0 MA[+*"
Dec 11 10:52:01.853 [12972] dbg: rules: ran body rule __LOTSA_MONEY_03 ======> got hit: "$3M"
Dec 11 10:52:01.886 [12972] dbg: rules: ran body rule __DOS_BODY_WED ======> got hit: "WED"
Dec 11 10:52:05.859 [12972] dbg: rules: ran body rule __LOTSA_MONEY_01 ======> got hit: "$94O0541"
Dec 11 10:52:31.895 [12972] dbg: rules: ran body rule __HAS_ANY_EMAIL ======> got hit: "a@nspnz.s"
Dec 11 10:52:53.298 [12972] dbg: rules: ran body rule __DOS_BODY_SUN ======> got hit: "SUN"
Dec 11 10:52:53.298 [12972] dbg: rules: ran body rule __DOS_BODY_TUE ======> got hit: "Tuesday"
Dec 11 10:53:01.629 [12972] dbg: rules: ran body rule __FIFTY_FIFTY ======> got hit: "50%"
Dec 11 10:53:04.870 [12972] dbg: rules: ran body rule __DOS_BODY_SAT ======> got hit: "SAT"
Dec 11 10:53:06.939 [12972] dbg: rules: ran body rule __DOS_BODY_FRI ======> got hit: "FRI"
Dec 11 10:53:06.951 [12972] dbg: rules: ran body rule __freemail_safe_fwd ======> got hit: "---Original Message"
Dec 11 10:56:14.590 [12972] dbg: rules: ran body rule __FRAUD_DBI ======> got hit: "$,, M"
Dec 11 10:56:58.462 [12972] dbg: rules: ran body rule __FB_COST ======> got hit: "COST"
Dec 11 10:57:02.611 [12972] dbg: rules: ran body rule FUZZY_PRICES ======> got hit: "PR!@*3Z"
Dec 11 10:57:07.993 [12972] dbg: rules: ran body rule WEIRD_QUOTING ======> got hit: """,`_'2""""
Dec 11 10:57:11.069 [12972] dbg: rules: ran body rule FUZZY_CPILL ======> got hit: "KYO11Z"
Dec 11 10:57:21.916 [12972] dbg: rules: ran body rule __LOTSA_MONEY_02 ======> got hit: "2,3O964$"
Dec 11 10:57:47.954 [12972] dbg: rules: ran body rule __DOS_BODY_THU ======> got hit: "THU"
Dec 11 10:58:03.551 [12972] dbg: rules: ran body rule __LOTSA_MONEY_04 ======> got hit: "1MN98USD"
Dec 11 10:58:10.930 [12972] dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "R"
Dec 11 10:58:29.975 [12972] dbg: rules: ran body rule FUZZY_CREDIT ======> got hit: "CREDYT"
Dec 11 10:58:42.635 [12972] dbg: rules: ran body rule __FUZZY_DR_OZ ======> got hit: "DGC0S "
Dec 11 10:58:54.772 [12972] dbg: rules: ran body rule __DOS_BODY_TICKER ======> got hit: "MVYR.PK"
Dec 11 10:59:20.483 [12972] dbg: rules: ran body rule __FB_NUM_PERCNT ======> got hit: "0%"
Dec 11 10:59:20.490 [12972] dbg: rules: ran body rule __DOS_BODY_MON ======> got hit: "MON"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "R"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "M"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "<CF>"
Dec 11 11:00:24.968 [12972] dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "X;#0NA%X"
Dec 11 11:02:29.635 [12972] dbg: dns: bgread: received 113 bytes from 10.51.1.14
...
Dec 11 11:02:31.471 [12972] dbg: rules: compiled rawbody tests
Dec 11 11:02:35.862 [12972] dbg: rules: ran rawbody rule __HTML_SINGLET ======> got hit: ">W<"
...
Dec 11 11:02:36.349 [12972] dbg: rules: [...] M5ULS"=>P"
Dec 11 11:02:37.267 [12972] dbg: async: select found no responses ready (t.o.=0.0)
...
Dec 11 11:02:37.281 [12972] dbg: check: ascii_text_illegal: matches >> Odoslan<e9> z iPhonu
Dec 11 11:02:38.039 [12972] dbg: async: select found no responses ready (t.o.=0.0)
...
Dec 11 11:02:38.064 [12972] dbg: dns: entering helper-app run mode
Dec 11 11:02:43.064 [12972] dbg: dns: leaving helper-app run mode
...
Dec 11 11:02:43.735 [12972] dbg: netset: cache trusted_networks hits/attempts: 8/10, 80.0 %
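
For anyone who wants to find the slow spots in a full debug log like this one, the gaps between consecutive timestamps can be ranked with a small script. This is only a sketch: `largest_gaps` is a made-up helper, the timestamp format is taken from the lines above, and the year (not present in the log) is assumed to be 2019.

```python
import re
from datetime import datetime

# Matches the leading "Dec 11 10:45:42.261" timestamp of a debug line.
TS_RE = re.compile(r"^([A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}) ")


def largest_gaps(lines, top=5, year=2019):
    """Return the `top` largest (seconds, previous_line, line) gaps."""
    gaps = []
    prev_ts = prev_line = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # wrapped continuation lines and "..." have no timestamp
        # The log carries no year, so the assumed one is prepended for parsing.
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S.%f")
        if prev_ts is not None:
            gaps.append(((ts - prev_ts).total_seconds(), prev_line, line.rstrip()))
        prev_ts, prev_line = ts, line.rstrip()
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


sample = [
    "Dec 11 10:48:23.005 [12972] dbg: bayes: corpus size: nspam = 89264, nham = 17109",
    "Dec 11 10:49:30.445 [12972] dbg: bayes: tokenized body: 6158242 tokens",
    "Dec 11 10:49:35.335 [12972] dbg: bayes: tokenized uri: 10881 tokens",
]
for seconds, before, after in largest_gaps(sample):
    print(f"{seconds:8.3f}s before: {after}")
```

On the three sample lines this ranks the body tokenization step (a bit over a minute) ahead of the URI tokenization step. On the excerpt in this mail the elided "..." sections would of course show up as gaps too, so it is best run on the complete log.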
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I feel like I'm diagonally parked in a parallel universe.
