Absolutely, I've thought about this.  I consider everything I post before
posting it.

Can you briefly explain why scanning (MaxBytes + some additional amount) kb
of incoming mail for bombs, while still using only MaxBytes for Bayesian
and the rebuild, would be such a bad idea?

Since you questioned if I ever thought about this, here's what the thought
process is and the reason for the request.  Maybe I didn't explain myself
well enough in the previous messages:

The MaxBytes "documentation" says to lower it to 3000 for a mature
installation, or to use 10x that if your system can handle it:

How many bytes of the message body will ASSP look at - the message header
is always included in all checks. Mails stored in the collecting folders
will be truncated to this size, if StoreCompleteMail is disabled. *The
average of Ham messages (message body) is 6K, the average of Spam messages
is 3K.* Usually the spam folder will be filled quicker than the notspam
folder, therefore set this value to 4000 to get more wordpairs per Ham
Message. When both folders are close to the maxfiles limit, reduce it to
3000.
If your system is fast enough and has enough RAM multiply all the above
recommendations and the default value by ten.


The GUI doesn't say "IF the average is 6k ham, 3k spam"; it says that it IS
6k ham / 3k spam.  That's not true of my installation.  My spam corpus, as
I've mentioned before, has a median size of about 20kb because of all of
the HTML in the messages, and my not-spam has a median size of 40kb.  Using
the logic in your GUI, *I believe I should set my MaxBytes to 20kb*, the
median size of my spam corpus.
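
For anyone who wants to measure their own corpus rather than eyeball it,
here is a minimal Python sketch that reports the median file size of a
folder.  It is only an illustration, not part of ASSP: the function name is
made up, and it measures whole files (header included), while MaxBytes
counts body bytes only.  Per the GUI text above, it is also only meaningful
if StoreCompleteMail is enabled, since otherwise the stored files are
already truncated to MaxBytes.

```python
import statistics
from pathlib import Path


def median_size_kb(folder):
    """Return the median size, in KB, of the regular files directly in folder."""
    sizes = [p.stat().st_size for p in Path(folder).iterdir() if p.is_file()]
    if not sizes:
        raise ValueError(f"no files found in {folder}")
    return statistics.median(sizes) / 1024


# Hypothetical usage; point it at your own collection folders:
# print(median_size_kb("spam"), median_size_kb("notspam"))
```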

But, if I set my MaxBytes to 20kb (which it appears to be able to handle
okay, rebuilding in an hour and change), then bombs after 20kb aren't
detected when a message is attempting delivery.

Why does this matter to me?
We're seeing messages from @gmail.com and @whatever.onmicrosoft.com
addresses that copy legitimate-looking order receipts from vendors like
Amazon.com, BestBuy (a US-based big-box electronics store), and Norton.
Many look identical to a legitimate message.  Ultimately, the scammers want
the recipient to call them on the phone and hand over a credit card number,
under the guise of issuing a refund.  Classic scam.

These messages will always pass Bayesian, because they read identically to
real messages.  BUT, I can detect some by the phone numbers that they
direct people to.  The email addresses change frequently, but the scam
phone numbers remain pretty constant.  I could maintain a list of known bad
phone numbers (also available online) to capture these messages before
they're delivered.  Simple.  If the message has one of these phone numbers,
score it such that it'll get blocked.
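
To make the phone number idea concrete, here is a rough Python sketch of
turning a list of known bad numbers into one regular expression that
tolerates the usual formatting variations.  It is an illustration under
assumptions, not ASSP's bomb file format: the function names are invented
and the numbers are made-up placeholders.

```python
import re


def phone_pattern(number):
    """Build a regex for one phone number that tolerates common separators."""
    digits = re.sub(r"\D", "", number)
    # Allow up to three separator characters between any two digits.
    sep = r"[\s().\-]{0,3}"
    return sep.join(digits)


def build_bad_number_regex(numbers):
    """Combine all numbers into a single compiled alternation."""
    body = "|".join(phone_pattern(n) for n in numbers)
    # The lookarounds stop a match from starting or ending inside a longer
    # run of digits, such as an order number.
    return re.compile(r"(?<!\d)(?:" + body + r")(?!\d)")


BAD_NUMBERS = ["800-555-0100", "(888) 555-0199"]  # made-up placeholders
bad_re = build_bad_number_regex(BAD_NUMBERS)


def contains_bad_number(text):
    """Return True if text contains any listed number in any common format."""
    return bad_re.search(text) is not None
```

One limitation of this sketch: a country code written directly against the
number with no separator (for example a leading 1) is treated as part of a
longer digit run and will not match.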

*The problem with many of these emails is that the phone number is way past
the 3k mark, and past the 20k mark too.  The scammers have a bunch of HTML
in the "confirmation" email, just like real stores tend to do.  I tried
increasing MaxBytes up to 50kb, which easily caught messages with bombs
later in the body, but that then seemed to cause a lot of false positives
and, obviously, a much longer rebuild process.*

If there could be a "continue scanning for bombs for ___kb after MaxBytes"
setting, that would let bombs later in the body be detected.  I don't know
what the downside to having such a feature would be.
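
Conceptually, what I'm asking for looks something like the Python sketch
below.  This is not how ASSP is implemented; the function and parameter
names are hypothetical and only illustrate the idea of two windows over
the same body: a shorter one for Bayesian/HMM and a longer one for bombs.

```python
def split_scan_windows(body, max_bytes, extra_bomb_bytes):
    """Return (bayes_window, bomb_window) for a message body given as bytes."""
    # Bayesian/HMM would only ever see the first max_bytes of the body.
    bayes_window = body[:max_bytes]
    # Bomb regexes would see max_bytes plus the additional allowance.
    bomb_window = body[:max_bytes + extra_bomb_bytes]
    return bayes_window, bomb_window


# A made-up message with the scam phone number about 25kb into the body.
body = b"<html>" + b"x" * 25000 + b"Call 800-555-0100</html>"
bayes_window, bomb_window = split_scan_windows(body, 20000, 30000)
found_by_bayes_window = b"800-555-0100" in bayes_window
found_by_bomb_window = b"800-555-0100" in bomb_window
```

With these made-up sizes the number falls outside the 20kb window but
inside the extended bomb window.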


Based on your reaction to my question, I'm obviously missing something
important.





On Thu, Nov 11, 2021 at 1:38 AM Thomas Eckardt <thomas.ecka...@thockar.com>
wrote:

> >Is there logic to having a separate MaxBytes setting like
> MaxBytesForBombs that's used only during message delivery?  That way, the
> entire message can be scanned for bombs, but the rebuild could use a lower
> number to better balance the differential between the average sized spam
> and average sized not-spam message.
>
> DID YOU EVER think about that ??????????????? Or do you only write
> something to fill up the community mailing list?
>
> No - no way!
>
> Thomas
>
>
>
>
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:        "ASSP development mailing list" <
> assp-test@lists.sourceforge.net>
> Date:        10.11.2021 20:22
> Subject:        Re: [Assp-test] Concept Question: Scan entire message for
> Bombs, regardless of MaxBytes setting? New MaxBytes recommendation?
> ------------------------------
>
>
>
> After about 12 weeks of going from MaxBytes of 4k to MaxBytes of 50k, I've
> seen:
> 1) Rebuild go from just over an hour (with 30k MaxFiles) to just over 2
> hours.  I'm fine with that, there's more to scan
> 2) Bomb detections improve, as a lot of what's detected is beyond the 20k
> or 30k mark
> 3) But Bayesian false positives go way up.  Lots of mail that would
> have (correctly) been delivered is now getting too high a score and is
> blocked.
>
> Surely #3 is specific to the types of messages my users are getting and I
> can tweak settings.  BUT, it makes me raise this question again:
> Is there logic to having a separate MaxBytes setting like MaxBytesForBombs
> that's used only during message delivery?  That way, the entire message can
> be scanned for bombs, but the rebuild could use a lower number to better
> balance the differential between the average sized spam and average sized
> not-spam message.
>
>
>
> On Mon, Nov 1, 2021 at 2:43 PM K Post <nntp.p...@gmail.com> wrote:
> When looking at the "Use this HTML Parser" section on the GUI, I found
> this line:
> it is recommended to set MaxBytes to 50000 (be careful on heavy load
> systems - spam bomb regular expressions will take longer using 50000!).
> I'm going to change my settings and see how bad the rebuild time is.  I've
> got enough processing power and RAM now, but the disks aren't SSD.  Just a
> 4 disk Raid 1+0 traditional HDD setup.  We'll see...
>
> Since HTML email accounts for a big percentage of all mail, might it be a
> good idea to update/expand the guidance in the MaxBytes section of the
> GUI?
>
>
>
> On Fri, Oct 29, 2021 at 8:40 PM K Post <nntp.p...@gmail.com> wrote:
> Summary:
> *Should/could any consideration be given to having ASSP scan the entire
> message at the time it is received for Bombs (only), while still using
> MaxBytes for Bayesian/HMM?*
>
> We've been having some cleverly crafted messages slip through all
> filters that would be easy to catch with Bombs if only the catchable
> content came before MaxBytes.  These messages are 20kb+.  They have a scam
> phone number at the very end, well past MaxBytes.  I want/need to use
> bombs to catch the scam phone numbers.
>
> With MaxBytes set to 3000, which is useful for faster RebuildSpamDB, these
> BombDataRE matches just aren't being caught.  If I increase MaxBytes, my
> BombDataRE catches them, but then rebuildspamdb is (probably? see below)
> longer than it needs to be.
>
> So, is there any value in considering a *MaxBytesAdditionalForBombs* variable
> which would be *added to MaxBytes* and only used when scanning for bombs
> as messages arrive?  Would that kill performance?  Other downsides?
>
> We could still only look at MaxBytes for Bayesian/HMM since it's only
> MaxBytes used when building those databases.
>
> What do you think?
>
> And while we're talking MaxBytes:
> I've asked this before: is the guidance of 3kb for MaxBytes once there's
> a mature corpus still a valid recommendation?  With unlimited horsepower
> and RAM, sure, why not, do 30kb or 100kb.  That's not my reality, so I want
> to see where to best allocate resources.  If 3kb is still the guidance, even
> though the spam files I'm seeing have a median size around 20kb, so be it.
> I feel like when that guidance was written, HTML wasn't used as
> prolifically in spam.  The median size of notspam in my corpus is about
> 40kb.  That's determined unscientifically by sorting by size and scrolling
> to approximately halfway down.
>
> Thanks.  Have a good weekend.
> Ken
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>