Summary: *Should/could any consideration be given to having ASSP scan the entire message at the time it is received for Bombs (only), while still using MaxBytes for Bayesian/HMM?*
We've been having some cleverly crafted messages slip through all filters. They would be easy to catch with Bombs if only the catchable content came before MaxBytes. These messages are 20kb+ and have a scam phone number at the very end, well past MaxBytes. I want/need to use Bombs to catch the scam phone numbers.

With MaxBytes set to 3000, which is useful for a faster RebuildSpamDB, these BombDataRE matches just aren't being caught. If I increase MaxBytes, my BombDataRE catches them, but then RebuildSpamDB takes (probably? see below) longer than it needs to.

So, is there any value in considering a *MaxBytesAdditionalForBombs* variable, which would be *added to MaxBytes* and only used when scanning for Bombs as messages arrive? Would that kill performance? Other downsides? We could still look at only MaxBytes for Bayesian/HMM, since only MaxBytes is used when building those databases. What do you think?

And while we're talking MaxBytes: I've asked this before, but is the guidance of 3kb for MaxBytes, once there's a mature corpus, still a valid recommendation? With unlimited horsepower and RAM, sure, why not, do 30kb or 100kb. That's not my reality, so I want to see where to best allocate resources. If 3kb is still the guidance, even though the spam files I'm seeing have a median size of around 20kb, so be it. I feel like HTML wasn't used as prolifically in spam when that guidance was written. The median size of notspam in my corpus is about 40kb. That's determined unscientifically, by sorting by size and scrolling approximately halfway down.

Thanks. Have a good weekend.

Ken
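
P.S. To make the idea concrete, here is a rough Python sketch of the two scan windows. It is not ASSP code (ASSP is Perl) and says nothing about how ASSP actually slices messages internally. MaxBytes and MaxBytesAdditionalForBombs are the names from this post; the value 30000, the function names, and the phone-number regex standing in for a BombDataRE entry are all made up for illustration.

```python
import re

MAX_BYTES = 3000                        # MaxBytes as configured in this post
MAX_BYTES_ADDITIONAL_FOR_BOMBS = 30000  # proposed setting; value is hypothetical

# Hypothetical stand-in for one BombDataRE entry: a simple phone-number pattern.
BOMB_DATA_RE = re.compile(rb"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def bayes_window(message: bytes) -> bytes:
    """Portion of the message that Bayesian/HMM would see (MaxBytes only)."""
    return message[:MAX_BYTES]


def bomb_window(message: bytes) -> bytes:
    """Larger portion scanned for Bombs only, at receive time."""
    return message[:MAX_BYTES + MAX_BYTES_ADDITIONAL_FOR_BOMBS]


def matches_bomb(message: bytes) -> bool:
    """True if the bomb regex hits anywhere inside the larger bomb window."""
    return BOMB_DATA_RE.search(bomb_window(message)) is not None


# A 20kb+ message with the scam number at the very end.
sample = b"x" * 20000 + b" call 555-123-4567"
print(BOMB_DATA_RE.search(bayes_window(sample)) is not None)  # MaxBytes-only scan
print(matches_bomb(sample))                                   # extended bomb scan
```

The point of keeping two separate windows is that the corpus and RebuildSpamDB would still only ever see MaxBytes, so the extra cost would be one longer regex scan per incoming message.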
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test