On 2017-05-31 16:59, Kim Roar Foldøy Hauge wrote:
On Wed, 31 May 2017, John Hardin wrote:

On Thu, 1 Jun 2017, Benny Pedersen wrote:

 John Hardin wrote on 2017-06-01 00:29:

>   That sort of thing has happened before, and there are rules to *try*
>   to catch nonsense headers in my sandbox, but IIRC they never worked
>   well enough in masscheck to actually get published.

 would it be possible to make a list of non-nonsense headers, and based on
 that count how many other headers are in a mail?

Define "nonsense".

There are a fairly limited number of headers explicitly defined by the various RFCs which could be used to restrict the hits, but the number of *valid* headers is unbounded - any header that begins with "X-" is permitted.
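To make the distinction concrete, here is a rough Python sketch of sorting a message's header names into "explicitly defined", "X-" and "everything else". The KNOWN_HEADERS set and the split_headers name are made up for illustration; the set is only a small sample of fields from RFC 5322 and RFC 2045, not a complete registry.

```python
# Small, deliberately incomplete sample of standard header names (assumption).
KNOWN_HEADERS = frozenset({
    "from", "to", "cc", "bcc", "subject", "date", "message-id",
    "reply-to", "sender", "in-reply-to", "references", "received",
    "return-path", "mime-version", "content-type",
    "content-transfer-encoding",
})


def split_headers(names, known=KNOWN_HEADERS):
    """Sort header names into (known, x_headers, other), case-insensitively."""
    known_hits, x_headers, other = [], [], []
    for name in names:
        key = name.lower()
        if key in known:
            known_hits.append(name)
        elif key.startswith("x-"):
            x_headers.append(name)
        else:
            other.append(name)
    return known_hits, x_headers, other
```

Only the size of the first bucket is bounded. The "X-" bucket can hold anything, which is why a plain count of unlisted headers is a weak signal on its own.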

 and thus, based on how many other headers a mail has, say it's more spammy
 when it has too many headers that are not on that list?

 anyway food for bayes training

Potentially.

The headers' randomness could be a clue. Perhaps a plugin that records headers in a database with a "seen" count, and if a message has more than a half-dozen or so low-seen-count headers then it would earn a point or two. The risk there is FP on messages with a bunch of unusual but not-spammy headers.
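A minimal sketch of that idea in Python, not an actual SpamAssassin plugin: the "database" is just an in-memory Counter, and the class name, the rare_below cutoff and the max_rare threshold of six are all assumptions picked to match the "half-dozen or so" above.

```python
from collections import Counter
from email import message_from_string


class HeaderSeenCounter:
    """Track how often each header name has been seen (hypothetical helper)."""

    def __init__(self, rare_below=3, max_rare=6):
        self.seen = Counter()          # stand-in for a real database table
        self.rare_below = rare_below   # seen fewer times than this = "low-seen"
        self.max_rare = max_rare       # more rare headers than this earns points

    @staticmethod
    def header_names(raw_message):
        # Header names are case-insensitive, so normalise to lower case.
        msg = message_from_string(raw_message)
        return {name.lower() for name in msg.keys()}

    def record(self, raw_message):
        """Bump the seen count once per distinct header name in the message."""
        self.seen.update(self.header_names(raw_message))

    def rare_headers(self, raw_message):
        return sorted(
            name for name in self.header_names(raw_message)
            if self.seen[name] < self.rare_below
        )

    def score(self, raw_message, points=1.0):
        """Return `points` if the message has more than max_rare rare headers."""
        if len(self.rare_headers(raw_message)) > self.max_rare:
            return points
        return 0.0
```

The FP risk shows up directly here: a legitimate message from an unusual mailer with seven or more never-seen headers would earn the same point as the spam, until those headers have been recorded often enough to pass rare_below.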


To me, this sounds like an excellent candidate for some sort of Bayes filtering. Use the headers to make tokens. Tokens that are only in spam, or never seen before, should lead to a slightly higher score.

Regular headers should be scored 0 or an extremely low negative score.

Since headers are somewhat more limited than the body, there should be less room for false negatives if there is a decent default set of headers already in the database.
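Roughly what I have in mind, as a Python sketch. This is a plain naive-Bayes toy and is not meant to mirror SpamAssassin's own Bayes code; the HeaderBayes name, the Laplace smoothing and the 0.6 value for never-seen tokens are my assumptions.

```python
import math
from collections import Counter


class HeaderBayes:
    """Toy naive-Bayes scorer that uses header names as tokens."""

    def __init__(self, unseen_prob=0.6):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0
        self.unseen_prob = unseen_prob  # "slightly higher" for unknown tokens

    def train(self, header_names, is_spam):
        tokens = {h.lower() for h in header_names}
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        """Estimated spam probability for a single header token."""
        s, h = self.spam[token], self.ham[token]
        if s + h == 0:
            return self.unseen_prob
        # Laplace-smoothed per-class frequencies keep the result inside (0, 1).
        p_spam = (s + 1) / (self.n_spam + 2)
        p_ham = (h + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, header_names):
        """Combine token probabilities through summed log-odds."""
        tokens = {h.lower() for h in header_names}
        log_odds = sum(
            math.log(p / (1.0 - p)) for p in map(self.token_prob, tokens)
        )
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A header seen equally often in ham and spam comes out at 0.5, which contributes zero log-odds, so regular headers end up neutral. Spam-only and never-seen headers push the score up.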

Legitimate mail with a lot of odd headers is, hopefully, a very rare occurrence.

If I were to guess, adding such headers is done to confuse tools that compute hashes based on headers or use bayes filtering on the entire mail,
since it adds innocent words to the mail without showing them to most end-users.

That's basically the "Bayes Poison" argument. It should be possible to do better. I'm also finding here that a Bayes that remembered two-word phrases could go a long way toward killing off spam. (In this context "a", "and", "the", "his", and other such words would be ignored when gathering the two-word phrases.) I suspect it would be a nasty piece of code to write, but I do think it could produce some nice results. Results on the random headers, specifically, might be pretty good too.
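The phrase-gathering half, at least, is not much code. A rough Python sketch follows; the phrase_tokens name, the regex and everything in STOP_WORDS beyond "a", "and", "the" and "his" are my own guesses. The nasty part would be wiring this into the Bayes store and keeping the token database from exploding.

```python
import re

# Words skipped when pairing; only the first four come from the text above.
STOP_WORDS = frozenset({"a", "and", "the", "his", "an", "her", "of", "to"})


def phrase_tokens(text, stop_words=STOP_WORDS):
    """Return two-word phrase tokens, pairing words after stop words are dropped."""
    words = [
        w for w in re.findall(r"[a-z0-9']+", text.lower())
        if w not in stop_words
    ]
    # Pair each remaining word with its successor.
    return [f"{a} {b}" for a, b in zip(words, words[1:])]
```

For example, "Claim the prize and his money now" would give the tokens "claim prize", "prize money" and "money now".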

{^_^}
