On Wed, 2016-09-28 at 13:29 +0000, Nicola Piazzi wrote:
> a plugin that check similar words in oldest messages (for example 3 of
> 4 words match)
>
> Then plugin check if sender domain is different and recipient is
> different
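
The "3 of 4 words" check may not need a regex at all; comparing word sets is probably simpler. Here is a rough sketch in Python, only to illustrate the logic (a real SpamAssassin plugin would be Perl). The function names, the dict layout standing in for a maillog row, and the 0.75 threshold are all assumptions made up for this example:

```python
import re

def words(subject):
    """Split a subject into lowercase word tokens (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", subject.lower())

def similar(subject_a, subject_b, min_ratio=0.75):
    """True if at least min_ratio of subject_a's distinct words occur in subject_b.

    min_ratio=0.75 corresponds to "3 of 4 words"; it is an assumed default.
    """
    a = set(words(subject_a))
    b = set(words(subject_b))
    if not a:
        return False
    return len(a & b) / len(a) >= min_ratio

def domain(address):
    """Return the lowercased part after the last '@' of an address."""
    return address.rsplit("@", 1)[-1].lower()

def looks_like_campaign(new, old):
    """Similar subject, but different sender domain and different recipient.

    'new' and 'old' are hypothetical dicts with 'subject', 'from' and 'to'
    keys, standing in for rows of the maillog table.
    """
    return (similar(new["subject"], old["subject"])
            and domain(new["from"]) != domain(old["from"])
            and new["to"].lower() != old["to"].lower())
```

With the subjects from the example, "FedEx Shipment 702193383647 Notification" and "FedEx Shipment 722566383641 Notification" share 3 of 4 words, so similar() returns True for that pair.
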
<snip>
> Detection routine
>
> A mail arrive
>
> Subject is : FedEx Shipment 702193383647 Notification
>
> I search in maillog table for a regex that MATCH FedEx
> Shipment 702193383647 Notification ALSO IN FedEx Shipment
> 722566383641 Notification AND IN FedEx Shipment 734563383644
> Notification
>
> If it match I verify that FROM DOMAIN IS DIFFERENT
> And then I verify that TO ADDRESS IS DIFFERENT
>
> Now I need a regex sintax to put all extracted words of PHRASE
> FedEx Shipment 734563383644 Notification and match if it found
> at least 3 of 4 words

I'm also not clear on exactly what you're intending, but this
certainly sounds reminiscent of Marc Perkel's "evolution filter" (which
I don't know that anyone fully understands). From what I've gathered
from the discussion, it is token-based like Bayes, using multi-word (and
partial-word/string?) tokens, and it adds some other data and metadata
as tokens (data from headers, e.g. your From: and To: domains). It
tosses out results that aren't confident (confident meaning nearly 100%
ham or spam), and it uses Redis Sets for the set logic/operations.

If you are creating a plugin for these phishing emails, it may be an
avenue to pursue; it sounds like it works quite well (when trained with
a large ham/spam corpus).

-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107 - www.kci.net