Re: SA Concepts - plugin for email semantics

2016-05-31 Thread David Jones
>From: RW >Sent: Tuesday, May 31, 2016 5:20 PM >To: users@spamassassin.apache.org >Subject: Re: SA Concepts - plugin for email semantics >On Tue, 31 May 2016 15:20:56 -0400 >Bill Cole wrote: >> On 29 May 2016, at 11:07, RW wrote: >> >> > Statistical filte

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread RW
On Tue, 31 May 2016 15:20:56 -0400 Bill Cole wrote: > On 29 May 2016, at 11:07, RW wrote: > > > Statistical filters are based on some statistical theory combined > > with pragmatic kludges and assumptions. Practical filters have been > > developed based on what's been found to work, not on what'

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Dianne Skoll
On Tue, 31 May 2016 21:23:11 +0100 Paul Stead wrote: > The implementation was undertaken from a personal interest - I asked > the question of what people thought of the implementation and the > impact to Bayes DB. I think what the "concepts" concept ends up doing is this: "concepts" are more-or-

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Paul Stead
On 31/05/16 20:20, Bill Cole wrote: It is no shock that while this implementation has Paul Stead's name on it, it is apparently mostly the product of the anti-spam community's most spectacular case of Dunning-Kruger Syndrome, who has apparently figured out that his personal 'brand' has negative

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Bill Cole
On 29 May 2016, at 11:07, RW wrote: On Sat, 28 May 2016 15:37:21 -0400 Bill Cole wrote: More importantly (IMHO) they aren't designed to collide with existing common tokens and be added back into messages that may contain those tokens already in order to influence Bayesian classification. The

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread RW
On Tue, 31 May 2016 12:05:39 -0400 Bill Cole wrote: > On 31 May 2016, at 2:21, Henrik K wrote: > > > On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote: > >> On Mon, 30 May 2016 17:45:52 -0400 > >> "Bill Cole" wrote: > >> > >>> So you could have 'sex' and 'meds' and 'watches' talli

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread RW
On Mon, 30 May 2016 17:45:52 -0400 Bill Cole wrote: > The "Naive Bayes" classification approach is theoretically moored to > Bayes' Theorem FWIW Bayes hasn't been "Naive Bayes" for a long time.

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Bill Cole
On 31 May 2016, at 2:21, Henrik K wrote: On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote: On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum up natural (word) and synthetic (concept)

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Reindl Harald
On 31.05.2016 at 02:30, Bill Cole wrote: On 30 May 2016, at 18:25, Dianne Skoll wrote: On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum up natural (word) and synthetic (concept) occurrences,

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Henrik K
On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote: > On Mon, 30 May 2016 17:45:52 -0400 > "Bill Cole" wrote: > > > So you could have 'sex' and 'meds' and 'watches' tallied up into > > frequency counts that sum up natural (word) and synthetic (concept) > > occurrences, not just as in

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Bill Cole
On 30 May 2016, at 18:25, Dianne Skoll wrote: On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum up natural (word) and synthetic (concept) occurrences, not just as incompatible types of input feat

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Dianne Skoll
On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: > So you could have 'sex' and 'meds' and 'watches' tallied up into > frequency counts that sum up natural (word) and synthetic (concept) > occurrences, not just as incompatible types of input feature but as > a conflation of incompatible fe

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Bill Cole
On 28 May 2016, at 17:53, John Hardin wrote: Based on that, do you have an opinion on the proposal to add two-word (or configurable-length) combinations to Bayes? CAVEAT: it has literally been decades since I've worked deep in statistics on a routine basis rather than just using blindly trust

Re: SA Concepts - plugin for email semantics

2016-05-29 Thread Reindl Harald
On 29.05.2016 at 02:46, Dianne Skoll wrote: And also, two-word phrases can be stronger indicators than the individual words; "hot" and "sex" in isolation may not be strong spam indicators, but "hot sex" probably is stronger. Going from one-word tokens to one+two-word tokens will have a pretty

Re: SA Concepts - plugin for email semantics

2016-05-29 Thread RW
On Sat, 28 May 2016 15:37:21 -0400 Bill Cole wrote: > More importantly (IMHO) they aren't designed to collide with existing > common tokens and be added back into messages that may contain those > tokens already in order to influence Bayesian classification. > > There is sound statistical theo

Re: SA Concepts - plugin for email semantics

2016-05-28 Thread Dianne Skoll
On Sat, 28 May 2016 14:53:15 -0700 (PDT) John Hardin wrote: > Based on that, do you have an opinion on the proposal to add two-word > (or configurable-length) combinations to Bayes? I have an opinion. :) Extending Bayes to look at multiple tokens is a *very* good idea. That's because naive sing
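
To make the multi-token proposal concrete, here is a minimal sketch of producing one-word plus two-word tokens from a message body. It is not SpamAssassin's tokenizer; the function name `tokens_with_bigrams` and the whitespace splitting are assumptions made only for illustration.

```python
def tokens_with_bigrams(text):
    """Return single-word tokens followed by adjacent two-word tokens.

    Hypothetical helper: real Bayes tokenizers do far more normalisation.
    """
    words = text.lower().split()
    # Pair each word with its successor to form two-word tokens.
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


print(tokens_with_bigrams("Hot sex now"))
```

With this scheme a classifier can learn a separate weight for "hot sex" than for "hot" or "sex" alone, at the cost of a larger token database.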

Re: SA Concepts - plugin for email semantics

2016-05-28 Thread John Hardin
On Sat, 28 May 2016, Bill Cole wrote: There is sound statistical theory consistent with empirical evidence underpinning the Bayes classifier implementation in SA. While there can be legitimate critiques of the SA implementation specifically and in general how well email word frequency fits Bay

Re: SA Concepts - plugin for email semantics

2016-05-28 Thread Bill Cole
On 25 May 2016, at 13:15, Dianne Skoll wrote: On Wed, 25 May 2016 18:10:57 +0100 Paul Stead wrote: [quoting Dianne] "Concepts" is a lossy process. You are throwing away information. That is by design, similar to fingerprinting emails with iXhash or Razor. iXhash and Razor are designed to

Re: SA Concepts - plugin for email semantics

2016-05-26 Thread Matus UHLAR - fantomas
On Thu, 26 May 2016 12:20:35 +0200 Matus UHLAR - fantomas wrote: you apparently mistook Razor for DCC, the DCC is here to measure bulkiness, but not (necessarily) spamminess. On 26.05.16 09:46, Dianne Skoll wrote: Yes, you are correct. Thanks for the clarification! And also, just to clarify

Re: SA Concepts - plugin for email semantics

2016-05-26 Thread Dianne Skoll
On Thu, 26 May 2016 12:20:35 +0200 Matus UHLAR - fantomas wrote: > you apparently mistook Razor for DCC, the DCC is here to measure > bulkiness, but not (necessarily) spamminess. Yes, you are correct. Thanks for the clarification! And also, just to clarify another thing: Lossy procedures are no

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Dianne Skoll
On Wed, 25 May 2016 18:10:57 +0100 Paul Stead wrote: > > Yes, except here's the problem. A drug company might legitimately > > talk about Viagra, so that wouldn't be a spam token. V1agra almost > > certainly would be a spam token. Bayes can distinguish between the > > two; "concepts" cannot.

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Paul Stead
On 25/05/16 15:21, Dianne Skoll wrote: On Wed, 25 May 2016 15:07:37 +0100 Paul Stead wrote: Consider the following 2 basic emails: Mail 1: Viagra Mail 2: V1agra Yes, except here's the problem. A drug company might legitimately talk about Viagra, so that wouldn't be a spam token. V1agra al

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Dianne Skoll
On Wed, 25 May 2016 15:07:37 +0100 Paul Stead wrote: > Consider the following 2 basic emails: > Mail 1: > Viagra > Mail 2: > V1agra Yes, except here's the problem. A drug company might legitimately talk about Viagra, so that wouldn't be a spam token. V1agra almost certainly would be a spam t
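
The Viagra/V1agra point can be sketched as follows. This is only an illustration of the argument, not the Concepts plugin's code: the concept name `meds` appears elsewhere in the thread, but the pattern table `CONCEPT_PATTERNS` and both function names are hypothetical.

```python
import re

# Hypothetical concept table: one pattern per concept name (an assumption).
CONCEPT_PATTERNS = {"meds": re.compile(r"v[i1]agra", re.IGNORECASE)}


def word_tokens(text):
    """Plain per-word tokens, as a Bayes-style classifier would see them."""
    return text.lower().split()


def concept_tokens(text):
    """Names of the concepts whose pattern matches anywhere in the text."""
    return sorted(name for name, pat in CONCEPT_PATTERNS.items() if pat.search(text))


for mail in ("Viagra", "V1agra"):
    print(mail, word_tokens(mail), concept_tokens(mail))
```

The word tokens differ between the two mails, so a classifier can weight them separately; the concept token is identical for both, which is the loss of information being described.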

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Merijn van den Kroonenberg
> It may come down to my understanding of Bayes and its tokens. Also > having a bit of a problem explaining this concept on paper... > > I see this as adding an extra layer to the Bayes: > > Consider the following 2 basic emails: > > Mail 1: > Viagra > > Mail 2: > V1agra > > > With Bayes: > > Mail 1:

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Paul Stead
It may come down to my understanding of Bayes and its tokens. Also having a bit of a problem explaining this concept on paper... I see this as adding an extra layer to the Bayes: Consider the following 2 basic emails: Mail 1: Viagra Mail 2: V1agra With Bayes: Mail 1: Mail 2: With Concepts

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Merijn van den Kroonenberg
> > With David's help I have tracked down the problem(s). Version 0.02 is > up. Would be interested to hear your thoughts - even if just theoretical > about the effect on the Bayes DB. Just in theory, I am curious what part of the Bayes filter you hope to improve? I think you are not adding any *n

Re: SA Concepts - plugin for email semantics

2016-05-24 Thread Paul Stead
On 24/05/16 17:09, David Jones wrote: Good idea. I would like to test this out so I put this on my CentOS 6 servers (perl v5.10.1) and got this: May 24 10:59:51.850 [30158] warn: plugin: failed to parse plugin /etc/mail/spamassassin/Concepts.pm: Type of arg 1 to push must be array (not priv

Re: SA Concepts - plugin for email semantics

2016-05-24 Thread David Jones
>From: Paul Stead >Sent: Tuesday, May 24, 2016 9:55 AM >To: users@spamassassin.apache.org >Subject: SA Concepts - plugin for email semantics >Hi guys, >Based upon some information from others on the list I have put together >a plugin for SA which canonicalises an email into

SA Concepts - plugin for email semantics

2016-05-24 Thread Paul Stead
Hi guys, Based upon some information from others on the list I have put together a plugin for SA which canonicalises an email into its basic "concepts". Concepts are converted to tags, which Bayes can use as tokens to further help identify spammy/hammy characteristics. Here are some examples of