Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-12-02 Thread Justin Mason
Michael Bell said: > A fair statement as to what it is good for, yes. It could be used for > bayesian body stuff - dunno how that's stacked up in your tests > (which I notice do include most headers) - but it's pretty limited > otherwise. well, bayesish stuff does *much* better when it's allowed

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-12-01 Thread Daniel Quinlan
Michael Bell <[EMAIL PROTECTED]> writes: > Out of an evil sense of malice, here's an example of one of their > falsely included messages which IMO doesn't belong in the corpus - it > is simply NOT spam per se. That message doesn't appear to be spam, but it could be. Spammers often disguise thei

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-12-01 Thread Michael Bell
A fair statement as to what it is good for, yes. It could be used for bayesian body stuff - dunno how that's stacked up in your tests (which I notice do include most headers) - but it's pretty limited otherwise. Note that the PR for these guys (CipherMail or whatever $25000 box it's called Ironmail

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-12-01 Thread Matthew Davis
* Michael Bell ([EMAIL PROTECTED]) wrote: > Agreed. I think it's worthless too. Just wanted to bring up the > topic, so we could all be prepared for newbies asking the question. > Now we have a thread to point to > > Here's an example of their substandard corpus. Note that while > looking for an e

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-12-01 Thread Michael Bell
--- Justin Mason <[EMAIL PROTECTED]> wrote: > > I haven't looked yet, but > > (a) if they're not well-cleaned (i.e. if there is valid nonspam in > there), > it's going to seriously impact the archive's usefulness. It's not well-cleaned. In a random survey of 5 spam files, one was clearly a va

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-12-01 Thread Justin Mason
Michael Bell said: > Kinda hard to say. Most of it IS spammy and valid MIME as far as I > could tell. I did catch a few clearly-non-spam (evite) things in the > corpus. > > The lack of Received lines does mess up quite a few DNS related tests > (RBL, MX records) but I wouldn't think that alone

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-11-30 Thread Daniel Quinlan
Michael Bell <[EMAIL PROTECTED]> writes: > Kinda hard to say. Most of it IS spammy and valid MIME as far as I > could tell. I did catch a few clearly-non-spam (evite) things in the > corpus. You caught or SA caught? ;-) > The lack of Received lines does mess up quite a few DNS related tests >

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-11-29 Thread Michael Bell
Kinda hard to say. Most of it IS spammy and valid MIME as far as I could tell. I did catch a few clearly-non-spam (evite) things in the corpus. The lack of Received lines does mess up quite a few DNS related tests (RBL, MX records) but I wouldn't think that alone made a 23% difference (83% on Jus

Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

2002-11-29 Thread Matthew Davis
* Michael Bell ([EMAIL PROTECTED]) wrote: > > I will note that they are poorly organized, with the headers hand > edited, and useful things like the RECEIVED headers removed. Hence > all DNS stuff wasn't worth running. Plus I'm dubious about what > they've done to the formatting, etc. If what y
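
A rough illustration of the point raised in this thread about stripped Received headers: the sketch below, in Python, measures what fraction of messages in a corpus still carry at least one Received header, which is what the DNS-based tests (RBL, MX) depend on. It assumes each message is available as a raw RFC 822 string. The function names and the two sample messages are hypothetical and are not part of SpamAssassin or of the spamarchive.org tooling.

```python
from email import message_from_string


def has_received(raw_message: str) -> bool:
    """Return True if the message has at least one Received header."""
    msg = message_from_string(raw_message)
    # get_all() returns None when the header is absent.
    return bool(msg.get_all("Received"))


def received_coverage(raw_messages) -> float:
    """Fraction of messages that still carry a Received header."""
    messages = list(raw_messages)
    if not messages:
        return 0.0
    return sum(has_received(m) for m in messages) / len(messages)


# Hypothetical sample messages, for illustration only.
with_trace = (
    "Received: from mail.example.com by mx.example.org; "
    "Fri, 29 Nov 2002 10:00:00 +0000\n"
    "From: a@example.com\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
stripped = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(received_coverage([with_trace, stripped]))
```

A low coverage figure would suggest that network tests cannot be evaluated fairly against the corpus, independent of how well the body rules perform.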