Michael Bell said:
> A fair statement as to what it is good for, yes. It could be used for
> Bayesian body stuff - dunno how that's stacked up in your tests
> (which I notice do include most headers) - but it's pretty limited
> otherwise.
Well, Bayes-ish stuff does *much* better when it's allowed
Michael Bell <[EMAIL PROTECTED]> writes:
> Out of an evil sense of malice, here's an example of one of their
> falsely included messages which IMO doesn't belong in the corpus - it
> is simply NOT spam per se.
That message doesn't appear to be spam, but it could be. Spammers
often disguise thei
A fair statement as to what it is good for, yes. It could be used for
Bayesian body stuff - dunno how that's stacked up in your tests
(which I notice do include most headers) - but it's pretty limited
otherwise.
Note that the PR for these guys (CipherMail or whatever $25000 box
it's called Ironmail
* Michael Bell ([EMAIL PROTECTED]) wrote:
> Agreed. I think it's worthless too. Just wanted to bring up the
> topic, so we could all be prepared for newbies asking the question.
> Now we have a thread to point to
>
> Here's an example of their substandard corpus. Note that while
> looking for an e
--- Justin Mason <[EMAIL PROTECTED]> wrote:
>
> I haven't looked yet, but
>
> (a) if they're not well-cleaned (i.e. if there is valid nonspam in
> there),
> it's going to seriously impact the archive's usefulness.
It's not well-cleaned. In a random survey of 5 spam files, one was
clearly a va
Michael Bell said:
> Kinda hard to say. Most of it IS spammy and valid MIME as far as I
> could tell. I did catch a few clearly-non-spam (evite) things in the
> corpus.
>
> The lack of Received lines does mess up quite a few DNS-related tests
> (RBL, MX records) but I wouldn't think that alone
Michael Bell <[EMAIL PROTECTED]> writes:
> Kinda hard to say. Most of it IS spammy and valid MIME as far as I
> could tell. I did catch a few clearly-non-spam (evite) things in the
> corpus.
You caught or SA caught? ;-)
> The lack of Received lines does mess up quite a few DNS-related tests
>
Kinda hard to say. Most of it IS spammy and valid MIME as far as I
could tell. I did catch a few clearly-non-spam (evite) things in the
corpus.
The lack of Received lines does mess up quite a few DNS-related tests
(RBL, MX records) but I wouldn't think that alone made a 23%
difference (83% on Jus
* Michael Bell ([EMAIL PROTECTED]) wrote:
>
> I will note that they are poorly organized, with the headers
> hand-edited and useful things like the Received headers removed. Hence
> all DNS stuff wasn't worth running. Plus I'm dubious about what
> they've done to the formatting, etc.
If what y