On 02/07/2011 05:37 PM, Mahmoud Khonji wrote:
On 01/21/2011 01:06 AM, Warren Togami Jr. wrote:
On 1/20/2011 7:23 AM, R - elists wrote:
initially this came across as a really suspect idea...
i.e., one man's junk is another man's treasure
Ham is a lot easier to define than Spam. Ham is simply anything that
you subscribed to.
I am currently subscribed to a number of mailing lists to collect ham
emails (in addition to other sources). While it might be true that
mailing lists can be good sources of ham, their emails do not contain
a realistic diversity of features/characteristics.
I explicitly excluded discussion mailing lists from the ham trap.
In my view, the issue is not just ensuring an email is ham, but also
ensuring that it contains a realistic set of features. If the features are
not realistic, and if we optimize test scores based on that, then we
might end up worsening test scores for realistic end-users.
Not if it is subscribed to hundreds of opt-in subscriptions for
legitimate mail that ordinary users receive, most of which is otherwise
not represented in the corpora. Many of these subscriptions send mail
only once a week or month.
It is true that the hamtrap corpus is synthetic and thus not fully
representative in frequencies of real ham. But its volume is only a
tiny fraction of a percent of our total ham. It helps us to detect and
fix problems in individual rules by injecting some variety without
causing a measurable impact on the entire corpus.
For example, most list emails are non-HTML, while most end-user ham and
spam emails are HTML. Evaluating sets of features (or tests) based on
this unrealistic corpus is likely to fool us into thinking that a
feature/test is more effective than it is in reality (i.e. we might
end up giving MIME-based tests higher scores).
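
To make the concern above concrete, here is a minimal sketch. The hit
counts are made up for illustration and the test is hypothetical (it fires
on HTML mail); none of this comes from the actual masscheck data. It only
shows how the share of HTML in the ham corpus changes how good such a test
looks.

```python
def precision(spam_hits, ham_hits):
    """Fraction of the messages hit by a test that are actually spam."""
    return spam_hits / (spam_hits + ham_hits)

# Hypothetical corpora of 1000 spam and 1000 ham each.
# The test fires on 900 of the 1000 spam messages in both cases.

# Ham dominated by plain-text list mail: only 100 ham messages are hit.
list_heavy = precision(spam_hits=900, ham_hits=100)

# Ham that looks like end-user mail (mostly HTML): 700 ham messages are hit.
end_user_like = precision(spam_hits=900, ham_hits=700)

print(f"list-heavy ham corpus:    {list_heavy:.4f}")
print(f"end-user-like ham corpus: {end_user_like:.4f}")
```

With these assumed counts the same test looks far more precise on the
list-heavy corpus than on the end-user-like one, which is the kind of
overestimate described above. How large the effect is in practice depends
on how much of the total ham such mail actually represents.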
The spec and implementation of this ham trap already took this and many
other issues into consideration. We've already had a few experts here
conclude the plan is sound.
I'm somewhat annoyed by the armchair quarterback negative comments on
this topic. People (not just you) didn't read the rest of this thread to
realize this particular concern is moot. None of the people complaining
about how this is such a bad idea are being helpful by actually
participating in the nightly masscheck.
Talk is cheap. I'm actually doing something.
Warren