[SAtalk] Re: False positives

Bob Proulx Sun, 28 Dec 2003 23:26:58 -0800

Hello Lenny

Lenny Schafer wrote:
> To Spamassassin:


I am one of the users of Spamassassin.  As with many things in the
free software world it is a team effort and anyone who takes the time
and effort to contribute is part of that team.  Which means you often
won't find any particular person who can claim to be 100%
responsible for any particular free software project.  This can be a
cultural shock to people who are used to the classical commercial
software models where everything is bought and sold and there is
always someone to blame.  But so it is.

SpamAssassin is a free software project.  It is available for anyone
to download and use.  Anyone may make suggestions as to improving it.
It is not intrinsically a commercial product of any particular
company.  However, several companies are incorporating the
spamassassin engine into their commercial software and selling it.
There is no single spamassassin developer.  But there are many
developers and you are addressing the mailing list used to talk about
its development.  I am talking to you about it but I cannot claim to
be representing spamassassin development in any official way.

Your language in your email is your interface to others.  I can tell
by your message that you are frustrated, annoyed and not a little
upset.  I am sure that this can be resolved to the satisfaction of all
involved.  But language such as you have used has no place in a
reasoned discussion.  Please write more civilly in the future and you
will achieve a more reasoned response.

You are dealing with what amounts to a huge room full of people all
developing or using SpamAssassin, all talking to
each other at the same time, all reading your message which is similar
to walking into the room and telling them off.  In a real room full of
people like that you would have several who would listen, several who
would shout back, several who would leave and go home, several who would
ignore you and go back to work, and many other reactions besides.  All
of that will happen by way of communication over the internet by email
as well.  Expect it and please be patient.  It is not unlike the
process of a town meeting.  If you have attended those you will know
what I am talking about.

> My publication is double-opted in by 15,000 families with children with
> autism.

In the language of email, newsletters and spam, "double opt-in" sounds
like the language a spammer uses when they want to imply that
confirmation from the user is redundant to just getting their email
address.  The better term is "confirmed opt-in".  Using the "double
opt-in" term will cause many people to have a knee-jerk reaction and
assume you are a spammer because you use a spammer's language.  I
suggest that if you are confirmed opt-in that you use that term
instead.

> We are routinely victimized by incompetent software like
> spamassassin because of false positives.

Statements such as that during an introduction of two parties rarely
lead the other party to be kindly disposed to helping you.  This is a
completely symmetrical statement to one that says that spamassassin is
victimized by incompetent newsletter authors who don't know the
basics of newsletter production and distribution.  If that last
statement raised your blood to a boil then you know how we feel about
your statement.  Mudslinging such as this is not productive.

> False positives are intolerable and commercial products that allow them
> should be outlawed as much as spam should be.

First, spamassassin is not a commercial product.

Second, it is not inflicted upon people without their knowledge.
People who find spam intolerable are downloading and installing
spamassassin in order to tag messages as likely spam.  If someone is
using spamassassin then they have "opted-in" by installing it.  They
may then customize it further.  No one is forcing them to use it.

Many ISPs are installing spamassassin for their users.  This is
because spamassassin is a very highly rated spam analysis program.
Users are finding that the huge flood of spam makes email completely
unusable.  Users are demanding that their ISP provide spam filtering
to stop the flood and to make email usable again.  ISPs install
software to tag and filter spam.  One of those software programs is
frequently spamassassin.  But surely it is the responsibility of the
ISP to inform users as to their actions.  And just as surely the ISP
is working in good faith to provide the users with the best tools and
environment possible.

Third, you say "outlaw" as if technical problems can be solved through
enactments of legislature by the governing body.  Fraud is outlawed in
most countries and yet my mailbox is still filled with fraud schemes
from con-artists.  If it were so simple that outlawing something made
it stop, our lives would be much easier.  But there are
the technical problems of actually making it work!  As they say, build
a better mousetrap and the world will beat a path to your door.  But
outlawing mice?

> I do not know if this is the right place to complain as I could not find an
> email address that offers feedback to the company.  This arrogance stinks,
> too.  As if software developers don't need public feedback about their junky
> products.

The person who installed spamassassin obviously downloaded and
installed it from somewhere.  One would think they would remember
doing that and where it came from.  That was probably the web page
http://www.spamassassin.org.  Both on the web page and as
documentation in the installed package there are extensive help files
and pointers to other avenues of help such as this mailing list.  And
regardless of those, web search engines such as Google find the
documentation and home pages easily.

> This piece of junk software rates my publication 99%-100% likely to be spam.
> "* 3.0 -- BODY: Bayesian classifier says spam probability is 99 to 100%"
> Ha! What crap.

By your statement it is clear that you do not know what a Bayesian
classifier is or how it works.  But that is normal.  Most people who
are not students of computer learning algorithms have never heard of
it.  But it is a method employed by many software programs for
computer learning.  SpamAssassin uses it for just that purpose.

Note the emphasis on "computer learning".  The classification database
is not shipped with spamassassin.  That is not how it works.  Instead
spamassassin is capable of learning from what the user tells it.  As
initially installed spamassassin's Bayesian classifier knows nothing.
It is a newborn baby learning about the world for the first time for
every user who installs it.

The user is required to teach spamassassin with a number of spam
messages and a number of non-spam messages, generally several hundred
of each type, what is spam and what is non-spam _for_that_user_.
Spamassassin uses computer learning algorithms to determine how to
classify messages as spam and non-spam.  After seeing what the user
says is spam _for_that_user_ it will then classify other messages
outside the learning set as spam if it looks like the previously
learned spam.  It learns based upon what the user teaches it.

In order for a message to have been classified as 99 to 100% spam by
the Bayes inference engine it is very likely that your user submitted
either that particular message to the learning algorithm as spam or
the message is very, very similar to another message which was
submitted to the learning algorithm as spam.  It is too unlikely to
have happened by accident.  Therefore I conclude by this that your
user _told_spamassassin_that_this_message_was_spam_!  If the user made
a mistake then as documented in the spamassassin documentation they
can instruct the Bayesian classification engine to forget the message
or to classify it as non-spam instead.
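
To make the learning behaviour concrete, here is a minimal sketch in
Python of a per-user Bayesian text classifier.  It is an illustration
only and not spamassassin's actual implementation: the ToyBayes class,
its tokenizer, the add-one smoothing and the sample messages are all
assumptions made up for this example.  It shows a fresh classifier
having no opinion, a message scoring very high after the user reports
it as spam, and the score dropping once the user forgets that report
and relearns the message as non-spam.

```python
import math
import re
from collections import Counter


class ToyBayes:
    """Tiny per-user Bayesian text classifier (illustration only)."""

    def __init__(self):
        # A fresh install knows nothing: no tokens, no messages.
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Unique lowercase words; real tokenizers are far more elaborate.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def forget(self, text, label):
        # Undo an earlier learn() of the same text under the same label.
        self.tokens[label].subtract(self.tokenize(text))
        self.messages[label] -= 1

    def spam_probability(self, text):
        if not (self.messages["spam"] and self.messages["ham"]):
            return 0.5  # untrained: no opinion either way
        # Start from the prior odds, then add evidence token by token.
        log_odds = math.log(self.messages["spam"] / self.messages["ham"])
        for token in self.tokenize(text):
            # Add-one smoothing so unseen tokens never give a zero.
            p_spam = (self.tokens["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# Hypothetical messages, invented for this example.
newsletter = "free autism newsletter click here for no fees conference news"

bayes = ToyBayes()
untrained = bayes.spam_probability(newsletter)

bayes.learn("cheap pills click here no fees", "spam")
bayes.learn("meeting notes for the school board", "ham")
bayes.learn(newsletter, "spam")      # the user reports the newsletter as spam
before = bayes.spam_probability(newsletter)

bayes.forget(newsletter, "spam")     # the user corrects the mistake
bayes.learn(newsletter, "ham")
after = bayes.spam_probability(newsletter)

print(f"untrained={untrained:.2f} before={before:.4f} after={after:.4f}")
```

The point of the sketch is that the database holds nothing but what
this one user fed it, so a very high score for a message means that
message, or one much like it, was taught to it as spam.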

Bayesian classification as implemented by spamassassin is a personal
learning algorithm.  It learns by what the person tells it.  It is not
suitable for a large group to use as a group.  An ISP would not
implement a single database for multiple users.  It just does not work
that way.  Therefore it can be concluded that you are talking about a
single user who is instructing spamassassin with conflicting data.

[...rearranged the below for clarity...]
> ---- Start SpamAssassin results
> 7.10 points, 5.5 required;
> * 3.0 -- BODY: Bayesian classifier says spam probability is 99 to 100% [score: 0.9988]

By properly instructing spamassassin's Bayesian classifier the 3.0
points awarded would not have happened and this would only have scored
4.1 points which is below the threshold and would not have been tagged
as spam.  However, two additional things bother me.

> * -4.3 -- AWL: Auto-whitelist adjustment

The auto-whitelist adjustment is what would have brought your message
down below the threshold.  This is another personal user customization
and learning methodology.  Which means that all of the other signs of
spam are in your newsletter.
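
The arithmetic can be checked by adding up the rule scores quoted in
the report.  The sketch below is in Python; the scores and rule texts
are copied from the quoted results, while the total() helper and the
variable names are my own and only serve this example.

```python
from decimal import Decimal

REQUIRED = Decimal("5.5")  # threshold from the quoted report

# (score, rule text) pairs copied from the quoted SpamAssassin results.
RULES = [
    ("3.0", "Bayesian classifier says spam probability is 99 to 100%"),
    ("-4.3", "AWL: Auto-whitelist adjustment"),
    ("-0.1", "Message-Id indicates the message was sent from MS Exchange"),
    ("0.9", "No such thing as a free lunch (3)"),
    ("0.5", "No Fees"),
    ("0.5", "Possible porn - Hot, Nasty, Wild, Young"),
    ("0.1", 'HTML link text says "click here"'),
    ("0.1", "HTML font color is red"),
    ("0.2", "FONT Size +2 and up or 3 and up"),
    ("0.1", "HTML font color not within safe 6x6x6 palette"),
    ("1.5", "Message is 20% to 30% HTML"),
    ("0.1", 'HTML has "tbody" tag'),
    ("0.2", "JavaScript code"),
    ("0.1", "HTML font color is blue"),
    ("0.2", "HTML contains unsafe auto-executing code"),
    ("2.9", 'HTML has very strong "shouting" markup'),
    ("0.4", "Uses %-escapes inside a URL's hostname"),
    ("0.7", "Includes a link to a likely spammer email address"),
    ("0.0", "Asks you to click below"),
]


def total(rules, without=None):
    """Sum the scores, optionally leaving out rules whose text starts with `without`."""
    return sum(
        (Decimal(score) for score, text in rules
         if without is None or not text.startswith(without)),
        Decimal("0"),
    )


full = total(RULES)                          # every rule as reported
no_bayes = total(RULES, without="Bayesian")  # Bayes rule left out
no_awl = total(RULES, without="AWL")         # auto-whitelist left out

for label, score in [("as reported", full),
                     ("without Bayes", no_bayes),
                     ("without AWL", no_awl)]:
    verdict = "spam" if score >= REQUIRED else "not spam"
    print(f"{label}: {score} points, {REQUIRED} required -> {verdict}")
```

Decimal is used instead of float so that the tenths add up exactly.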

> * -0.1 -- Message-Id indicates the message was sent from MS Exchange
> * 0.9 -- BODY: No such thing as a free lunch (3)
> * 0.5 -- BODY: No Fees
> * 0.5 -- BODY: Possible porn - Hot, Nasty, Wild, Young
> * 0.1 -- BODY: HTML link text says "click here"
> * 0.1 -- BODY: HTML font color is red
> * 0.2 -- BODY: FONT Size +2 and up or 3 and up
> * 0.1 -- BODY: HTML font color not within safe 6x6x6 palette
> * 1.5 -- BODY: Message is 20% to 30% HTML
> * 0.1 -- BODY: HTML has "tbody" tag
> * 0.2 -- BODY: JavaScript code
> * 0.1 -- BODY: HTML font color is blue
> * 0.2 -- BODY: HTML contains unsafe auto-executing code
> * 2.9 -- BODY: HTML has very strong "shouting" markup
> * 0.4 -- URI: Uses %-escapes inside a URL's hostname
> * 0.7 -- URI: Includes a link to a likely spammer email address
> * 0.0 -- Asks you to click below

I looked at the HTML of your newsletter.  It is some of the ugliest
HTML markup I have seen.  I am sure it was generated by a program of
some sort because there was not a single newline in the entire file.
It was all on one line!  I am sure that you are only looking at the
front end of some program which is generating it.  But the result is
all that anyone else can see and that result is not good.  Because of
the poor structure of the HTML it has many similarities to spam
messages.

I am sure that others more expert in the area of HTML could provide
suggestions on how to clean that up.  I recommend running it through
'tidy' before posting which in my testing of it here reduced the
result well below the threshold.  http://tidy.sourceforge.net/

But let me make a different suggestion.  Instead of sending HTML mail
(which is like sending a web page) let me suggest that it is more
appropriate to send a pointer to your web page instead.  That will
reduce the size of your messages significantly and make the
mechanics of sending it more efficient.  And also HTML mail is not
appreciated by many on the 'net.  Almost all of the spam I receive is
HTML and that by itself puts any non-spam use of it at a large
disadvantage.

Let me suggest summarizing the latest news and current events and
letting your users read the further details and infrastructure on your
web site.  I subscribe to many different newsletters and find that
format to be the most pleasant to read.  Having an easy to read
newsletter means that users are more likely to continue to read a
newsletter.
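
As a rough sketch of that format, here is one way to build such a
message with Python's standard email package.  The addresses, subject,
headlines and URL are placeholders made up for the example, and
build_announcement() is a hypothetical helper, not part of any
newsletter tool.

```python
from email.message import EmailMessage


def build_announcement(sender, recipient, subject, summaries, url):
    """Build a plain-text message: short summaries plus a link to the site."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject

    lines = ["In this issue:", ""]
    lines += [f"  * {item}" for item in summaries]
    lines += ["", f"Full articles: {url}"]

    # set_content() with a str produces a single text/plain part, no HTML.
    msg.set_content("\n".join(lines))
    return msg


# Placeholder values, invented for this example.
msg = build_announcement(
    sender="newsletter@example.org",
    recipient="subscriber@example.net",
    subject="Newsletter summary",
    summaries=[
        "First headline, with a one-sentence summary.",
        "Second headline, with a one-sentence summary.",
    ],
    url="http://www.example.org/newsletter/",
)

print(msg)
```

A message built this way has no fonts, colors, tables or scripts in it
at all, so none of the HTML rules listed above have anything to match.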

Hope this helps,
Bob

