On Wed, 10 Dec 2003 15:28:27 -0600, SpamTalk <[EMAIL PROTECTED]>
posted to spamassassin-talk:
 >> Note that numbers are sometimes substituted for letters.
 >> [SNIP] This argues for phoneming and/or spell-checking before ALPHA-ing.
 > I figured just stripping them would be best, or with maybe an adjunct
 > dictionary for common ones.
 > HMMM, how about an additional "PHONEME" rendering stream. An SA 3.1
 > feature, I'm sure. From my recollection of R.A. Heinlein's
 > "Farnham's Freehold" (you don't think my knowledge base is based on
 > TEXTBOOKS, do you?!?) phonemes have their own individual symbols,
 > IIRC some are Greek like delta and phi, something in excess of 100
 > atomic representations of voiced sounds. I would think there is
 > probably some kind of USASCII cross reference that would allow them
 > to be represented in a plaintext fashion. I wonder if there is a
 > Unicode/ISO page definition for them.

There is indeed a Unicode block for International Phonetic Alphabet
symbols, and even an ASCII-IPA in which you creatively substitute ASCII
characters, in a way somewhat similar to what the spammers are doing.

Let's not confuse terms here. Phonetics tells you how something is
actually pronounced. If you really want phonemes, that drug they all
keep writing me about is something like /,vaI'&gr@/

(<http://www.hpl.hp.com/personal/Evan_Kirshenbaum/IPA/faq.html> tells
how to read the above as IPA but is probably too technical if you are
not familiar with basic phonological terms.)
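
For the curious, here is a minimal sketch of how the ASCII-IPA string
above could be turned into Unicode IPA characters. The table is
deliberately partial: it covers only the symbols in this one example,
and the name `KIRSHENBAUM_TO_IPA` and the function `to_ipa` are made up
for illustration, not taken from any library.

```python
import unicodedata

# Partial ASCII-IPA (Kirshenbaum) to Unicode table. Illustrative only:
# it lists just the symbols needed for the example string below.
KIRSHENBAUM_TO_IPA = {
    ",": "\u02cc",  # secondary stress mark
    "'": "\u02c8",  # primary stress mark
    "I": "\u026a",  # small capital I (the vowel in "bit")
    "&": "\u00e6",  # ash (the vowel in "cat")
    "g": "\u0261",  # IPA script g
    "@": "\u0259",  # schwa
}


def to_ipa(ascii_ipa):
    """Map each ASCII-IPA character to Unicode; pass unknown ones through."""
    return "".join(KIRSHENBAUM_TO_IPA.get(ch, ch) for ch in ascii_ipa)


ipa = to_ipa(",vaI'&gr@")
for ch in ipa:
    # Show the code point and official Unicode name of every character.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Characters with no entry in the table (v, a, r here) are the same in
both notations, so they are simply passed through.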

Getting back on topic, the problem with a stepwise normalization of
the message is that it assumes the transformations are applied
consistently and mechanically. What would be really neat is an
automaton which recognizes all possible variants at the same time.
The obfu script (look back a few days in the archives) is a nice
start, but it could obviously be improved. In the grand scheme of
things, I imagine you would have to use a formalism other than
regular expressions to really capture what the spammers are doing.
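
To make the "all variants at the same time" idea concrete, here is a
rough sketch of a matcher that tracks every partial match in parallel
instead of normalizing the text step by step. Everything in it is an
assumption made for illustration: the `SUBSTITUTES` and `FILLERS`
tables, the function name `find_obfuscated`, and the choice of which
tricks to allow (look-alike characters, filler between letters,
repeated letters). It is not the obfu script, and since it is still a
finite-state device it does not settle whether a richer formalism is
needed.

```python
# Hypothetical look-alike table: for each letter, the characters that
# are accepted in its place. A real table would be much larger.
SUBSTITUTES = {
    "a": "a@4",
    "g": "g9",
    "i": "i1!|l",
    "r": "r",
    "v": "v",
}

# Hypothetical filler characters allowed between two letters.
FILLERS = set(" .-_*")


def find_obfuscated(word, text):
    """Return True if some obfuscated variant of `word` occurs in `text`.

    State s means "the first s letters of `word` have been matched".
    All live states are advanced together on every input character,
    so no single normalization order has to be guessed in advance.
    """
    text = text.lower()
    goal = len(word)
    states = set()
    for ch in text:
        next_states = set()
        # State 0 is always live: a match may start at any position.
        for s in states | {0}:
            if ch in SUBSTITUTES.get(word[s], word[s]):
                if s + 1 == goal:
                    return True
                next_states.add(s + 1)
            if s > 0 and ch in FILLERS:
                next_states.add(s)  # filler between letters: "v.i.a"
            if s > 0 and ch in SUBSTITUTES.get(word[s - 1], word[s - 1]):
                next_states.add(s)  # repeated letter: "viiagra"
        states = next_states
    return False


print(find_obfuscated("viagra", "Buy V1@gra now"))
print(find_obfuscated("viagra", "a perfectly ordinary sentence"))
```

One thing this does not handle is a multi-character stand-in for a
single letter, such as "\/" for "v"; that would need extra states per
letter.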

/* era */

-- 
The email address era     the contact information   Just for kicks, imagine
at iki dot fi is heavily  link on my home page at   what it's like to get
spam filtered.  If you    <http://www.iki.fi/era/>  500 pieces of spam for
want to reach me, see     instead.                  each wanted message.
