On Wed, 10 Dec 2003 15:28:27 -0600, SpamTalk <[EMAIL PROTECTED]>
posted to spamassassin-talk:
>> Note that numbers are sometimes substituted for letters.
>> [SNIP] This argues for phoneming and/or spell-checking before ALPHA-ing.
> I figured just stripping them would be best, or with maybe an adjunct
> dictionary for common ones.
> HMMM, how about an additional "PHONEME" rendering stream. An SA 3.1
> feature, I'm sure. From my recollection of R.A. Heinlein's
> "Farnham's Freehold" (you don't think my knowledge base is based on
> TEXTBOOKS, do you?!?) phonemes have their own individual symbols,
> IIRC some are Greek like delta and phi, something in excess of 100
> atomic representations of voiced sounds. I would think there is
> probably some kind of USASCII cross reference that would allow them
> to be represented in a plaintext fashion. I wonder if there is a
> Unicode/ISO page definition for them.

There is indeed a Unicode section for International Phonetic Alphabet
symbols, and even an ASCII-IPA where you creatively substitute ASCII in
a somewhat similar way to how the spammers are doing it.

Let's not confuse terms here. Phonetics tells you how something is
actually pronounced. If you really want phonemes, that drug they all
keep writing me about is something like /,vaI'&gr@/
(<http://www.hpl.hp.com/personal/Evan_Kirshenbaum/IPA/faq.html> tells
how to read the above as IPA but is probably too technical if you are
not familiar with basic phonological terms.)

Getting back on topic, the problem with a stepwise normalization of the
message is that it assumes the transformations are applied consistently
and mechanically. What would be really neat would be to have an
automaton which recognizes all possible variants at the same time. The
obfu script (look back a few days in the archives) is a nice start, but
it could obviously be improved. In the grand scheme of things, I imagine
you would have to use another formalism instead of regular expressions
to really capture what the spammers are doing.

/* era */

-- 
The email address era at iki dot fi is heavily spam filtered. If you
want to reach me, see the contact information link on my home page at
<http://www.iki.fi/era/> instead.
Just for kicks, imagine what it's like to get 500 pieces of spam for
each wanted message.
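
To make the "recognize all variants at the same time" idea a bit more
concrete, here is a minimal sketch in Python. It is not the obfu script
and not anything from SpamAssassin; the substitution table, the filler
set and the function name variant_pattern are all made up for
illustration. Each letter of the target word becomes a character class
of its plausible stand-ins, and optional filler is tolerated between
letters, so one pattern covers the variants without any prior
normalization pass.

```python
import re

# Hypothetical substitution table: each letter maps to the characters a
# spammer might use in its place. Illustrative only, not taken from obfu.
SUBSTITUTIONS = {
    "a": "a@4",
    "e": "e3",
    "g": "g9",
    "i": "i1!|l",
    "o": "o0",
}

# Filler characters tolerated between letters, as in "v.i.a.g.r.a".
# This set is an assumption as well.
FILLER = r"[\s._*-]*"


def variant_pattern(word):
    """Compile one regex that matches the listed variants of `word`."""
    parts = []
    for letter in word.lower():
        # Letters without an entry in the table only match themselves.
        alternatives = SUBSTITUTIONS.get(letter, letter)
        parts.append("[" + re.escape(alternatives) + "]")
    # The lookarounds keep the match from starting or ending inside a
    # longer run of word characters.
    body = FILLER.join(parts)
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)


pattern = variant_pattern("viagra")
for text in ("Buy V1@gra now", "v.i.a.g.r.a", "via gravel road"):
    print(text, "->", bool(pattern.search(text)))
```

This is still just a regular expression, so it only catches what the
table anticipates. Doubled letters, dropped letters or phonetic
respellings would need a richer formalism, as noted above.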