Tom Allison wrote:

> I don't require actual character comparison, comparison of \{263a} is
> sufficient.
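A minimal sketch of that comparison, added as an illustration rather than taken from the original message. It assumes the incoming octets are UTF-8 and uses the core Encode module; the sample data and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the three UTF-8 octets that encode U+263A.
my $octets = "\xE2\x98\xBA";

# Decode the octets into a Perl character string (assumes UTF-8 input).
my $chars = decode('UTF-8', $octets);

# Three octets before decoding, one character after.
my $octet_count = length $octets;
my $char_count  = length $chars;

# Comparing against the codepoint now happens at the character level.
my $is_smiley = ($chars eq "\x{263a}") ? 1 : 0;

# A regex match works the same way on decoded text.
my $subject    = decode('UTF-8', "Hello \xE2\x98\xBA world");
my $has_smiley = ($subject =~ /\x{263a}/) ? 1 : 0;

print "octets=$octet_count chars=$char_count equal=$is_smiley match=$has_smiley\n";
```

Once the data is decoded, the number of octets behind the character no longer matters to the comparison; only the undecoded string has length 3.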
A Perl string contains characters (not octets). The codepoint U+263A is
represented by the character "\x{263a}". Whether that takes 1, 2, 3, or
even more octets in the string shouldn't matter. Read perlunitut.

If you first convert your data to proper UTF-8, the next steps are far
easier.

> And it's rather difficult to determine in raw email what the correct
> charset is to use for each string.  I find that email sometimes
> passes multiple encodings in one message making it more difficult to
> pick apart.

There are plenty of tools available to do that for you. I have never
looked for one, but I wouldn't be surprised if someone has already done
exactly that: convert an e-mail message (including, of course, all
encoded header lines and all MIME parts) to a UTF-8 version.

> As for SpamAssassin.  I'm trying to stay away from that because it's
> very large

http://wiki.apache.org/spamassassin/OutOfMemoryProblems
http://wiki.apache.org/spamassassin/SURBL

> and from a development perspective -- badly documented in
> the code.  Basically, SpamAssassin is capable for what it does, but I
> don't exactly want to do that.  Similar, yet, but not exactly.

Did you look into SA "custom rules"? I find them quite easy to use.

http://mywebpages.comcast.net/mkettler/sa/SA-rules-howto.txt
http://www.askdavetaylor.com/how_do_i_add_custom_spamassassin_rules_for_content_filtering.html
http://wiki.apache.org/spamassassin/CustomRulesets
http://www.rulesemporium.com/rules.htm

-- 
Affijn, Ruud

"Gewoon is een tijger."