On Jun 16, 2007, at 6:05 PM, Tom Phoenix wrote:
On 6/16/07, Tom Allison <[EMAIL PROTECTED]> wrote:
I'm trying to do some regular expression matching on strings in email. They
could be encoded to something. But I can't tell because I don't have a UTF-8
Unicode xterm window that will show me anything. At best I get ?????a?? and
other trash like that. I think this is typical for ASCII text renderings of
two-byte characters.
But, I think what you're saying is, you want to be able to tell
whether today's ?????a?? is the same mystery word that looked like
?????a?? in yesterday's mail, right? That is, you still won't know
what it is, but at least you'll be able to say you saw it again.
This is exactly what I'm trying to do. I just want to know if I've
seen the same string previously.
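For the "have I seen it before" part on its own, no display is needed: Perl compares strings byte for byte, so the raw string can serve directly as a hash key. A minimal sketch, assuming an in-memory hash is enough (the names %seen and seen_before are made up for illustration):

```perl
use strict;
use warnings;

my %seen;    # hypothetical in-memory store; a real filter would persist this

# Returns true if $str was already recorded by an earlier call.
sub seen_before {
    my ($str) = @_;
    return $seen{$str}++ ? 1 : 0;
}

my $mystery = "\xd0\xbf\xd1\x80a";    # arbitrary non-ASCII bytes

print seen_before($mystery) ? "seen again\n" : "new\n";    # first sighting
print seen_before($mystery) ? "seen again\n" : "new\n";    # second sighting
```

An encoding step only becomes necessary if the strings have to be logged, printed, or stored somewhere that cannot hold arbitrary bytes.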
I found something that SpamAssassin uses to convert all this "goo" into a
repeatable set of characters (which is all I'm really after) by running
something that looks like this:
    sub _quote_bytea {
        my ($str) = @_;
        my $buf = "";
        foreach my $char (split(//,$str)) {
            # Character code as an octal number, zero-padded to three digits
            my $oct = sprintf ("%lo", ord($char));
            if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
            if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
            # The single-quoted literal yields four backslashes in the output
            $buf .= '\\\\\\\\' . $oct;
        }
        return $buf;
    }
So that's somebody else's code, not yours? Does that code have any
comments that explain what it's doing? What does "_quote_bytea" mean?
SpamAssassin. But they have very few comments and not many of them
are very clear.
But it sounds to me as if you don't want that particular string; you
want any function that gives you a lossless, repeatable coding of your
input string, but unlike the input string, the desired result is
composed only of printable characters. Yes? And presumably,
compactness and readability are also desirable features of the encoded
string.
# Encode everything except the "normal" ASCII
# characters. Normal includes newline and space, but no other
# inkless characters. Normal does not include backslash.
###UNPORTABLE### Newline character is machine-dependent
$str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
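Whether that coding really is lossless can be checked by decoding it again. The sketch below wraps the substitution above in a sub and adds a matching decoder; the names encode_printable and decode_printable are invented here, and the decoder is an assumption about how the \{hex} form would be reversed, not something from the original message.

```perl
use strict;
use warnings;

# Encode: the same substitution as above, wrapped in a (hypothetical) sub.
sub encode_printable {
    my ($str) = @_;
    $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
    return $str;
}

# Decode: turn each \{hex} escape back into the character it stands for.
# Every backslash in encoded text starts an escape, because a literal
# backslash in the input is itself encoded as \{5c}.
sub decode_printable {
    my ($str) = @_;
    $str =~ s/\\\{([0-9a-f]+)\}/chr(hex($1))/eg;
    return $str;
}

my $original = "caf\xe9 \\ tab\there";
my $encoded  = encode_printable($original);
my $decoded  = decode_printable($encoded);

print "$encoded\n";
print $decoded eq $original ? "round trip ok\n" : "round trip FAILED\n";
```

Because the backslash is excluded from the "normal" range, the encoded form stays unambiguous even when the input already contains text that looks like an escape.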
By now, I'm sure I must have sufficiently misunderstood either the
task or Perl's abilities to accomplish it, so I'll leave it at this.
Hope this helps!
This is about it.
Now I'm not familiar with the \x20.. notations but this gives me
something to play with.
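On the \x20 notation: \xHH stands for the character whose code is the hexadecimal number HH, so \x20 is a space, \x5b and \x5d are the square brackets on either side of the backslash (\x5c), and \x7e is the tilde. A small sketch to play with; the variable name $needs_escape is made up for illustration:

```perl
use strict;
use warnings;

# \xHH in a string or regex is the character with hexadecimal code HH.
printf "\\x%02x is '%s' (decimal %d)\n", $_, chr($_), $_
    for 0x20, 0x5b, 0x5c, 0x5d, 0x7e;

# The character class from the substitution: it matches anything that is
# NOT a newline and NOT printable ASCII other than the backslash.
my $needs_escape = qr/[^\n\x20-\x5b\x5d-\x7e]/;

for my $char ("a", " ", "~", "\\", "\t", "\xe9") {
    printf "ord %3d: %s\n", ord($char),
        $char =~ $needs_escape ? "escaped" : "left alone";
}
```

The two ranges \x20-\x5b and \x5d-\x7e together cover printable ASCII with a one-character gap at \x5c, which is how the backslash ends up among the escaped characters.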