On 6/16/07, Tom Allison <[EMAIL PROTECTED]> wrote:
> I'm trying to do some regular expression on strings in email. They could be encoded to something. But I can't tell because I don't have a utf8 unicode xterm window that will show me anything. At best I get ?????a?? and other trash like that. I think this is typical for ascii text renderings of two-bit characters.
But, I think what you're saying is, you want to be able to tell whether today's ?????a?? is the same mystery word that looked like ?????a?? in yesterday's mail, right? That is, you still won't know what it is, but at least you'll be able to say you saw it again.
> I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:
>
>     sub _quote_bytea {
>         my ($str) = @_;
>         my $buf = "";
>         foreach my $char (split(//,$str)) {
>             my $oct = sprintf ("%lo", ord($char));
>             if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
>             if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
>             $buf .= '\\\\\\\\' . $oct;
>         }
>         return $buf;
>     }
So that's somebody else's code, not yours? Does that code have any comments that explain what it's doing? What does "_quote_bytea" mean?

That looks to me like it's replacing each character with four backslashes and at least three octal digits. The two ifs are confusing me. Do you know about leading zeroes in sprintf formats?

    my $oct = sprintf ("%03lo", ord($char));  # maybe?

    my $buf = join "", map sprintf("\\\\\\\\%03lo", ord($_)), split //, $str;  # ???

    $str =~ s{(.)}{ sprintf "\\\\\\\\%03lo", ord($1) }seg;  #???

But it sounds to me as if you don't want that particular string; you want any function that gives you a lossless, repeatable coding of your input string, but unlike the input string, the desired result is composed only of printable characters. Yes? And presumably, compactness and readability are also desirable features of the encoded string.

    # Encode everything except the "normal" ASCII
    # characters. Normal includes newline and space, but no other
    # inkless characters. Normal does not include backslash.
    ###UNPORTABLE### Newline character is machine-dependent
    $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;

By now, I'm sure I must have sufficiently misunderstood either the task or Perl's abilities to accomplish it, so I'll leave it at this. Hope this helps!

--Tom Phoenix
Stonehenge Perl Training
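
One way to check that the last substitution above really is lossless is to pair it with a decoder and round-trip a sample string. The following is only a sketch: the names encode_goo and decode_goo and the sample input are made up for illustration and are not part of the original message or of SpamAssassin. It relies on the fact that the encoder escapes every backslash, so any backslash left in the encoded text must begin a \{hex} sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: encode every character outside the "normal"
# printable ASCII set as \{hex}. Newline and space are left alone;
# backslash is always encoded (as \{5c}).
sub encode_goo {
    my ($str) = @_;
    $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
    return $str;
}

# Hypothetical helper: reverse encode_goo. Every backslash in encoded
# text starts a \{hex} sequence, so one substitution is enough.
sub decode_goo {
    my ($str) = @_;
    $str =~ s/\\\{([0-9a-f]+)\}/chr(hex($1))/eg;
    return $str;
}

# Sample input (made up): a high-bit byte, a backslash, and a tab.
my $raw     = "caf\xe9 \\ tab\there\n";
my $encoded = encode_goo($raw);

print $encoded;    # prints: caf\{e9} \{5c} tab\{9}here
print decode_goo($encoded) eq $raw
    ? "round trip ok\n"
    : "round trip FAILED\n";
```

Because the same input always yields the same printable output, the encoded form can be compared or logged from one day's mail to the next, and the decoder recovers the original whenever it is needed.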