Re: character encoding & regex

Tom Allison Sat, 16 Jun 2007 18:58:21 -0700


On Jun 16, 2007, at 6:05 PM, Tom Phoenix wrote:

On 6/16/07, Tom Allison <[EMAIL PROTECTED]> wrote:
I'm trying to do some regular expression on strings in email. They
could be encoded to something. But I can't tell because I don't have a
utf8 unicode xterm window that will show me anything. At best I get
?????a?? and other trash like that. I think this is typical for ascii
text renderings of two-bit characters.
But, I think what you're saying is, you want to be able to tell
whether today's ?????a?? is the same mystery word that looked like
?????a?? in yesterday's mail, right? That is, you still won't know
what it is, but at least you'll be able to say you saw it again.

This is exactly what I'm trying to do. I just want to know if I've seen the same string previously.

I found something that SpamAssassin uses to convert all this "goo" into a
repeatable set of characters (which is all I'm really after) by running
something that looks like this:

sub _quote_bytea {
     my ($str) = @_;
     my $buf = "";
     foreach my $char (split(//,$str)) {
         my $oct = sprintf ("%lo", ord($char));
         if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
         if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
         $buf .= '\\\\\\\\' . $oct;
     }
     return $buf;
}
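
The two length checks in that routine do the same job as a zero-padded sprintf width. A minimal sketch of an equivalent, assuming the four-backslash prefix shown in the posted code is what is wanted; quote_octal is a made-up name, not SpamAssassin's:

```perl
use strict;
use warnings;

# Hypothetical helper (not SpamAssassin's code): intended to give the same
# output as the _quote_bytea routine quoted above, with the zero-padding
# done by sprintf's %03o instead of two length checks.
sub quote_octal {
    my ($str) = @_;
    my $prefix = '\\\\\\\\';    # four literal backslashes, as in the posted code
    return join '', map { $prefix . sprintf('%03o', ord $_) } split //, $str;
}

# "a" is octal 141 and newline is octal 012.
print quote_octal("a\n"), "\n";
```

Every input character becomes the prefix plus at least three octal digits, so the same input always yields the same printable output.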


So that's somebody else's code, not yours? Does that code have any
comments that explain what it's doing? What does "_quote_bytea" mean?

SpamAssassin. But they have very few comments and not many of them are very clear.

But it sounds to me as if you don't want that particular string; you
want any function that gives you a lossless, repeatable coding of your
input string, but unlike the input string, the desired result is
composed only of printable characters. Yes? And presumably,
compactness and readability are also desirable features of the encoded
string.

 # Encode everything except the "normal" ASCII
 # characters. Normal includes newline and space, but no other
 # inkless characters. Normal does not include backslash.
 ###UNPORTABLE### Newline character is machine-dependent
 $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
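
To see what that substitution produces, here is a minimal sketch that wraps it in a sub and adds a matching decoder. The names encode_printable and decode_printable and the sample string are made up for illustration, and the decoder is only an assumption about how one might reverse the \{hex} escapes; it is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub encode_printable {
    my ($str) = @_;
    $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
    return $str;
}

# Assumed inverse: turn each \{hex} escape back into its character.
sub decode_printable {
    my ($str) = @_;
    $str =~ s/\\\{([0-9a-f]+)\}/chr hex $1/ge;
    return $str;
}

# Sample input: a Latin-1 e-acute, a literal backslash, and a tab.
my $raw     = "caf\xe9 \\ tab\there";
my $encoded = encode_printable($raw);

print "$encoded\n";    # caf\{e9} \{5c} tab\{9}here
print decode_printable($encoded) eq $raw ? "round trip ok\n" : "mismatch\n";
```

Because backslash is itself encoded as \{5c}, every backslash left in the output starts an escape, which is what should make the encoding reversible.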

By now, I'm sure I must have sufficiently misunderstood either the
task or Perl's abilities to accomplish it, so I'll leave it at this.
Hope this helps!



This is about it.

Now I'm not familiar with the \x20.. notations but this gives me something to play with.
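
For reference, \xNN in a Perl regex or double-quoted string names the character whose code is the hexadecimal number NN, so \x20-\x5b and \x5d-\x7e are two ranges of printable ASCII with \x5c (backslash) left out. A small sketch to play with; the variable names are made up for illustration:

```perl
use strict;
use warnings;

# Show the characters at the edges of the two ranges, plus the one skipped.
foreach my $code (0x20, 0x5b, 0x5c, 0x5d, 0x7e) {
    printf "\\x%02x = decimal %3d = '%s'\n", $code, $code, chr($code);
}

# Collect every single-byte code the class treats as "normal".
my @kept = grep { chr($_) =~ /[\n\x20-\x5b\x5d-\x7e]/ } 0 .. 255;
print scalar(@kept), " characters are left unencoded\n";
```

The count should come out to newline plus space through tilde, minus the backslash.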

