Re: character encoding & regex

Tom Allison Sun, 17 Jun 2007 04:54:15 -0700

I got somewhere with this:
From: =?Big5?B?obS2Uq/5r3Wk6KtLLLKjpmGqvbBl?= <>
translates to

From: \\{a1}\\{b4}\\{b6}R\\{af}\\{f9}\\{af}u\\{a4}\\{e8}\\{ab}K,\\{b2}\\{a3}\\{a6}a\\{aa}\\{bd}\\{b0}e <>

which still means nothing to me. But at least I can pick it apart, I think.

I want to match everything in this list that is either \w or \\\{[\da-f]{2}\} (which is ugly)...

to give me three matches:
From
\\{a1}\\{b4}\\{b6}R\\{af}\\{f9}\\{af}u\\{a4}\\{e8}\\{ab}K
\\{b2}\\{a3}\\{a6}a\\{aa}\\{bd}\\{b0}e

foreach ($string =~ /((?:\w|\\\{[\da-f]{2}\})+)/ig) {
    print "$_\n";
}
Seems to work.
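
For reference, a self-contained sketch of that approach. It assumes the escaped line was produced by the \{xx} substitution quoted further down, so each escape carries exactly one backslash. The sub names escape_bytes and tokens are made up for illustration, MIME::Base64 is used only to rebuild the bytes of the From header above, and %02x is used instead of %x so that bytes below 0x10 also come out as two hex digits.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: escape every byte outside "normal" ASCII
# (and the backslash itself) as \{xx}, two lowercase hex digits.
sub escape_bytes {
    my ($str) = @_;
    $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%02x}", ord($1) }seg;
    return $str;
}

# Hypothetical helper: return every run of word characters and \{xx} escapes.
sub tokens {
    my ($str) = @_;
    return $str =~ /((?:\w|\\\{[\da-f]{2}\})+)/ig;
}

# The base64 payload of the Big5 encoded-word in the From header.
my $raw  = decode_base64('obS2Uq/5r3Wk6KtLLLKjpmGqvbBl');
my $line = 'From: ' . escape_bytes($raw) . ' <>';

# Prints one token per line.
print "$_\n" for tokens($line);
```

The comma between the two escaped runs is plain ASCII and is not matched by \w, which is why the name splits into two tokens.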


And then there are other issues like:

%6d%32%32%36%35%35%34%31%31%2e%6d%79%77%65%62%2e%68%69%6e%65%74%2e%6e%65%74

which is easy to do as it's url encoded
and
&#25298;&#20449;PT

Which I don't even know what this might be... HTML code? It's in a mailto: subject line.
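
Both forms can be undone with a single substitution each. The second one is a pair of HTML numeric character references: each &#NNNN; holds a decimal Unicode code point, so &#25298;&#20449; names U+62D2 and U+4FE1. A minimal sketch follows; the sub names url_decode and decode_numeric_refs are hypothetical, and it assumes only numeric references (no named entities such as &amp;) need handling.

```perl
use strict;
use warnings;

# Hypothetical helper: turn %XX escapes back into bytes.
sub url_decode {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Hypothetical helper: turn decimal (&#NNN;) and hex (&#xHH;) numeric
# character references into characters, in a single pass so that
# decoded text is never decoded a second time.
sub decode_numeric_refs {
    my ($str) = @_;
    $str =~ s/&#(?:x([0-9a-f]+)|(\d+));/chr(defined $1 ? hex($1) : $2)/gie;
    return $str;
}

print url_decode('%6d%32%32%36%35%35%34%31%31%2e%6d%79%77%65%62%2e%68%69%6e%65%74%2e%6e%65%74'), "\n";

# Print code points rather than the characters themselves, so the
# result is readable on a terminal that cannot display them.
my $subject = decode_numeric_refs('&#25298;&#20449;PT');
printf "U+%04X ", ord($_) for split //, $subject;
print "\n";
```

The decoded subject is a character string, not bytes, so it would need an explicit encoding layer (for example via binmode) before being printed as text.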


What a mess....
But it's progress!!!


On Jun 16, 2007, at 6:05 PM, Tom Phoenix wrote:


On 6/16/07, Tom Allison <[EMAIL PROTECTED]> wrote:

I'm trying to do some regular expression on strings in email. They
could be encoded to something. But I can't tell because I don't have a
utf8 unicode xterm window that will show me anything. At best I get
?????a?? and other trash like that. I think this is typical for ascii
text renderings of two-bit characters.


But, I think what you're saying is, you want to be able to tell
whether today's ?????a?? is the same mystery word that looked like
?????a?? in yesterday's mail, right? That is, you still won't know
what it is, but at least you'll be able to say you saw it again.

I found something that SpamAssassin uses to convert all this "goo"
into a repeatable set of characters (which is all I'm really after) by
running something that looks like this:

sub _quote_bytea {
     my ($str) = @_;
     my $buf = "";
     foreach my $char (split(//,$str)) {
         my $oct = sprintf ("%lo", ord($char));
         if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
         if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
         $buf .= '\\\\\\\\' . $oct;
     }
     return $buf;
}


So that's somebody else's code, not yours? Does that code have any
comments that explain what it's doing? What does "_quote_bytea" mean?

That looks to me like it's replacing each character with four
backslashes and at least three octal digits. The two ifs are confusing
me. Do you know about leading zeroes in sprintf formats?

 my $oct = sprintf ("%03lo", ord($char));  # maybe?

 my $buf = join "",
   map sprintf("\\\\\\\\%03lo", ord($_)),
   split //, $str;              # ???

 $str =~ s{(.)}{ sprintf "\\\\\\\\%03lo", ord($1) }seg;   #???
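
A quick way to check that these rewrites behave like the posted loop is to wrap each one in a sub and compare the results. This is only a sketch: the sub names quote_loop, quote_sprintf, and quote_subst are invented here, and the backslash counts are copied from the snippets as they appear in this message.

```perl
use strict;
use warnings;

# The loop as posted, padding the octal value by hand.
sub quote_loop {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//, $str)) {
        my $oct = sprintf("%lo", ord($char));
        if (length($oct) < 2) { $oct = '0' . $oct; }
        if (length($oct) < 3) { $oct = '0' . $oct; }
        $buf .= '\\\\\\\\' . $oct;
    }
    return $buf;
}

# Same result, letting sprintf do the zero padding.
sub quote_sprintf {
    my ($str) = @_;
    return join "",
        map sprintf("\\\\\\\\%03lo", ord($_)),
        split //, $str;
}

# Same result as an in-place substitution.
sub quote_subst {
    my ($str) = @_;
    $str =~ s{(.)}{ sprintf "\\\\\\\\%03lo", ord($1) }seg;
    return $str;
}

my $sample = "A\n\xa1";
print quote_loop($sample), "\n";
print "all three agree\n"
    if quote_loop($sample) eq quote_sprintf($sample)
    && quote_loop($sample) eq quote_subst($sample);
```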

But it sounds to me as if you don't want that particular string; you
want any function that gives you a lossless, repeatable coding of your
input string, but unlike the input string, the desired result is
composed only of printable characters. Yes? And presumably,
compactness and readability are also desirable features of the encoded
string.

 # Encode everything except the "normal" ASCII
 # characters. Normal includes newline and space, but no other
 # inkless characters. Normal does not include backslash.
 ###UNPORTABLE### Newline character is machine-dependent
 $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
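
Because the backslash itself is escaped, every backslash left in the encoded string starts an escape, so the coding can be reversed. A minimal round-trip sketch, assuming a byte string as input; the sub names encode_printable and decode_printable are hypothetical.

```perl
use strict;
use warnings;

# Encode everything except "normal" ASCII as \{hex}, as in the
# substitution above.
sub encode_printable {
    my ($str) = @_;
    $str =~ s{([^\n\x20-\x5b\x5d-\x7e])}{ sprintf "\\{%x}", ord($1) }seg;
    return $str;
}

# Reverse the encoding: each \{hex} becomes the character it names.
sub decode_printable {
    my ($str) = @_;
    $str =~ s/\\\{([0-9a-f]+)\}/chr(hex($1))/ge;
    return $str;
}

my $input   = "tab\there, backslash \\ and bytes \xa1\xb4";
my $encoded = encode_printable($input);

print "$encoded\n";
print decode_printable($encoded) eq $input
    ? "round trip ok\n"
    : "round trip FAILED\n";
```

Note that %x yields a single digit for characters below 0x10 (a tab becomes \{9}), so a pattern that insists on exactly two hex digits would skip those escapes.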

By now, I'm sure I must have sufficiently misunderstood either the
task or Perl's abilities to accomplish it, so I'll leave it at this.
Hope this helps!

--Tom Phoenix
Stonehenge Perl Training

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


