I've made a bunch of changes to get_decoded_stripped_body_text_array(). First, rather than decoding hex entities (the &#xHH; form) directly to ASCII characters, I changed it to convert them to decimal entities before the decimal entities are replaced. Thus &#x94; will get converted first to &#148; and then to a double quote, rather than being converted to the 8-bit character with value 148.
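To illustrate the two-step conversion, here is a minimal standalone sketch. The helper name normalize_quotes() is made up for this example and is not in the patch; it runs only the hex-to-decimal step and a cut-down version of the double-quote rule.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of the patch.
sub normalize_quotes {
    my ($text) = @_;

    # Step 1: rewrite hex entities as decimal ones (&#x94; becomes &#148;)
    $text =~ s/&#x([a-f0-9]+);/"&#" . hex($1) . ";"/ieg;

    # Step 2: the decimal rule, with an optional leading zero, now
    # catches both the hex and the decimal spellings
    $text =~ s/&#0?14[78];/"/g;

    return $text;
}

# Prints: "Free" and "easy"
print normalize_quotes('&#x93;Free&#x94; and &#0147;easy&#148;'), "\n";
```

Without step 1, the hex forms would fall through to the generic chr() conversion and come out as 8-bit characters that the quote rule never sees.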
Then I changed all decimal entity replacements of the form "&#0XYZ;" to "&#0?XYZ;", so that the leading zero is optional. I added some more HTML entity replacements, for the non-breaking space, back-tick, en-dash, em-dash, and tilde. I also added replacements for the (C), (R) and (TM) entities, though that was just for the sake of completeness, and is probably unnecessary. I added a replacement to change <Q> and </Q> to a double quote. I added replacements to turn an <HR> or two consecutive <BR>s into a paragraph break.

Next I made some big changes to the anchor HREF extracting regexp. It was:

    <a\s+href\s*=\s*["']?(.*?)["']\s*>

But this has a few problems. It will miss anchors like these:

    <A target="_blank" href="http://foobar.com">
    <A href="http://foobar.com" target="_blank">

I changed it so that it can find the HREF anywhere inside of an anchor tag. Also, an HREF with leading and trailing whitespace inside of quotes will work:

    <A href=" http://foobar.com ">

so that had to be taken into account.

Then I extended that regexp, and added a few more, to extract URIs from <AREA>, <LINK>, <IMG>, <FRAME>, <IFRAME>, <EMBED>, <SCRIPT> and <FORM>. I also added a regexp to extract the HREF in a <BASE> tag as BASEURI:mumble, and changed do_body_uri_tests() so that it uses this.

Finally, I added a regexp to get rid of HTML comments. I did this because otherwise a spammer could add something like:

    <!-- <BASE HREF="http://fake-uri.com/"> -->

to trick SA. They could also do something like:

    SECTION <!-- foobar --> 301

to escape detection by the SECTION_301 rule. To remove HTML comments, I had to use a non-greedy regexp:

    <!--.*?-->

and non-greedy regexps are supposed to be expensive. Also, a spammer sending a plain text message could put "<!--" at the front of the message and "-->" at the end in order to get the whole message ignored. We should probably change get_decoded_stripped_body_text_array() so it only applies HTML related mungings to HTML sections of the message.
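For anyone who wants to poke at the new HREF matching outside of SA, here is a small standalone sketch. The pattern is copied from $URI_in_tag in the patch; the helper mark_anchor_uris() is made up for this example, and it applies only the comment stripping and the <A>/<AREA>/<LINK> rule.

```perl
use strict;
use warnings;

# Same pattern as $URI_in_tag in the patch
my $URI_in_tag = qr/\s*=\s*["']?\s*([^'">\s]*)\s*["']?[^>]*/;

# Hypothetical helper for illustration only; not part of the patch.
sub mark_anchor_uris {
    my ($html) = @_;

    # Drop comments first, so tags hidden inside them are never seen
    $html =~ s/<!--.*?-->//gs;

    # Find href anywhere inside the tag and keep just the URI
    $html =~ s/<(?:a|area|link)\s[^>]*\bhref$URI_in_tag>/URI:$1 /gis;

    return $html;
}

for my $tag (
    '<A target="_blank" href="http://foobar.com">',
    '<A href="http://foobar.com" target="_blank">',
    '<A href=" http://foobar.com ">',
    '<!-- <A href="http://fake-uri.com/"> -->',
) {
    printf "%-48s => '%s'\n", $tag, mark_anchor_uris($tag);
}
```

The first three tags should each come out as "URI:http://foobar.com " and the commented-out one as an empty string.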
Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
===================================================================
RCS file: /cvsroot/spamassassin/spamassassin/lib/Mail/SpamAssassin/PerMsgStatus.pm,v
retrieving revision 1.85
diff -u -3 -p -r1.85 PerMsgStatus.pm
--- lib/Mail/SpamAssassin/PerMsgStatus.pm  19 Mar 2002 22:38:45 -0000  1.85
+++ lib/Mail/SpamAssassin/PerMsgStatus.pm  22 Mar 2002 06:14:48 -0000
@@ -689,6 +706,19 @@ sub get_decoded_body_text_array {
 
 ###########################################################################
 
+# A URI can be like:
+#
+#   href=foo.htm
+#   href = foo.htm
+#   href="foo.htm"
+#   href='foo.htm'
+#   href = 'foo.htm'
+#   href = ' foo.htm '
+#
+# and such. Have to deal with all of it
+#
+my $URI_in_tag = qr/\s*=\s*["']?\s*([^'">\s]*)\s*["']?[^>]*/;
+
 sub get_decoded_stripped_body_text_array {
   my ($self) = @_;
   local ($_);
@@ -709,34 +739,69 @@ sub get_decoded_stripped_body_text_array
     $text .= $_;
   }
 
-  $text =~ s/=\r?\n//gis; # QP line endings
-  # sort out escaped QP markup
-  $text =~ s/=20/ /gis;
-  $text =~ s/=3E/>/gis; # spam trick, disguise HTML
-  $text =~ s/=[0-9a-f][0-9a-f]//gis;
-
-  $text =~ s/\n\n+/<p>/gs; # keep paragraph breaks
-
-  # strip HTML tags and entities
-  $text =~ s/(?:\&\#0147;|\&\#0148;|\&quot;)/"/gs;
-  $text =~ s/\&\#0146;/'/gs;
+  # Get rid of comments. Isn't the non-greedy ".*?" awful expensive?
+  #
+  # There might be things in comments we'd want to look at, like
+  # SCRIPT and STYLE content, but that can be taken care of with
+  # rawbody tests.
+  $text =~ s/<!--.*?-->//gs;
+
+  # Try to put paragraph breaks where'd they'd be in HTML. There's
+  # an optional "/" before the ends of some tags in case it's XML style.
+  $text =~ s/<BR\/?>\s*<BR\/?>/\n\n/gis; # Two line breaks
+  $text =~ s/<HR\/?>/\n\n/gis; # Horizontal line
+
+  # Keep paragraph breaks
+  $text =~ s/\n\n+/<p>/gs;
+
+  # Convert hex entities to decimal equivalents, so that the specific
+  # decimal regexps will match
+  $text =~ s/\&\#x([a-f0-9]+);/"&#" . hex($1) . ";"/ieg;
+
+  # Convert specific HTML entities
+  $text =~ s/(?:\&nbsp;|\&\#0?160;)/ /gis;
+  $text =~ s/(?:\&reg;|\&\#0?174;)/(R)/gis;
+  $text =~ s/(?:\&copy;|\&\#0?169;)/(C)/gis;
+  $text =~ s/\&\#0?153;/(TM)/gs;
+  $text =~ s/(?:\&\#0?147;|\&\#0?148;|\&\#0?132|\&quot;)/"/gis;
+  $text =~ s/\&\#0?146;/'/gs;
+  $text =~ s/\&\#0?145;/`/gs;
+  $text =~ s/\&\#0?150;/-/gs; # En-dash
+  $text =~ s/\&\#0?151;/--/gs; # Em-dash
+  $text =~ s/\&\#0?152;/~/gs;
   $text =~ s/\&\#82(?:16|17|20|11);//gs;
+
+  # Convert <Q> tags
+  $text =~ s/<\/?Q\b[^>]*>/"/gis;
 
-  # convert decimal and hex entities to their characters
+  # Convert decimal entities to their characters
   $text =~ s/\&\#(\d+);/chr($1)/eg;
-  $text =~ s/\&\#x([a-f0-9]+);/chr(hex($1))/eg;
 
-  # no idea what this is meant to do? Strip broken entities perhaps?
+  # Strip all remaining HTML entities
   $text =~ s/\&[-_a-zA-Z0-9]+;/ /gs;
 
   # join all consecutive whitespace into a single space
-  $text =~ s/\s+/ /gs;
+  $text =~ s/\s+/ /sg;
 
-  $text =~ s/<p>/\n\n/gis; # reinsert para breaks
-
-  $text =~ s/<a\s+href\s*=\s*["']?(.*?)["']\s*>/URI:$1 /gis;
+  # reinsert para breaks
+  $text =~ s/<p>/\n\n/gis;
+
+  # Get rid of "BASEURI:" in case spammers insert it raw to try to
+  # mess us up
+  $text =~ s/BASEURI://sg;
+
+  # Extract URIs from various HTML tags, so that they'll still be there
+  # when the URI tests are done.
+  # <A>, <AREA>, <BASE> and <LINK> use "href=URI".
+  # <IMG>, <FRAME>, <IFRAME>, <EMBED> and <SCRIPT> use "src=URI"
+  # <FORM> uses "action=URI"
+  $text =~ s/<base\s[^>]*\bhref$URI_in_tag>/BASEURI:$1 /ogis;
+  $text =~ s/<(?:a|area|link)\s[^>]*\bhref$URI_in_tag>/URI:$1 /ogis;
+  $text =~ s/<(?:img|i?frame|embed|script)\s[^>]*\bsrc$URI_in_tag>/URI:$1 /ogis;
+  $text =~ s/<form\s[^>]*\baction$URI_in_tag>/URI:$1 /ogis;
+
   # Get rid of all remaing HTML and XML tags
   $text =~ s/<[?!\s]*[:a-z0-9]+\b[^>]*>//gis;
   $text =~ s/<\/[:a-z0-9]+>//gis;
@@ -757,6 +822,9 @@ sub get {
   my $getaddr = 0;
   if ($hdrname =~ s/:addr$//) { $getaddr = 1; }
 
+  my $getname = 0;
+  if ($hdrname =~ s/:name$//) { $getname = 1; }
+
   my @hdrs = $self->{msg}->get_header ($hdrname);
   if ($#hdrs >= 0) {
     $_ = join ("\n", @hdrs);
@@ -784,6 +852,11 @@ sub get {
       s/^.*?<(.+)>\s*$/$1/g # Foo Blah <jm@foo>
         or s/^(.+)\s\(.*?\)\s*$/$1/g; # jm@foo (Foo Blah)
 
+    } elsif ($getname) {
+      chomp; s/\r?\n//gs;
+      s/^[\'\"]*(.*?)[\'\"]*\s*<.+>\s*$/$1/g # Foo Blah <jm@foo>
+        or s/^.+\s\((.*?)\)\s*$/$1/g; # jm@foo (Foo Blah)
+
     } else {
       $_ = $self->mime_decode_header ($_);
     }
@@ -1081,12 +1154,25 @@ sub do_body_uri_tests {
 
   my $text = join('', @$textary);
   # warn("spam: /$uriRe/ $text\n");
-
+
+  my $base_uri = "http://";
   while ($text =~ /\G.*?(<$uriRe>|$uriRe)/gsoc) {
     my $uri = $1;
+
     $uri =~ s/^<(.*)>$/$1/;
+
+    # Use <BASE HREF="URI"> to turn relative links into
+    # absolute links
+    if ($uri =~ s/^BASEURI://i) {
+      $base_uri = $uri;
+
+      # Make sure it ends in a slash
+      $base_uri .= "/" unless($base_uri =~ /\/$/);
+      next;
+    }
+
     $uri =~ s/^URI://i;
-    $uri = "http://$uri" unless $uri =~ /^[a-z]+:/i;
+    $uri = "${base_uri}$uri" unless $uri =~ /^[a-z]+:/i;
     # warn("Got URI: $uri\n");
     push @uris, $uri;
   }