On Wed, Apr 26, 2023 at 03:21:50PM -0400, Kris Deugau wrote:
> http://deepnet.cx/~kdeugau/spamtools/cornell-birds.eml

Thanks. After adding some dbg() calls to HTML.pm in my SA 3.4.6, it seems
it is triggered by this part of the email:

<td ... background="none">

"background" is a deprecated (but still supported) HTML attribute:
https://www.w3.org/TR/html4/struct/global.html#adef-background


It seems to happen in this part of the SA HTML.pm code (dbg line added by 
myself):

sub html_uri {
  my ($self, $tag, $attr) = @_;

  use Data::Dumper; dbg ("/mn/ html_uri tag=$tag attr=" . Dumper($attr));
  
  # ordered by frequency of tag groups
  if ($tag =~ /^(?:body|table|tr|td)$/) {
    if (defined $attr->{background}) {
      $self->push_uri($tag, $attr->{background});
    }

My reading of the HTML specs (tested in Firefox and Chromium on Debian
Bullseye) is that "none" in background="none" is not a special value (as
the HTML author maybe intended), but is simply taken as a relative URI,
meaning a picture file with the literal name "none" in the same
directory as the HTML being viewed.

However, the issue is not restricted to that deprecated "background"
attribute. E.g. <img src="none"> or even <a href="none.com"> would likely
confuse SA in the same way.


The browser would treat them as relative URLs. 

I.e. if you were viewing "https://example.com/dir/example.html", those
two would resolve to:

<img src="none">    ==> https://example.com/dir/none
<a href="none.com"> ==> https://example.com/dir/none.com

instead of "http://www.none.com", as SA seems to do (and as a browser
might do if you typed "none.com" in the address bar -- but NOT if it was
invoked via HTML elements)
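
To make the resolution rule concrete, here is a minimal sketch in Python
(not SA's Perl, and not what SA currently does) using the standard
urllib.parse.urljoin. The base URL and the helper name resolve() are
made up for illustration; the third reference is just an extra example
of a path-absolute link.

```python
from urllib.parse import urljoin

# Hypothetical location of the HTML document being viewed.
BASE = "https://example.com/dir/example.html"

def resolve(ref, base_url=BASE):
    """Resolve an attribute value against the document URL,
    the way a browser treats src/href/background values."""
    return urljoin(base_url, ref)

# Print what each reference would resolve to.
for ref in ("none", "none.com", "/newdir/photo1.jpg"):
    print(f"{ref:20} ==> {resolve(ref)}")
```

Note that "none.com" is never turned into a hostname here; it is only a
file name relative to the directory of the base document.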

One should also read comments about "<base>" handling in that same
file.

Now, I see two ways to change SA behaviour here:

- simple but lacking: do not call push_uri() if the assumed URI does not
  look like an absolute URI (i.e. if it does not contain at least '//')
  
  This would avoid false positives, but would not add relative URIs.
  e.g. it might add:
  http://www.example.com/dir
  but it would NOT also add:
  http://www.example.com/newdir/photo1.jpg 
  if for example "<a href=/newdir/photo1.jpg>" was in there.

- complex but emulating browser behaviour better:
  Add full handling of relative URIs. i.e. have push_uri() detect all
  relative URIs and convert them to absolute URIs before adding them
  to the list of URIs.
  Might not be that hard in the base case, as $self->{base_href} seems
  to be saved, but what happens if there are, for example, multiple HTML
  attachments in the e-mail? Would/should it propagate? And if no
  "<base>" is specified, are those relative URIs invalid then?
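
The two options above could be sketched roughly like this. Again this is
Python for illustration only, not a patch against HTML.pm; the function
names looks_absolute() and uri_to_push() are hypothetical, and dropping
a relative URI when no base is known is just one possible answer to the
open question, not a claim about what SA should do.

```python
from urllib.parse import urljoin

def looks_absolute(uri):
    """Option 1 heuristic: treat a value as absolute only if it
    contains at least '//'."""
    return "//" in uri

def uri_to_push(raw, base_href=None):
    """Return the URI that would be recorded, or None to skip it.

    With base_href=None this behaves like option 1 (relative values
    are dropped). With a base it behaves like option 2 (relative
    values are resolved against the base first)."""
    raw = raw.strip()
    if looks_absolute(raw):
        return raw
    if base_href:
        return urljoin(base_href, raw)
    return None  # assumption: no base known, so skip the relative URI

# Option 1: no base, so "none" is skipped rather than guessed at.
print(uri_to_push("none"))
# Option 2: a base is known, so the relative link is resolved.
print(uri_to_push("/newdir/photo1.jpg", "http://www.example.com/dir"))
```

Either way, background="none" would no longer be turned into a lookup
for a "none" domain.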

-- 
Opinions above are GNU-copylefted.
