On Wed, Apr 26, 2023 at 03:21:50PM -0400, Kris Deugau wrote:
> http://deepnet.cx/~kdeugau/spamtools/cornell-birds.eml

Thanks. After adding some dbg() calls to HTML.pm of my SA 3.4.6, it
seems it is triggered by this part of the email:

  <td ... background="none">

"background" is a deprecated (but still supported) HTML attribute:
https://www.w3.org/TR/html4/struct/global.html#adef-background

It seems to happen in this part of the SA HTML.pm code (dbg line added
by myself):

  sub html_uri {
    my ($self, $tag, $attr) = @_;

    use Data::Dumper;
    dbg ("/mn/ html_uri tag=$tag attr=" . Dumper($attr));
    # ordered by frequency of tag groups
    if ($tag =~ /^(?:body|table|tr|td)$/) {
      if (defined $attr->{background}) {
        $self->push_uri($tag, $attr->{background});
      }

My reading of the HTML specs (tested in Firefox and Chromium on Debian
Bullseye) is that "none" in background="none" is not a special value
(as the HTML author maybe intended), but is simply taken as a relative
URI, meaning a picture file with the literal name "none" in the same
directory as the HTML being viewed.

However, the issue is not restricted to the deprecated "background"
attribute. E.g. <img src="none"> or even <a href="none.com"> would
likely confuse SA in the same way.

The browser would treat them as relative URLs. I.e. if you were viewing
"https://example.com/dir/example.html", those two would resolve to:

  <img src="none">     ==> https://example.com/dir/none
  <a href="none.com">  ==> https://example.com/dir/none.com

instead of "http://www.none.com" as SA seems to do (and as a browser
might do if you typed "none.com" in the address bar, but NOT if it was
invoked via HTML elements).

One should also read the comments about "<base>" handling in that same
file.

Now, I see two ways to change SA behaviour here:

- simple but lacking: do not call push_uri() if the assumed URI does
  not look like an absolute URI (i.e. if it does not contain at least
  '//').

  This would avoid false positives, but would not add relative URIs.
  E.g. it might add:
    http://www.example.com/dir
  but it would NOT also add:
    http://www.example.com/newdir/photo1.jpg
  if for example "<a href=/newdir/photo1.jpg>" was in there.

- complex but emulating browser behaviour better: add full handling of
  relative URIs. I.e. have push_uri() detect all relative URIs and
  convert them to absolute URIs before adding them to the list of URIs.

  It might not be that hard in the base case, as $self->{base_href}
  seems to be saved, but what happens if there are, for example,
  multiple HTML attachments in the e-mail? Would/should it propagate?
  And if no "<base>" is specified, are those relative URIs then invalid?

-- 
Opinions above are GNU-copylefted.
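
P.S. To make the browser-style resolution concrete, here is a minimal
sketch in Python (not SA code, and not a proposed patch). The function
name resolve_like_browser() is made up for illustration, and returning
None when no base is known is only one possible answer to the "no
<base>" question, an assumption on my part rather than anything SA does
today.

```python
from urllib.parse import urljoin, urlsplit


def resolve_like_browser(base_href, ref):
    """Resolve an attribute value (href/src/background) the way a
    browser would, relative to base_href.

    Hypothetical helper: returns an absolute URI string, or None when
    ref is relative and there is no base to resolve it against.
    """
    # A value that carries its own scheme is already absolute.
    if urlsplit(ref).scheme:
        return ref
    # Relative reference without a known base: assume it is unusable.
    if not base_href:
        return None
    # Standard relative-reference resolution against the base.
    return urljoin(base_href, ref)


if __name__ == "__main__":
    base = "https://example.com/dir/example.html"
    for value in ("none", "none.com", "/newdir/photo1.jpg"):
        print(value, "==>", resolve_like_browser(base, value))
```

With a base of "https://example.com/dir/example.html", both "none" and
"none.com" stay under https://example.com/dir/ and never turn into
"http://www.none.com".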