Bill Cole wrote:
> On 25 Feb 2021, at 13:37, Rick Cooper wrote:
>
>> I was just working on some rules to catch the current crop of
>> malformed urls used to escape detection by solutions that extract urls
>> from emails and compare them to known bad urls and I am wondering if
>> spamassassin's patterns for extraction take this into account?
>>
>> For instance:
>>
>> https:www.google.com/mail
>> https:\/www.google.com/mail
>> https:\\www.google.com/mail
>>
>> Will all work at getting you to gmail because the technical spec
>> doesn't actually require \\ after the colon.
>
> Of course not: A http: URI must NOT contain '\\' after the colon, it
> MUST contain '//' after the colon. See

Sorry, the \\ is a typo, since that would be the beginning of a UNC path
on a Windows box. As far as I can tell, the authority/path-abempty portion
of a URI is optional; when present it must begin with //, but it can be
empty. Hence https:www.google.com or https:\/www.google.com/. I have
noticed that every browser I tested normalizes these back to the
conventional //. But my question was: given that this is apparently a
problem for some solutions' parsing of URIs, does SA extract them? As both
you and John pointed out, it does, so I am happy.

> https://tools.ietf.org/html/rfc7230#section-2.7.1 which is the
> technical spec for the formal syntax of a http URI. OTOH, there are
> URI schemes which do not include '//' (e.g. mailto:) so any tool that
> is doing broad URI detection can't be too picky.
>
> What flavors of garbage almost-URIs will work in a browser very much
> depends on the whims of browser developers, and whether those are
> 'clickable' in your preferred MUA is dependent on the gullibility of
> your MUA author.
>
> SpamAssassin traditionally has assumed that there will always be some
> MUA and browser authors who lack any sense of caution or prudence, so
> SA is VERY loose with what it will consider as maybe being a hostname
> in something that could be a URI in some obscure or novel scheme.
>
>> Will spamassassin still extract and normalize the urls above?
>
> Yes, it will see all 3 as the same canonicalized URI.
>
>> I was hoping
>> to avoid digging through the source to find out.
>
> No need to dig through the source, you can see what URIs SpamAssassin
> detects (trimmed of the parts after the hostname) in a message by
> manually testing it with 'spamassassin -D uri'. Note that SA will only
> show one instance of otherwise identical URIs after trimming and
> canonicalization.
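
For anyone who wants to experiment with the idea outside of SA, here is a minimal Python sketch of the kind of normalization discussed above. It is not SpamAssassin's actual extraction code: the `canonicalize` helper and its regex are hypothetical, and they only handle the http/https cases from the examples in this thread.

```python
import re

# Hypothetical pattern (an assumption, not taken from SpamAssassin):
# match "http:" or "https:" followed by any run of '/' or '\' characters,
# including an empty run.
_SCHEME_SEP = re.compile(r'^(https?):[/\\]*', re.IGNORECASE)


def canonicalize(url: str) -> str:
    """Rewrite http(s) URLs so the scheme is followed by exactly '//'."""
    m = _SCHEME_SEP.match(url)
    if not m:
        # Other schemes (e.g. mailto:) are left untouched.
        return url
    return m.group(1).lower() + "://" + url[m.end():]


# The three malformed forms from the original question.
samples = [
    "https:www.google.com/mail",
    "https:\\/www.google.com/mail",
    "https:\\\\www.google.com/mail",
]

# Using a set mirrors the point that identical URIs collapse to one entry
# after canonicalization.
canonical = {canonicalize(u) for u in samples}
print(canonical)
```

All three samples collapse to a single entry, while a scheme that does not use '//' (such as mailto:) passes through unchanged, which is why a broad URI detector cannot simply insist on '//' after every colon.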