> > On Fri, May 12, 2023 at 05:32:30PM +0200, Reindl Harald wrote:
> > > On Fri, May 12, 2023 at 09:49:40AM -0500, Dave Funk wrote:
> > > > On Fri, 12 May 2023, Matija Nalis wrote:
> > > > > That is because those domains are not EQUAL? Or did you want a
> > > > > rule that checks only on SIMILAR domain names (e.g. with lowercase
> > > > > letter "L" replaced with number "1" as in your example)?
> > > > > > It should be relatively easy to write SA plugin for that:
> > > > and with *what* do you replace the "1"?
>
> With one of the similar looking characters. Doesn't really matter
> which one, but it needs to be done consistently. Personally I'd
> probably choose lowercase "L", but it can be anything.
>
> e.g. for simple first variant (i.e. for direct matching, not more
> advanced statistical similarity based approach suggested in later
> step)
>
> sub normalize_domain
> {
>     my ($domain) = @_;
>
>     # (yes I know we have tr///)
>     $domain =~ s/1/l/g;   # number 1 to lowercase "L"
>     $domain =~ s/I/l/g;   # uppercase "I" to lowercase "L"
>
>     return lc($domain);
> }
>
> [...]
>
> if (lc($domain1) ne lc($domain2)) {   # domains are NOT the same...
>     # ...but they LOOK the same
>     if (normalize_domain($domain1) eq normalize_domain($domain2)) {
>         add_spam_score("domain_is_not_same_but_looks_the_same");
>     }
> }
>
> so normalize_domain() would return the same string for "paypal.com",
> "PayPal.com", "PayPaI.com" or "PayPa1.com": i.e. "paypal.com"
>
> It doesn't matter if the result of it isn't the real domain (as it
> will be used only for comparison to similarly mangled other domain),
> e.g. if one had real domain "TheReallyBest1.com", it would be
> normalized to "thereallybestl.com" -- so while that is NOT how domain
> is really named, it doesn't matter, as it would still work for
> detecting fakes like "TheReallyBestI.com" (even though neither
> lowercase "L" nor uppercase "I" is used in the real domain name).
> > > > be careful with "relatively easy" when it comes to reality
>
> Sure, I thought I was. Do you spot problems with the code above?
> Think of any real-life examples where it would backfire or fail to work?
>
> The code like the above looks trivial to me ("relatively easy" was
> more geared toward statistical analyses of the words to return
> statistical score in percentage instead of simple fake/not_fake
> boolean like above; as it should take into account ordering of the
> letters, missed letters, duplicated letters, dyslexia-like reversal
> of two neighboring letters and similar psychological ways in which
> human mind can easily be fooled). Still might take a few weeks to make
> it to reasonably publishable shape...
>
> But I was more interested if SA already has something like that?
> I haven't dabbled in 4.0 yet, and there might be code already
> written to accomplish similar things, so it would be a waste to
> reinvent a wheel.
>
Hi Matija,

It is nice to see such interest in this topic. The goal is indeed to catch purposefully chosen domains that could mislead the recipient, like

j...@her0.com > ad...@hero.com

not to be confused with

i...@paypa1.com > ad...@hero.com

although the algorithms for both are probably very similar.

Catching stuff like paypa1.com could be a start. If you combine that with the knowledge that the received email is not from the same company but external, one could apply the checks to the sender/recipient combination.

For similar character sets, one could also look at password generators, which do exactly the opposite: they skip such characters in passwords.

If I am not mistaken, some registries are already using technology to try to catch phishing domains.
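To make the sender/recipient idea a bit more concrete, here is a rough sketch in Python (not a SpamAssassin plugin, and not tested against real mail). It combines the direct normalization from the quoted snippet with a simple similarity score based on edit distance with neighbor swaps, which is only one possible way to get the "percentage" mentioned above. All names here (CONFUSABLES, normalize_domain, osa_distance, similarity, classify) are made up for illustration. The "0" to "o" mapping is my addition for the her0.com case, and the 0.85 threshold is an arbitrary assumption.

```python
# Sketch only: hypothetical names, tiny confusable table, arbitrary threshold.

# Characters folded before comparison. "1" and "I" follow the quoted Perl
# snippet; "0" -> "o" is an assumption added for the her0.com example.
CONFUSABLES = str.maketrans({"1": "l", "I": "l", "0": "o"})


def normalize_domain(domain: str) -> str:
    """Fold look-alike characters, then lowercase (same order as the Perl)."""
    return domain.translate(CONFUSABLES).lower()


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and neighbor swap
    (optimal string alignment variant)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
            # reversal of two neighboring letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def similarity(a: str, b: str) -> float:
    """Score between 0.0 and 1.0 computed on the normalized domains."""
    a, b = normalize_domain(a), normalize_domain(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


def classify(sender: str, recipient: str, threshold: float = 0.85) -> str:
    """Compare the domains of a sender/recipient address pair.

    Returns "same", "lookalike" (equal after folding), "similar"
    (score at or above the assumed threshold) or "different".
    """
    s_dom = sender.rpartition("@")[2]
    r_dom = recipient.rpartition("@")[2]
    if s_dom.lower() == r_dom.lower():
        return "same"
    if normalize_domain(s_dom) == normalize_domain(r_dom):
        return "lookalike"
    if similarity(s_dom, r_dom) >= threshold:
        return "similar"
    return "different"
```

With this, a her0.com sender writing to a hero.com recipient would be classified as "lookalike", while a paypa1.com sender writing to hero.com is simply "different" for that pair and would need a separate check against a list of well-known brands. A real rule would also need a much larger confusable table and some thought about false positives on short domains.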