> 
> On Fri, May 12, 2023 at 05:32:30PM +0200, Reindl Harald wrote:
> > > On Fri, May 12, 2023 at 09:49:40AM -0500, Dave Funk wrote:
> > > > On Fri, 12 May 2023, Matija Nalis wrote:
> > > > > > That is because those domains are not EQUAL? Or did you want a
> > > > > > rule that checks only for SIMILAR domain names (e.g. with lowercase
> > > > > > letter "L" replaced with number "1" as in your example)?
> > >
> > > It should be relatively easy to write SA plugin for that:
> >
> > and with *what* do you replace the "1"?
> 
> With one of the similar looking characters. Doesn't really matter
> which one, but it needs to be done consistently. Personally I'd
> probably choose lowercase "L", but it can be anything.
> 
> e.g. for simple first variant (i.e. for direct matching, not more
> advanced statistical similarity based approach suggested in later
> step)
> 
> sub normalize_domain($)
> {
>   my ($domain) = @_;
> 
>   # (yes I know we have tr///)
>   $domain =~ s/1/l/g;    # number 1 to lowercase "L"
>   $domain =~ s/I/l/g;    # uppercase "I" to lowercase "L"
> 
>   return lc($domain);
> }
> 
> [...]
> 
> # domains are NOT the same...
> if (lc($domain1) ne lc($domain2)) {
>    # ...but they LOOK the same
>    if (normalize_domain($domain1) eq normalize_domain($domain2)) {
>       add_spam_score("domain_is_not_same_but_looks_the_same");
>    }
> }
> 
> so normalize_domain() would return the same string for "paypal.com",
> "PayPal.com", "PayPaI.com" or "PayPa1.com": i.e. "paypal.com"
> 
> It doesn't matter if the result isn't the real domain (as it will be
> used only for comparison with another, similarly mangled domain),
> e.g. if one had the real domain "TheReallyBest1.com", it would be
> normalized to "thereallybestl.com" -- so while that is NOT how the
> domain is really named, it doesn't matter, as it would still work for
> detecting fakes like "TheReallyBestI.com" (even though neither
> lowercase "L" nor uppercase "I" is used in the real domain name).
> 
> 
> > be careful with "relatively easy" when it comes to reality
> 
> Sure, I thought I was. Do you spot problems with the code above?
> Think of any real-life examples where it would backfire or fail to work?
> 
> The code like the above looks trivial to me ("relatively easy" was
> more geared toward statistical analyses of the words to return
> statistical score in percentage instead of simple fake/not_fake
> boolean like above; as it should take into account ordering of the
> letters, missed letters, duplicated letters, dyslexia-alike reversal
> of two neighboring letters and similar psychological ways in which
> human mind can easily be fooled). Still, it might take a few weeks to
> get it into reasonably publishable shape...
> 
> But I was more interested if SA already has something like that?
> I haven't dabbled in 4.0 yet, and there might be code already
> written to accomplish similar things, so it would be a waste to
> reinvent a wheel.
> 

Hi Matija, 

It is nice to see such interest in this topic. The goal is indeed to catch
purposefully chosen domains that could mislead the recipient, like
j...@her0.com > ad...@hero.com, not to be confused with
i...@paypa1.com > ad...@hero.com, although these algorithms are probably
very similar.

Catching stuff like paypa1.com could be a start. If you combine that with the
knowledge that the received email is not from the same company but external,
one could apply the checks to the sender/recipient combination.
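
A rough sketch of such a sender/recipient check, in plain Perl rather than as
an actual SA plugin. The helper names (domain_of, is_lookalike_sender) and the
character mapping (0 to o, 1 and I to l) are my own assumptions for
illustration, not anything SA provides:

```perl
use strict;
use warnings;

# Fold look-alike characters so visually similar domains compare equal.
# ASSUMPTION: the mapping (0 -> o, 1 -> l, I -> l) is only an illustration.
sub normalize_domain {
    my ($domain) = @_;
    $domain =~ tr/01I/oll/;    # done before lc() so uppercase "I" is still seen
    return lc $domain;
}

# Hypothetical helper: return the domain part of an address, or '' if none.
sub domain_of {
    my ($address) = @_;
    return $address =~ /\@([^\@\s>]+)/ ? $1 : '';
}

# Returns 1 when the sender is external (its domain differs from the
# recipient's) but the two domains look the same after normalization.
sub is_lookalike_sender {
    my ($sender, $recipient) = @_;
    my ($from, $to) = (domain_of($sender), domain_of($recipient));
    return 0 if $from eq '' || $to eq '';
    return 0 if lc($from) eq lc($to);    # really the same domain: internal mail
    return normalize_domain($from) eq normalize_domain($to) ? 1 : 0;
}
```

With placeholder local parts, is_lookalike_sender('user@her0.com',
'user@hero.com') returns 1, while mail from the genuine hero.com or from an
unrelated domain returns 0.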

For sets of similar-looking characters, one could also look at password
generators, which do exactly the opposite: they skip such characters in
passwords.
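
To illustrate the idea, a minimal sketch that drives the normalization from a
table of confusable groups, so that the list can be extended from whatever
such a generator excludes. The two groups and the name normalize_confusables
are assumptions on my part, not a vetted list:

```perl
use strict;
use warnings;

# Groups of characters that are easy to confuse; the first character of
# each group is the canonical form the others are folded into.
# ASSUMPTION: these two groups are examples only; extend as needed.
my @CONFUSABLE_GROUPS = ('l1I', 'o0O');

# Build a lookup table: confusable character -> canonical character.
my %CANONICAL;
for my $group (@CONFUSABLE_GROUPS) {
    my ($canonical, @others) = split //, $group;
    $CANONICAL{$_} = $canonical for @others;
}

# Replace every confusable character by its canonical form, then lowercase.
# Mapping happens before lc() so that uppercase "I" is not turned into "i".
sub normalize_confusables {
    my ($domain) = @_;
    my $mapped = join '',
        map { exists $CANONICAL{$_} ? $CANONICAL{$_} : $_ } split //, $domain;
    return lc $mapped;
}
```

Both normalize_confusables('PayPaI.com') and normalize_confusables('PayPa1.com')
give "paypal.com" here, and 'her0.com' gives "hero.com".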

If I am not mistaken, some registries are already utilizing technology to try 
and catch phishing domains.



