On Fri, May 12, 2023 at 05:32:30PM +0200, Reindl Harald wrote:
> > On Fri, May 12, 2023 at 09:49:40AM -0500, Dave Funk wrote:
> > > On Fri, 12 May 2023, Matija Nalis wrote:
> > > > That is because those domains are not EQUAL? Or did you want a
> > > > rule that checks only on SIMILAR domain names (e.g. with lowercase
> > > > letter "L" replaced with number "1" as in your example)?
> > 
> > It should be relatively easy to write an SA plugin for that:
> 
> and with *what* do you replace the "1"?

With one of the similar-looking characters. It doesn't really matter
which one, but it needs to be done consistently. Personally I'd
probably choose lowercase "L", but it can be anything.

E.g. for a simple first variant (i.e. direct matching, not the more
advanced statistical-similarity approach suggested in a later
step):

sub normalize_domain($)
{
  my ($domain) = @_;

  # (yes I know we have tr///)
  $domain =~ s/1/l/g;    # number 1 to lowercase "L"
  $domain =~ s/I/l/g;    # uppercase "I" to lowercase "L"

  return lc($domain);  
}

[...]

if (lc($domain1) ne lc($domain2)) { # domains are NOT the same...
   if (normalize_domain($domain1) eq normalize_domain($domain2)) { # ...but they LOOK the same
      add_spam_score("domain_is_not_same_but_looks_the_same");
   }
}

so normalize_domain() would return the same string for "paypal.com",
"PayPal.com", "PayPaI.com" or "PayPa1.com": i.e. "paypal.com"

It doesn't matter if the result isn't the real domain (as it
will be used only for comparison to another, similarly mangled domain).
E.g. if one had the real domain "TheReallyBest1.com", it would be
normalized to "thereallybestl.com". While that is NOT how the domain
is really named, it doesn't matter, as it would still work for
detecting fakes like "TheReallyBestI.com" (even though the real
domain name has neither a lowercase "L" nor an uppercase "I" in
that position).
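
For anyone who wants to try the idea stand-alone, here is a minimal sketch using tr/// instead of the two substitutions. The names normalize_domain_tr() and looks_alike() are made up for this example and are not part of any SA API; only the two character mappings from above are assumed.

```perl
use strict;
use warnings;

# Sketch only: same mapping as normalize_domain() above, written with tr///.
sub normalize_domain_tr {
    my ($domain) = @_;
    $domain =~ tr/1I/ll/;    # digit 1 and uppercase "I" both become lowercase "L"
    return lc($domain);
}

# Hypothetical helper: true only if the domains differ but normalize identically.
sub looks_alike {
    my ($d1, $d2) = @_;
    return 0 if lc($d1) eq lc($d2);    # really the same domain
    return normalize_domain_tr($d1) eq normalize_domain_tr($d2) ? 1 : 0;
}

for my $candidate (qw(PayPal.com PayPaI.com PayPa1.com)) {
    printf "%-12s %s\n", $candidate,
        looks_alike('paypal.com', $candidate) ? 'LOOKALIKE' : 'ok';
}
```

This should report "PayPal.com" as ok (it differs only in case) and flag "PayPaI.com" and "PayPa1.com" as lookalikes.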


> be careful with "relatively easy" when it comes to reality

Sure, I thought I was. Do you spot problems with the code above?
Think of any real-life examples where it would backfire or fail to work?

Code like the above looks trivial to me. ("Relatively easy" was
geared more toward statistical analysis of the words, returning a
similarity score as a percentage instead of a simple fake/not_fake
boolean like above. It should take into account the ordering of
letters, missed letters, duplicated letters, dyslexia-like reversal
of two neighboring letters, and similar psychological ways in which
the human mind can easily be fooled.) It still might take a few
weeks to get it into reasonably publishable shape...
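
To give a rough idea of what such a score could look like, here is a small sketch based on the optimal string alignment distance (a restricted Damerau-Levenshtein distance). That is just one well-known technique I'm swapping in for illustration, not a finished design: it counts missed, extra and replaced letters plus swaps of two neighboring letters, and knows nothing about visual similarity. The names osa_distance() and similarity_percent() are made up for this example.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Optimal string alignment distance: number of insertions, deletions,
# substitutions and swaps of two neighboring characters.
sub osa_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($m, $n) = (scalar @s, scalar @t);
    my @d;
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,            # missed letter
                $d[$i][$j - 1] + 1,            # extra (e.g. duplicated) letter
                $d[$i - 1][$j - 1] + $cost,    # replaced letter
            );
            if ($i > 1 && $j > 1
                && $s[$i - 1] eq $t[$j - 2] && $s[$i - 2] eq $t[$j - 1]) {
                # two neighboring letters swapped
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

# Similarity as a percentage of the longer name (case-insensitive).
sub similarity_percent {
    my ($d1, $d2) = map { lc $_ } @_;
    my $longest = max(length($d1), length($d2));
    return 100 if $longest == 0;
    return 100 * ($longest - osa_distance($d1, $d2)) / $longest;
}

printf "%-12s %5.1f%%\n", $_, similarity_percent('paypal.com', $_)
    for qw(paypal.com paypla.com paypall.com paypl.com);
```

A real plugin would still need to pick a threshold and combine this with the lookalike-character normalization above, which is exactly the part that would take the time.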

But I was more interested in whether SA already has something like that.
I haven't dabbled in 4.0 yet, and there might be code already
written to accomplish similar things, so it would be a waste to
reinvent the wheel.

-- 
Opinions above are GNU-copylefted.
