On Fri, May 12, 2023 at 05:32:30PM +0200, Reindl Harald wrote:
> > On Fri, May 12, 2023 at 09:49:40AM -0500, Dave Funk wrote:
> > > On Fri, 12 May 2023, Matija Nalis wrote:
> > > > That is because those domains are not EQUAL? Od did you wanted a
> > > > rule that checks only on SIMILAR domain names (e.g. with lowercase
> > > > letter "L" replaced with number "1" as in your example)?
> > > > It should be relatively easy to write SA plugin for that:
>
> and with *what* do you replace the "1"?

With one of the similar-looking characters. It doesn't really matter
which one, but it needs to be done consistently. Personally I'd
probably choose lowercase "L", but it can be anything.

E.g. for a simple first variant (i.e. for direct matching, not the more
advanced statistical similarity-based approach suggested in a later
step):

    sub normalize_domain($) {
        my ($domain) = @_;
        # (yes I know we have tr///)
        $domain =~ s/1/l/g;   # number 1 to lowercase "L"
        $domain =~ s/I/l/g;   # uppercase "I" to lowercase "L"
        return lc($domain);
    }

    [...]

    if (lc($domain1) ne lc($domain2)) {
        # domains are NOT the same...
        if (normalize_domain($domain1) eq normalize_domain($domain2)) {
            # ...but they LOOK the same
            add_spam_score("domain_is_not_same_but_looks_the_same");
        }
    }

So normalize_domain() would return the same string for "paypal.com",
"PayPal.com", "PayPaI.com" or "PayPa1.com", i.e. "paypal.com".

It doesn't matter if the result isn't the real domain, as it will be
used only for comparison against another similarly mangled domain.
E.g. if one had the real domain "TheReallyBest1.com", it would be
normalized to "thereallybestl.com" -- so while that is NOT how the
domain is really named, it doesn't matter, as it would still work for
detecting fakes like "TheReallyBestI.com" (even though neither the
lowercase "L" nor the uppercase "I" is used in the real domain name).

> be careful with "relatively easy" when it comes to reality

Sure, I thought I was. Do you spot problems with the code above? Can
you think of any real-life examples where it would backfire or fail to
work?

Code like the above looks trivial to me. "Relatively easy" was geared
more toward statistical analysis of the words, returning a similarity
score as a percentage instead of a simple fake/not_fake boolean like
above. It would have to take into account the ordering of the letters,
missed letters, duplicated letters, dyslexia-like reversal of two
neighboring letters, and similar psychological ways in which the human
mind can easily be fooled.

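For that statistical variant, something along these lines might be a
starting point. It is only a rough sketch under my own assumptions:
osa_distance() and similarity_percent() are names made up here (not
existing SA functions), and the measure used is optimal string
alignment distance (a restricted Damerau-Levenshtein), which is just
one possible choice. It counts missed, duplicated, and replaced
letters as well as swaps of two neighboring letters, and it does not
attempt to model anything more psychological than that.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Same lookalike folding as in the simple variant above.
sub normalize_domain {
    my ($domain) = @_;
    $domain =~ s/1/l/g;   # number 1 to lowercase "L"
    $domain =~ s/I/l/g;   # uppercase "I" to lowercase "L"
    return lc($domain);
}

# Hypothetical helper: optimal string alignment distance, i.e. the
# number of insertions, deletions, substitutions and swaps of two
# neighboring characters needed to turn $s into $t.
sub osa_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($m, $n) = (scalar @s, scalar @t);
    my @d;
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,             # deletion (missed letter)
                $d[$i][$j - 1] + 1,             # insertion (extra letter)
                $d[$i - 1][$j - 1] + $cost,     # substitution
            );
            if (   $i > 1 && $j > 1
                && $s[$i - 1] eq $t[$j - 2]
                && $s[$i - 2] eq $t[$j - 1])
            {
                # swap of two neighboring letters
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

# Hypothetical helper: similarity of two domains in percent, computed
# on the normalized forms so lookalike characters count as equal.
sub similarity_percent {
    my ($domain1, $domain2) = @_;
    my ($x, $y) = (normalize_domain($domain1), normalize_domain($domain2));
    my $longest = max(length($x), length($y));
    return 100 if $longest == 0;
    return 100 * ($longest - osa_distance($x, $y)) / $longest;
}
```

A rule could then score anything that is not identical but scores above
some threshold (say 85; that number is a guess and would need tuning
against real mail), e.g. similarity_percent("paypal.com", "paypla.com").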
Still, it might take a few weeks to get it into reasonably publishable
shape...

But I was more interested in whether SA already has something like
that. I haven't dabbled in 4.0 yet, and there might be code already
written to accomplish similar things, so it would be a waste to
reinvent the wheel.

-- 
Opinions above are GNU-copylefted.