Re: URI parser problems

John Hardin Tue, 05 Dec 2017 14:59:56 -0800

On Tue, 5 Dec 2017, RW wrote:

On Tue, 5 Dec 2017 16:25:28 -0500
Alex wrote:

Hi, I have the following rule that is used to detect some of the less
common URIs:

uri        URI_RARE_TLD     m;://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$);i
describe   URI_RARE_TLD     URI refers to rarely-nonspam TLD

The problem is that it is hitting patterns that aren't necessarily
URIs. This one matches on ".SPACE"

TIX400 ROH B.W.SPACE SHUTTLE IN

...

Should I submit a bug,


It's been discussed before. Not doing that would mean that spammers
could just leave off the protocol and avoid URI lists.


That's obviously a nonstarter.

Perhaps a smaller, useful step would be to have the parser require that the second-level domain name have more than one character.

How often would we see a valid registered domain name like "x.info", for example?
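To make the idea concrete, here is a minimal sketch of such a check in Python. It is not SpamAssassin's parser code; the function names are hypothetical, and it treats the label immediately left of the final one as the "second-level" name, which ignores multi-part suffixes such as co.uk.

```python
def second_level_label(host: str) -> str:
    """Return the label just left of the final label, or "" if there is none.

    Assumption: the final label is the TLD. Multi-part public suffixes
    (for example co.uk) are deliberately not handled in this sketch.
    """
    labels = host.rstrip(".").split(".")
    return labels[-2] if len(labels) >= 2 else ""


def accept_schemeless_host(host: str) -> bool:
    """Hypothetical filter: accept a bare (no-protocol) host only when its
    second-level label has more than one character."""
    return len(second_level_label(host)) > 1


if __name__ == "__main__":
    for candidate in ("B.W.SPACE", "example.space", "x.info"):
        print(candidate, accept_schemeless_host(candidate))
```

The cost is visible in the last case: a legitimately registered single-character name such as "x.info" would be dropped along with "B.W.SPACE".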

or does someone have other suggestions on how
to handle this?


It's a reason to exercise caution in scoring such rules.

Agreed. The rule in question could also require two chars before the final period, but that doesn't address the underlying issue with recognizing non-protocol domain names in body text.
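A rough sketch of that rule-side change, using Python's re module as a stand-in for the Perl regex in the rule. Assumptions: the TLD list is cut down to three entries for brevity, the "tightened" pattern is only one possible way to express the two-character requirement, and the parser is assumed to hand the bare text to uri rules with "http://" prepended.

```python
import re

# Shortened subset of the TLD alternation from URI_RARE_TLD (illustration only).
TLDS = "work|space|club"

# Pattern as in the posted rule: any run of non-slash chars before the TLD.
original = re.compile(r"://[^/]+\.(?:" + TLDS + r")(?:/|$)", re.I)

# Hypothetical variant: the label right before the final period must be at
# least two characters long; any further labels to its left are optional.
tightened = re.compile(
    r"://(?:[^/]*\.)?[^/.]{2,}\.(?:" + TLDS + r")(?:/|$)", re.I
)

if __name__ == "__main__":
    samples = (
        "http://B.W.SPACE",           # the false hit from the report
        "http://example.space/path",  # ordinary two-label host
        "http://www.example.club",    # host with a subdomain
    )
    for uri in samples:
        print(uri, bool(original.search(uri)), bool(tightened.search(uri)))
```

With this variant the "B.W.SPACE" sample no longer matches while the other two still do, but the parser would still be treating "B.W.SPACE" as a URI in the first place.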



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  When fascism comes to America, it will be wrapped in
  "Diversity" and demanding "Safe Spaces."             -- Mona Charen
-----------------------------------------------------------------------
 2 days until The 76th anniversary of Pearl Harbor
