Hi, I have the following rule, which is used to detect URIs in some of
the less common TLDs:

uri        URI_RARE_TLD     m;://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$);i
describe   URI_RARE_TLD     URI refers to rarely-nonspam TLD

The problem is that it is hitting text that isn't necessarily a URI.
For example, it matches on ".SPACE" in this line from a message body:

TIX400 ROH B.W.SPACE SHUTTLE IN

Dec  4 22:14:43.126 [15338] dbg: rules: ran uri rule URI_RARE_TLD
======> got hit: "://B.W.SPACE"
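
For anyone who wants to reproduce this outside SpamAssassin, here is a
minimal Python sketch of the same pattern. It assumes the URI parser
hands the rule something like "http://B.W.SPACE" for the bare token,
which is consistent with the "://B.W.SPACE" hit above but is not taken
from the parser source. The name rare_tld_hit is made up for this
example.

```python
import re

# Same pattern as the URI_RARE_TLD rule, with Perl's m;...;i
# delimiters replaced by re.IGNORECASE.
RARE_TLD = re.compile(
    r"://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja"
    r"|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party"
    r"|pro|bid|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$)",
    re.IGNORECASE,
)


def rare_tld_hit(uri):
    """Return the matched text, or None if the pattern does not hit."""
    m = RARE_TLD.search(uri)
    return m.group(0) if m else None


# Raw body line: it contains no "://", so the pattern alone cannot match.
print(rare_tld_hit("TIX400 ROH B.W.SPACE SHUTTLE IN"))  # None

# ASSUMPTION: roughly what the parser passes to uri rules for the token.
print(rare_tld_hit("http://B.W.SPACE"))  # ://B.W.SPACE
```

So the regex itself is behaving as written; the hit only happens once
the bare token has been turned into a URI before the rule runs.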

I asked John Hardin, the author of the rule, about this off-list. He
wrote the following and suggested I open it up to the list:

It looks like the parser knows about TLDs, and it's looking for stuff
that looks like hostnames even if there is not a protocol spec. It
would, for example, treat "B.W.com" in the body as a URI. It might be
a bit too eager.

It's possible that the aggressive URI parsing is risky now that IANA
has crapped all over the TLD list and made it a lot harder to
recognize text that looks like valid domain names and hostnames. The
consensus might be to open a bug to modify the behavior of the parser.

Should I submit a bug, or does someone have other suggestions on how
to handle this?
