Craig Morrison wrote:

>Philip Prindeville wrote:
>
>>I'm wondering what would be involved in putting in an HTML parser
>>that could call various rules to check things, like the case of:
>>
>><a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>
>>
>>where the link disagrees with the text between the anchor tags (yeah, you
>>could limit it to partial matches on the host-portion)...
>
>This is the functional equivalent of pissing in the wind. If you are
>downwind, you are going to get wet.
>
>Anchor text in too many/most cases will not match the HREF. grep is
>good, but it isn't good enough to catch all cases without significant
>overhead. Anchor text is a descriptor, nothing more than that. It is not
>a regurgitation of the link HREF.

Usually it's not. That's the point. It's when the anchor text tries to
look like a URL that one needs to be suspicious.

At the very least, if the anchor text starts with "https://" but the
anchor URL looks like "http://", I'd say that this is definite spam.

Does anyone have a way of doing a statistical analysis of ham that
contains http(s?):// as the beginning of the anchor text?

-Philip
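
To make the proposed check concrete, here is a minimal sketch in Python. It assumes the standard-library html.parser and urllib.parse modules, not anything that exists in SpamAssassin. The names AnchorCollector, suspicious_anchors, and url_like_rate are hypothetical and chosen only for illustration. The rule fires only when the anchor text itself starts with http:// or https://, so ordinary descriptive anchor text is ignored.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Anchor text that itself starts like a URL (the suspicious case in this thread).
URL_LIKE = re.compile(r"^\s*https?://", re.IGNORECASE)


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_anchors(html):
    """Parse an HTML body and return its (href, text) anchor pairs."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors


def suspicious_anchors(html):
    """Return (href, text, reason) where URL-like anchor text disagrees with href."""
    hits = []
    for href, text in collect_anchors(html):
        if not URL_LIKE.match(text):
            continue  # descriptive text: outside the scope of this rule
        try:
            shown, real = urlsplit(text.split()[0]), urlsplit(href)
        except ValueError:
            continue  # malformed URL; a real rule might score this separately
        if shown.hostname != real.hostname:
            hits.append((href, text, "host mismatch"))
        elif shown.scheme == "https" and real.scheme == "http":
            hits.append((href, text, "scheme downgrade"))
    return hits


def url_like_rate(html_bodies):
    """Fraction of HTML bodies with at least one anchor whose text looks like a URL."""
    bodies = list(html_bodies)
    if not bodies:
        return 0.0
    matching = sum(
        1 for body in bodies
        if any(URL_LIKE.match(text) for _, text in collect_anchors(body))
    )
    return matching / len(bodies)
```

Running url_like_rate over the HTML parts of a ham corpus would give a rough answer to the question above about how often legitimate mail uses a URL as anchor text. Running suspicious_anchors over the same corpus would give a first estimate of the false-positive rate. The comparison here is an exact host match; limiting it to a partial match on the host portion, as suggested in the quoted message, would be a small change to the hostname test.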