Loren Wilton wrote:

>At a guess the table obfuscation stuff will have to be handled after table
>removal, assuming the rendered text ends up looking like the visible text.
>(I haven't checked to see if it does.) At that point I'd probably go with
>metas on number of different drugs or other key phrases, since probably the
>drug names aren't further munged.

Loren,

After removal of the HTML tags the text looks nothing like the words to be caught.
These spams break the words up into chunks of 1-4 characters and then interleave multiple words together. Jim's example looks like this with the HTML tags removed:

VA U AG C IS

Ll M Vl RA lAL

and many others.

What they are doing is interleaving two table rows. I've inserted a blank line above to delineate the two rows. If you rearrange each row you can read the following in a zig-zag fashion:

VA  U    AG    C   IS
  LI M VI  RA   1AL

And put it together:

VALIUM VIAGRA C1ALIS
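
To make the zig-zag reading concrete, here is a rough Python sketch of how the two rows could be stitched back together. It is only an illustration, not anything SpamAssassin does today. It assumes the two table rows have already been split into separate lists of fragments (which means looking at the table structure before the tags are thrown away), and it assumes the keyword list already contains the obfuscated spellings such as C1ALIS. The function names `match_zigzag` and `find_zigzag_words` are made up for this example.

```python
def match_zigzag(word, top, bottom, i, j):
    """Try to spell `word` by alternating fragments from top[i:] and bottom[j:].

    Returns the advanced (i, j) positions on success, or None if the word
    cannot be spelled from the current positions.
    """
    rows = (top, bottom)
    for first in (0, 1):            # a word may start in either row
        pos = [i, j]
        built = ""
        turn = first
        while len(built) < len(word):
            row = rows[turn]
            if pos[turn] >= len(row):
                break               # ran out of fragments in this row
            frag = row[pos[turn]]
            if not word.startswith(frag, len(built)):
                break               # next fragment does not continue the word
            built += frag
            pos[turn] += 1
            turn ^= 1               # switch to the other row (the zig-zag)
        if built == word:
            return pos[0], pos[1]
    return None


def find_zigzag_words(top, bottom, keywords):
    """Walk both rows left to right, collecting keywords spelled in zig-zag order."""
    i = j = 0
    found = []
    progress = True
    while progress:
        progress = False
        for kw in keywords:
            result = match_zigzag(kw, top, bottom, i, j)
            if result is not None:
                found.append(kw)
                i, j = result
                progress = True
                break
    return found


# The two rows from the example above, as fragment lists.
top_row = ["VA", "U", "AG", "C", "IS"]
bottom_row = ["LI", "M", "VI", "RA", "1AL"]
hits = find_zigzag_words(top_row, bottom_row, ["VALIUM", "VIAGRA", "C1ALIS"])
print(hits)
```

The count of distinct hits could then feed the kind of meta Loren suggested, but only if the rows are recovered before the HTML is stripped.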