On Thu, 2005-01-27 at 13:05 -0600, Damian Menscher wrote:
> Oh, ok. Apparently we have a different definition of plaintext. I
> generally take anything using only the lower 7 bits (ASCII table) to
> mean plaintext, and things that use the 8th bit to mean binary.
> Regardless of your definition of "plaintext", it would seem that my
> conclusion that phishing signatures that rely exclusively on 7-bit ascii
> are more likely to have a false positive than binary signatures that use
> the full 8 bits is correct.

Even with your definition of plaintext you are still wrong :-)

Why? Because the structure of language in plaintext files is much
richer than that used in the binaries of computer programs.

An aside: HTML is actually Universal Character Set (UCS), or to quote
the standard:

"The ASCII character set is not sufficient for a global information
system such as the Web, so HTML uses the much more complete character
set called the Universal Character Set (UCS), defined in [ISO10646].
This standard defines a repertoire of thousands of characters used by
communities all over the world."

and

"When HTML text is transmitted in UTF-16 (charset=UTF-16), text data
should be transmitted in network byte order ("big-endian", high-order
byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE],
clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is
recommended that documents transmitted as UTF-16 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called
Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal
FFFE, a character guaranteed never to be assigned. Thus, a user-agent
receiving a hexadecimal FFFE as the first bytes of a text would know
that bytes have to be reversed for the remainder of the text."

-trog
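
For anyone who wants to see the byte-order rule from the quoted passage in action, here is a minimal Python sketch. It is not taken from ClamAV or from the HTML standard; the function name decode_utf16_with_bom is made up for illustration, and falling back to big-endian when no BOM is present is an assumption based on the "network byte order" wording above.

```python
import codecs


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM (if any) to pick byte order.

    Hypothetical helper for illustration only.
    """
    # FE FF as the first two bytes: the text is already big-endian.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # FF FE as the first two bytes: the BOM arrived byte-reversed,
    # so the rest of the text must be read as little-endian.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: assume network byte order (big-endian), per the quoted text.
    return data.decode("utf-16-be")


if __name__ == "__main__":
    big = codecs.BOM_UTF16_BE + "<p>hi</p>".encode("utf-16-be")
    little = codecs.BOM_UTF16_LE + "<p>hi</p>".encode("utf-16-le")
    print(decode_utf16_with_bom(big))
    print(decode_utf16_with_bom(little))
```

Note that every ASCII character in such a stream is accompanied by a zero byte, so a scanner matching raw 7-bit byte patterns would have to handle this encoding separately.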
_______________________________________________
http://lists.clamav.net/cgi-bin/mailman/listinfo/clamav-users