In comp.lang.python, Chris Angelico <ros...@gmail.com> wrote:
> Eli the Bearded <*@eli.users.panix.com> wrote:
>> Read first N lines of a file. If all parse as valid UTF-8, consider
>> it text. That's probably the rough method file(1) and Perl's -T use.
>> (In particular allow no nulls. Maybe allow ISO-8859-1.)
> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> is checking for a lack of NUL bytes.

ISO-8859-1, unlike the similar Windows "charset"s, does not use octets
128-159. Charsets like Windows CP-1252 are nastier, because they do
use that range. Usage of octets 1-31 will be pretty restricted in
either; probably nothing more than tab, linefeed, and carriage return.

> I'd definitely recommend mandating UTF-8, as that's a very good way
> of recognizing valid text, but if you can't do that then the simple
> NUL check is all you really need.

Dealing with all UTF-8 is my preference, too.

> And let's be honest here, there aren't THAT many binary files that
> manage to contain a total of zero NULs, so you won't get many false
> hits :)

There's always the issue of how much to read before deciding.
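In Python terms the whole heuristic might come out something like
this. A rough, untested sketch: the name is_probably_text, the 8 KiB
sample size, and the lenient handling of a character split at the
cutoff are all my own choices, not anything file(1) or Perl's -T is
documented to do.

SAMPLE_SIZE = 8192   # how much to read before deciding

def is_probably_text(path, allow_latin1=False):
    with open(path, 'rb') as f:
        sample = f.read(SAMPLE_SIZE)

    if not sample:
        return True           # empty file: call it text
    if b'\x00' in sample:
        return False          # any NUL at all means binary

    try:
        sample.decode('utf-8')
        return True
    except UnicodeDecodeError as err:
        # A full-size sample may have sliced a multi-byte character in
        # half at the read boundary; an error confined to the last few
        # bytes doesn't prove the file isn't UTF-8.
        if len(sample) == SAMPLE_SIZE and err.start >= len(sample) - 3:
            return True

    if not allow_latin1:
        return False

    # ISO-8859-1 fallback: every octet maps to a character, so the only
    # real test is rejecting the control ranges: 1-31 apart from tab,
    # linefeed, and carriage return, plus 127-159 (the range CP-1252
    # prints in but ISO-8859-1 reserves for DEL and C1 controls).
    ok = set(b'\t\n\r') | set(range(32, 127)) | set(range(160, 256))
    return all(octet in ok for octet in sample)

The fixed SAMPLE_SIZE is a punt on the how-much-to-read question, and
the err.start check is what keeps a character straddling that cutoff
from condemning an otherwise clean file.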
Elijah
------
ASCII with embedded escapes? could be a VT100 animation