On May 11, 11:31 pm, norseman <norse...@hughes.net> wrote:
> Steve Howell wrote:
> > On May 11, 10:16 pm, norseman <norse...@hughes.net> wrote:
> >> Tim Arnold wrote:
> >>> Hi, I have some html files that I want to validate by using an external
> >>> script 'validate'. The html files need a doctype header attached before
> >>> validation. The files are in utf8 encoding. My code:
> >>> ---------------
> >>> import os,sys
> >>> import codecs,subprocess
> >>> HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
> >>> filename = 'mytest.html'
> >>> fd = codecs.open(filename,'rb',encoding='utf8')
> >>> s = HEADER + fd.read()
> >>> fd.close()
> >>> p = subprocess.Popen(['validate'],
> >>>                      stdin=subprocess.PIPE,
> >>>                      stdout=subprocess.PIPE,
> >>>                      stderr=subprocess.STDOUT)
> >>> validate = p.communicate(unicode(s,encoding='utf8'))
> >>> print validate
> >>> ---------------
> >>> I get lots of lines like this:
> >>> Error at line 1, character 66:\tillegal character number 0
> >>> etc etc.
> >>> But I can give the command in a terminal 'cat mytest.html | validate' and
> >>> get reasonable output. My subprocess code must be wrong, but I could use
> >>> some help to see what the problem is.
> >>> python2.5.1, freebsd6
> >>> thanks,
> >>> --Tim
> >> ============================
> >> If you search through the recent Python-List for UTF-8 things you might
> >> get the same understanding I have come to.
>
> >> the problem is the use of python's 'print' subcommand or what ever it
> >> is. It 'cooks' things and someone decided that it would only handle 1/2
> >> of a byte (in the x'00 to x'7f' range) and ignore or send error messages
> >> against anything else. I guess the person doing the deciding read the
> >> part that says ASCII printables are in the 7 bit range and chose to
> >> ignore the part about the rest of the byte being undefined. That is
> >> undefined, not disallowed. Means the high bit half can be used as
> >> wanted since it isn't already taken. Nor did whoever it was take a look
> >> around the computer world and realize the conflict that was going to be
> >> generated by using only 1/2 of a byte in a 1byte+ world.
>
> >> If you can modify your code to use read and write you can bypass print
> >> and be OK. Or just have python do the 'cat mytest.html | validate' for
> >> you. (Apply a var for html and let python accomplish the the equivalent
> >> of Unix's:
> >> for f in *.html; do cat $f | validate; done
> >> or
> >> for f in *.html; do validate $f; done #file name available this way
>
> >> If you still have problems, take a look at os.POPEN2 (and its popen3)
> >> Also take look at os.spawn.. et al
>
> > Wow. Unicode and subprocessing and printing can have dark corners,
> > but common sense does apply in MOST situations.
>
> > If you send the header, add the newline.
>
> > But you do not need the header if you can cat the input file sans
> > header and get sensible input.
>
> Yep! The problem is with 'print'
>
Huh? Print is printing exactly what you expect it to print.

--
http://mail.python.org/mailman/listinfo/python-list
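
For what it's worth, one plausible reading of the "illegal character number 0" errors is that a unicode object, rather than UTF-8 encoded bytes, was handed to the pipe. Here is a minimal sketch of the encode-then-pipe approach, with a newline after the header as suggested earlier in the thread. It is written for Python 3, not the 2.5.1 of the original post, and `run_with_header` and the `ECHO` stand-in command are hypothetical names of mine. `ECHO` only copies stdin to stdout so the sketch runs without the real 'validate' script.

```python
import subprocess
import sys

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'


def run_with_header(command, html_text, encoding="utf-8"):
    """Prepend the doctype and a newline, encode to bytes, pipe to command.

    Hypothetical helper, not taken from the thread. Returns the child's
    combined stdout/stderr decoded back to text.
    """
    # Encode explicitly so the child receives UTF-8 bytes, not a text object.
    payload = (HEADER + "\n" + html_text).encode(encoding)
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output, _ = proc.communicate(payload)
    return output.decode(encoding, errors="replace")


# Assumed stand-in for the external 'validate' script: echoes stdin to stdout.
ECHO = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

if __name__ == "__main__":
    print(run_with_header(ECHO, "<html><body>caf\u00e9</body></html>"))
```

With the real tool you would presumably pass `['validate']` as the command in place of `ECHO`.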