On Sat, 30 Apr 2005 11:35:05 +0100, Michael Hoffman <[EMAIL PROTECTED]> wrote:
>John Machin wrote:
>> Real-world data is not "text".
>
>A lot of real-world data is. For example, almost all of the data I deal with
>is text.

OK, depends on one's definitions of "data" and "text". In the domain of
commercial database applications, there is what's loosely called "text":
entity names, and addresses, and product descriptions, and the dreaded
free-text "note" columns -- all of which (not just the "notes") one can end
up parsing trying to extract extraneous data that's been dumped in there ...
sigh ...

>
>>>That's nice. Well I agree with you, if the OP is concerned about embedded
>>>CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
>>>then he *definitely* shouldn't use fileinput.
>>
>> And if the OP is naive enough not to be concerned, then it's OK, is
>> it?
>
>It simply isn't a problem in some real-world problem domains. And if there
>are control characters the OP didn't expect in the input, and csv loads it
>without complaint, I would say that he is likely to have other problems once
>he's processing it.

Presuming for the moment that the reason for csv not complaining is that
the data meets the csv non-spec and that the csv module is checking that:
then at least he's got his data in the structural format he's expecting; if
he doesn't do any/enough validation on the data, we can't save him from
that.

>
>> Except, perhaps, the reason stated in fileinput.py itself:
>>
>> """
>> Performance: this module is unfortunately one of the slower ways of
>> processing large numbers of input lines.
>> """
>
>Fair enough, although Python is full of useful things that save the
>programmer's time at the expense of that of the CPU, and this is
>frequently considered a Good Thing.
>
>Let me ask you this, are you simply opposed to something like fileinput
>in principle or is it only because of (1) no binary mode, and (2) poor
>performance? Because those are both things that could be fixed. I think
>fileinput is so useful that I'm willing to spend some time working on it
>when I have some.

I wouldn't use fileinput for a "commercial data processing" exercise,
because it's slow, and (if it involved using the Python csv module) it
opens the files in text mode, and because in such exercises I don't often
need to process multiple files as though they were one file.

When I am interested in multiple files -- more likely a script that scans
source files -- even though I wouldn't care about the speed or the binary
mode, I usually do something like:

    for pattern in args: # args from an optparse parser
        for filename in glob.glob(pattern):
            for line in open(filename):
                pass # process the line here

There is an "on principle" element to it as well -- with fileinput one has
to use the awkish methods like filelineno() and nextfile(); strikes me as a
tricksy and inverted way of doing things.

Cheers,
John
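
For comparison, here is a minimal sketch of the two styles discussed above, written for current Python 3 rather than the Python of this thread. The function names `iter_lines_explicit` and `iter_lines_fileinput` are made up for illustration; the only library calls used are `glob.glob`, `open`, `enumerate`, and the `fileinput.input` object's `filename()` and `filelineno()` methods. Both generators yield `(filename, line_number, line)` tuples, so the difference is only in where the bookkeeping lives.

```python
import fileinput
import glob


def iter_lines_explicit(patterns):
    """Nested-loop style: the per-file state is ordinary local variables."""
    for pattern in patterns:
        # sorted() only makes the order deterministic; glob itself
        # does not promise any particular order.
        for filename in sorted(glob.glob(pattern)):
            with open(filename) as handle:
                for lineno, line in enumerate(handle, start=1):
                    yield filename, lineno, line


def iter_lines_fileinput(filenames):
    """fileinput style: the per-file state is queried from the stream object.

    Assumes `filenames` is non-empty; with an empty list fileinput
    falls back to sys.argv[1:] or standard input.
    """
    with fileinput.input(files=filenames) as stream:
        for line in stream:
            yield stream.filename(), stream.filelineno(), line
```

In the explicit version, skipping the rest of a file is a plain `break` out of the innermost loop in the calling code's own copy of that loop, which is roughly what `nextfile()` has to provide as a method in the fileinput version.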