On 23 Feb 2007 07:31:35 -0800, "John Machin" <[EMAIL PROTECTED]> wrote:
>On Feb 23, 10:11 pm, David C. Ullrich <[EMAIL PROTECTED]>
>wrote:
>> Is there a csvlib out there somewhere?
>
>I can make available the following which should be capable of running
>on 1.5.2 -- unless they've suffered bitrot :-)
>
>(a) a csv.py which does simple line-at-a-time hard-coded-delimiter-etc
>pack and unpack i.e. very similar to your functionality *except* that
>it doesn't handle newline embedded in a field. You may in any case be
>interested to see a different way of writing this sort of thing: my
>unpack does extensive error checking; it uses a finite state machine
>so unexpected input in any state is automatically an error.

Actually a finite-state machine was the first thing I thought
of. Then while I was thinking about what states would be needed,
etc, it occurred to me that I could get something working _now_
by just noticing that (assuming valid input) a quoted field
would be terminated by '",' or '"[eos]'.

A finite-state machine seems like the "right" way to do it,
but there are plenty of other parts of the project where
doing it right is much more important - yes, in my experience
doing it "right" saves time in the long run, but that
finite-state machine would have taken more time _yesterday_.

>(b) an extension module (i.e. written in C) with the same API. The
>python version (a) imports and uses (b) if it exists.
>
>(c) an extension module which parameterises everything including the
>ability to handle embedded newlines.
>
>The two extension modules have never been compiled & tested on other
>than Windows but they both should IIRC be compilable with both gcc
>(MinGW) and the free Borland 5.5 compiler -- in other words vanilla C
>which should compile OK on Linux etc.
>
>If you are interested in any of the above, just e-mail me.

Keen.

>>
>> And/or does anyone see any problems with
>> the code below?
>>
>> What csvline does is straightforward: fields
>> is a list of strings.
>> csvline(fields) returns
>> the strings concatenated into one string
>> separated by commas. Except that if a field
>> contains a comma or a double quote then the
>> double quote is escaped to a pair of double
>> quotes and the field is enclosed in double
>> quotes.
>>
>> The part that seems somewhat hideous is
>> parsecsvline. The intention is that
>> parsecsvline(csvline(fields)) should be
>> the same as fields. Haven't attempted
>> to deal with parsecsvline(data) where
>> data is in an invalid format - in the
>> intended application data will always
>> be something that was returned by
>> csvline.
>
>"Always"? Famous last words :-)

Heh.

Otoh, having read about all the existing variations
in csv files, I don't think I'd attempt to write
something that parses csv provided from an
external source.

>> It seems right after some
>> testing... also seems blechitudinous.
>
>I agree that it's bletchworthy, but only mildly so. If it'll make you
>feel better, I can send you as a yardstick csv pack and unpack written
>in awk -- that's definitely *not* a thing of beauty and a joy
>forever :-)
>
>I presume that you don't write csvline() output to a file, using
>newline as a record terminator and then try to read them back and pull
>them apart with parsecsvline() -- such a tactic would of course blow
>up on the first embedded newline.

Indeed. Thanks - this is exactly the sort of problem I was hoping
people would point out (although in fact this one is irrelevant,
since I already realized this). In fact the fields will not
contain linefeeds (the data is coming from <INPUT type="text">
on an html form, which means that unless someone's _trying_
to cause trouble a linefeed is impossible, right? Regardless,
incoming data is filtered. Fields containing newlines are
quoted just to make the thing usable in other situations - I
wouldn't use parsecsvline without being very careful, but
there's no reason csvline shouldn't have general applicability.)
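
For readers following along without the snipped code, here is a
minimal sketch of the round trip described above. It is not the code
from the original post. It assumes a current Python (so it uses
string methods, which 1.5.2 lacks), it assumes the input to
parsecsvline was produced by csvline, and instead of the '",' or
'"[eos]' shortcut it scans quote by quote, treating a doubled quote
as an escaped one. Only the names csvline and parsecsvline come from
the post; everything else is illustrative.

```python
def csvline(fields):
    """Join a list of strings into one CSV record (no trailing newline)."""
    out = []
    for field in fields:
        # Quote the field if it contains a comma, a double quote or a newline.
        if ',' in field or '"' in field or '\n' in field:
            field = '"' + field.replace('"', '""') + '"'
        out.append(field)
    return ','.join(out)


def parsecsvline(data):
    """Invert csvline(). Assumes data is valid csvline() output."""
    fields = []
    pos = 0
    n = len(data)
    while True:
        if pos < n and data[pos] == '"':
            # Quoted field: collect text up to the closing quote,
            # turning each doubled quote back into a single one.
            pos += 1
            chunk = []
            while True:
                end = data.index('"', pos)  # ValueError on unterminated quote
                chunk.append(data[pos:end])
                if data[end + 1:end + 2] == '"':
                    chunk.append('"')
                    pos = end + 2
                else:
                    pos = end + 1
                    break
            fields.append(''.join(chunk))
        else:
            # Unquoted field: runs to the next comma or end of string.
            end = data.find(',', pos)
            if end == -1:
                end = n
            fields.append(data[pos:end])
            pos = end
        if pos >= n:
            break
        pos += 1  # skip the comma separating two fields
    return fields
```

One wrinkle the sketch does not resolve: csvline([]) and csvline([''])
both give the empty string, so an empty list cannot survive the round
trip (parsecsvline('') returns ['']).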
And in any case, no, I don't intend to be parsing multi-record csv
files. Although come to think of it one could modify the above to
do that without too much trouble, at least assuming valid input -
end-of-field followed by linefeed must be end-of-record, right?

>So as a matter of curiosity, where/
>how are you storing multiple csvline() outputs?

Since you ask: the project is to allow alumni to store contact
information on a web site, and then let office staff access the
information for various purposes. So each alumnus's data is
stored as a csvline in an anydbm "database" - when someone in
the office requests the information it's dumped into a csv file,
the idea being that the office staff opens that in Excel or
whatever.

(Why not simply provide a suitable interface to the data instead
of just giving them the csv file? So they can use the data in
ways I haven't anticipated. Why not give them access to a real
database? They know how to use Excel. I do think I'll provide a
few access thingies in addition to the csv file, for example an
automatic mass mailer...)

So why put csv data into an anydbm thing instead of using
shelve or something? Laughably or not, the reason is to
speed up what seems like the main bottleneck:

If I use my parsecsvline() that will be very slow. But that
doesn't matter, since that only happens once or twice a day on
one record, when an alumnus logs in and edits his contact
information. But when the office requests the data we run
through the entire database - if we store the data as csv then
we don't have any conversion to do at that point, we just write
the raw data in the database to a file. Should be much quicker
than converting something else to csv at that point.

(So why not just store the data in a csv file? Random access.)

Since you asked, if you had any comments on what's silly about
the general plan there by all means say so.

Hmm. Why not use one of the many Python web tools out there?

(i) Doing it myself is more interesting.
I'm not getting paid for this.

(ii) If I do it myself it's going to be easier for me to be
certain I know exactly where user input is at all times.

The boss wanted me to use php because Python was going to be
too hard for someone else to read. That's nonsense, of course.
Anyway, he gave me a book on php security. The book raised a
lot of issues that I wouldn't have thought of, but it also
convinced me I wouldn't want to use php - all through the book
we're warned that php will do this or that bad thing if you're
not careful. Don't want to have to learn all the things you
need not to do with whatever tool I use.

Here, the only write access to the database is through an Alum
object; Alum objects filter their data on creation, and they're
read-only (via the magic of __setattr__), so a maintainer would
have to _try_ if he wanted to insert unfiltered data - wouldn't
be hard to do, but he can't do it by accident. And the only html
output is through PostHTML, which filters everything through
cgi.escape(). In particular print statements raise exceptions
(via sys.stdout = PrintExploder().) Again, a maintainer could
easily write to sys.__stdout__ to get around this, but that's
not going to happen by accident.

Altogether seems much cleaner than the php stuff I saw in that
book - the way he does things you need to be careful every time
you do something, with the current setup I only need to be
careful twice, in Alum.__init__ and in PostHTML.

Could be I'm being arrogant putting more trust in a setup like
that instead of some well-known Python web thingie. But I don't
see anyplace things can leak out, and using someone else's thing
I'd either have to just believe them or read a lot of code.

That'll teach you to express curiosity about something I'm
doing. Been thinking about all this for a few weeks, you asked a
question and the fingers started typing.

>>
>> (Um: Believe it or not I'm _still_ using
>> python 1.5.7.
>> So comments about iterators,
>> list comprehensions, string methods, etc
>> are irrelevant. Comments about errors in
>> the algorithm would be great. Thanks.)
>
>1.5.7 ?

Well I _said_ you wouldn't believe it...

>[big snip]
>
>Cheers,
>John

************************

David C. Ullrich