On Dec 31 2008, 9:56 am, John Machin <sjmac...@lexicon.net> wrote: > On Dec 31 2008, 4:02 pm, brooklineTom <brookline...@gmail.com> wrote: > > > andyh...@gmail.com wrote: > > > Hi, > > > > Can anybody recommend an approach for loading and parsing Excel > > > spreadsheets in Python. Any well known/recommended libraries for this? > > > > The only thing I found in a brief search > > > washttp://www.lexicon.net/sjmachin/xlrd.htm, > > > but I'd rather get some more input before going with something I don't > > > know. > > > > Thanks, > > > Andy. > > > I save the spreadsheets (in Excel) in xml format. > > Which means that you need to be on a Windows box with a licensed copy > of Excel. I presume you talking about using Excel 2003 and saving as > "XML Spreadsheet (*.xml)". Do you save the files manually, or using a > COM script? What is the largest xls file that you've saved as xml, how > big was the xml file, and how long did it take to parse the xml file? > Do you extract formatting information or just cell contents?
1. The OP requested Excel files, by construction those must be generated with a licensed copy of Excel. I did the actual processing on both linux and windoze platforms. 2. Yes, I used Excel 2003. I haven't looked at later versions. 3. The largest file I used was about 228M, containing 36,393 hotel properties from Commission Junction. Each entry had 113 cells. The parsing overhead was minimal (on a per-entry basis) -- that's why I choose to use a pull-parser. 4. I extracted primarily cell contents, though I did some very limited format handling (looking for non-text fields and such). > > I started with the > > standard xml tools (xml.dom and xml.dom.minidom). I built a > > pullparser, and then just crack them. The MS format is tedious and > > overly complex (like all MS stuff), but straightforward. > > What do you think of the xml spat out by Excel 2007's (default) xlsx > format? I haven't looked at this. > > Once I've > > cracked them into their component parts (headers, rows, cells, etc), > > then I walk through them doing whatever I want. > > > I found this material to be no worse than doing similar crud with > > xhtml. I know there are various python packages around that do it, but > > I found the learning curve of those packages to be steeper than just > > grokking the spreadsheet structure itself. > > I'm curious to know which are the "various python packages" with the > so steep learning curves, and what the steep bits were. I looked, briefly, at xlrd. I found and scanned a few alternatives, though I don't remember what the others were. I needed something I could incorporate into my own application framework, and I knew I didn't need most of the formatting information. I'm not in any way criticizing xlrd, it's simply that, based on its API summary, it seems focused on problems I didn't have to solve. I knew that I needed only a small subset of the xlrd behavior, and I concluded (perhaps incorrectly) that it would be easier to roll my own parser than find, extract, and then port (to my own framework) the corresponding parts of xlrd. I needed to extract the content of each row, cell by cell, and build data objects (in my framework) with the content of various cells. I also needed to build an "exception file" containing malformed entries that I could re-open with Excel after my code finished, so that the bogus entries could be manually corrected. What I mean by "malformed" entry is, for example, an address field that fails to correctly geocode or comment fields with confused utf8/unicode contents. My focus was on data content, as opposed to presentation. I needed to crack the cells into things like "string", "boolean", "float", and so on. Most importantly, I needed to do this one entry at a time -- I did *not* want to load the entire spreadsheet at once. I'm not saying that this couldn't be done with xlrd; only that I chose to roll my own and had minimal difficulty doing so. I hope this helps! -- http://mail.python.org/mailman/listinfo/python-list