On 2010-12-26, Stefan Behnel <stefan...@behnel.de> wrote:
> Tim Harig, 26.12.2010 02:05:
>> On 2010-12-25, Nobody <nob...@nowhere.com> wrote:
>>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>>> Of course, one advantage of XML is that with so much redundant text, it
>>>> compresses well.  We typically see gzip compression ratios of 20:1.
>>>> But, that just means you can archive them efficiently; you can't do
>>>> anything useful until you unzip them.
>>>
>>> XML is typically processed sequentially, so you don't need to create a
>>> decompressed copy of the file before you start processing it.
>>
>> Sometimes XML is processed sequentially.  When the markup footprint is
>> large enough it must be.  Quite often, as in the case of the OP, you only
>> want to extract a small piece out of the total data.  In those cases,
>> being forced to read all of the data sequentially is both inconvenient
>> and a performance penalty unless there is some way to address the data
>> you want directly.
>
> So what? If you only have to do that once, it doesn't matter if you have
> to read the whole file or just a part of it. Should make a difference of
> a couple of minutes.
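For what it's worth, the "process it sequentially without a decompressed copy" point above is easy to demonstrate with the stdlib: iterparse() accepts any file-like object, so it can read straight from a gzip stream. This is just a minimal sketch; the <record> tag name and the in-memory data are made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Build a small gzipped XML document in memory so the example is
# self-contained; in practice this would be a .xml.gz file on disk.
xml_bytes = b"<records><record id='1'/><record id='2'/></records>"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(xml_bytes)
buf.seek(0)

# iterparse() pulls from the gzip stream incrementally, so the file is
# decompressed on the fly and never fully expanded anywhere.
ids = []
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "record":
            ids.append(elem.get("id"))
            elem.clear()  # release the element to keep memory flat

print(ids)  # -> ['1', '2']
```

Of course, as noted above, this still reads everything up to the data you want -- streaming saves memory, not seek time.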
Much agreed.  I assume that the process needs to be repeated, or it would
probably be simpler just to rip out what I wanted using regular expressions
with shell utilities.

> If you do it a lot, you will have to find a way to make the access
> efficient for your specific use case. So the file format doesn't matter
> either, because the data will most likely end up in a fast data base after
> reading it in sequentially *once*, just as in the case above.

If the data is just going to end up in a database anyway, then why not send
it as a database to begin with and save the trouble of having to convert it?
-- 
http://mail.python.org/mailman/listinfo/python-list
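For completeness, the "read it in sequentially *once*, then query" pattern from the quoted text is only a few lines with the stdlib. The element names and schema here are invented for the sketch, not taken from the OP's data.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be parsed from a file,
# e.g. with ET.iterparse() to keep the single pass memory-friendly.
xml_data = """
<books>
  <book isbn="111"><title>A</title><price>10.0</price></book>
  <book isbn="222"><title>B</title><price>12.5</price></book>
</books>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT, price REAL)")

# One sequential pass over the XML to populate the table...
for book in ET.fromstring(xml_data).iter("book"):
    conn.execute(
        "INSERT INTO book VALUES (?, ?, ?)",
        (book.get("isbn"), book.findtext("title"), float(book.findtext("price"))),
    )

# ...after which every lookup is a direct, indexed query instead of a rescan.
row = conn.execute("SELECT title FROM book WHERE isbn = ?", ("222",)).fetchone()
print(row[0])  # -> B
```

Which, again, begs the question: if the sender already has the data in relational form, shipping the SQLite file itself would skip this step entirely.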