On Jul 16, 4:50 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> Diez B. Roggisch wrote:
> > John Nagle wrote:
> >> I'm reading the PhishTank XML file of active phishing sites,
> >> at "http://data.phishtank.com/data/online-valid/" This changes
> >> frequently, and it's big (about 10MB right now) and on a busy server.
> >> So once in a while I get a bogus copy of the file because the file
> >> was rewritten while being sent by the server.
> >
> >> Any good way to deal with this, short of reading it twice
> >> and comparing?
> >
> > Apart from that - the only thing you could try is to apply a SAX parser
> > on the input stream immediately, so that at least if the XML is non-valid
> > because of the way they serve it you get to that ASAP.
>
> Sure, if you want to use lxml.etree, you can pass the URL right into
> etree.parse() and it will throw an exception if parsing from the URL fails to
> yield a well-formed document.
>
> http://codespeak.net/lxml/
> http://codespeak.net/lxml/dev/parsing.html
>
> BTW, parsing and serialising it back to a string is most likely dominated by
> the time it takes to transfer the document over the network, so it will not be
> much slower than reading it using urlopen() and the like.
>
> Stefan
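
For what it's worth, the SAX idea quoted above can be sketched with nothing but the standard library. This is only a rough illustration, not anyone's actual code from this thread: the function name is_well_formed and the chunk_size parameter are made up for the example, and it only catches damage that breaks well-formedness (a truncated or spliced file), not a file that is well-formed but stale.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(stream, chunk_size=64 * 1024):
    """Feed a binary stream to a SAX parser chunk by chunk.

    Returns False as soon as the parser reports an error, True if the
    whole stream parses. (Hypothetical helper, named for this example.)
    """
    parser = xml.sax.make_parser()
    # A do-nothing handler: we only care whether the parser complains.
    parser.setContentHandler(ContentHandler())
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)  # raises on the first malformed chunk
        parser.close()  # a document that just stops early is reported here
    except xml.sax.SAXParseException:
        return False
    return True


# Stand-ins for a complete download and one that was cut off mid-transfer.
complete = io.BytesIO(b"<root><entry>1</entry></root>")
truncated = io.BytesIO(b"<root><entry>1</ent")
print(is_well_formed(complete), is_well_formed(truncated))
```

Any object with a read() method that returns bytes should work as the stream, so in principle the response object from urlopen() could be passed in directly and the download abandoned at the first error.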

xml.etree.ElementTree is in the standard lib now, too. So is xml.etree.cElementTree, which has the same interface but is blindingly fast.

(I'm working on a program which needs to read and recreate the (badly designed, horrible, evil) iTunes Library XML. Mine is about 10 MB, and cElementTree parses it in under a second using about 60 MB of RAM, whereas minidom takes something like two minutes and 600+ MB to do the same thing.)

(I mean really -- the playlists are stored as five megs of lists whose elements are dictionaries of one element, all looking exactly like this:

<dict>
<key>Track ID</key><integer>4521</integer>
</dict>

--- </rant>)

--
<weaver>star</weaver>