Diez B. Roggisch wrote:
> John Nagle wrote:
>> I'm reading the PhishTank XML file of active phishing sites,
>> at "http://data.phishtank.com/data/online-valid/". This changes
>> frequently, and it's big (about 10MB right now) and on a busy server.
>> So once in a while I get a bogus copy of the file because the file
>> was rewritten while being sent by the server.
>>
>> Any good way to deal with this, short of reading it twice
>> and comparing?
>
> Apart from that - the only thing you could try is to apply a SAX parser
> on the input stream immediately, so that at least if the XML is non-valid
> because of the way they serve it you get to that ASAP.

Sure, if you want to use lxml.etree, you can pass the URL right into
etree.parse() and it will throw an exception if parsing from the URL
fails to yield a well-formed document.

http://codespeak.net/lxml/
http://codespeak.net/lxml/dev/parsing.html

BTW, the time spent parsing the document and serialising it back to a
string is most likely dominated by the time it takes to transfer the
document over the network, so it will not be much slower than reading
it using urlopen() and the like.

Stefan
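
A minimal sketch of the suggestion above, in case it helps. The helper
name fetch_tree() is made up for illustration. With lxml installed, the
URL string could be handed straight to etree.parse() as described; the
sketch goes through urlopen() instead only so that it also runs with the
standard library's ElementTree as a fallback. It checks well-formedness
only, not whether the content is otherwise complete.

```python
from urllib.request import urlopen

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib module, which offers a
    # compatible parse() and ParseError for this purpose.
    import xml.etree.ElementTree as etree

FEED_URL = "http://data.phishtank.com/data/online-valid/"


def fetch_tree(source=FEED_URL):
    """Parse `source` (a URL string or a file-like object).

    Returns the parsed tree, or None if the data is not well-formed,
    e.g. because the file was rewritten while it was being sent.
    Network errors from urlopen() are not caught here.
    """
    try:
        if isinstance(source, str):
            # Parse directly from the response stream while it downloads.
            with urlopen(source) as response:
                return etree.parse(response)
        return etree.parse(source)
    except etree.ParseError:
        # Raised as soon as the parser sees malformed or truncated XML.
        return None
```

A caller could simply retry when fetch_tree() returns None, rather than
downloading the file twice and comparing the copies.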