On Jul 16, 1:00 am, John Nagle <[EMAIL PROTECTED]> wrote:
> I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/" This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
>
> Any good way to deal with this, short of reading it twice
> and comparing?
>
> John Nagle

Sounds like that's the host's problem: they should be using atomic writes, which is usually done by renaming the new file on top of the old one.

How "bogus" are the bad files? If a file is just incomplete, then since it's XML, it'll be missing the closing "</output>" tag, and you should get a parse error if you're using a suitably strict parser. If it's mixed old and new data that still manages to be well-formed XML, then yes, you'll probably have to read it twice.

-Miles
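
For what it's worth, here is a rough Python sketch of the three ideas above: the rename-based atomic write the host could be doing, the well-formedness check, and the read-twice fallback. The function names (atomic_write, is_well_formed, fetch_stable) and the max_attempts parameter are made up for illustration, and fetch is assumed to be any callable that returns the file's bytes. Treat it as a starting point under those assumptions, not a tested solution for the PhishTank feed.

```python
import hashlib
import os
import tempfile
import xml.etree.ElementTree as ET


def atomic_write(path, data):
    """Server side: write to a temp file, then rename it over the old file.

    Readers see either the complete old file or the complete new one.
    The temp file lives in the target directory so the rename stays on
    one filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def is_well_formed(data):
    """Client side: a truncated copy fails to parse (missing closing tag)."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


def fetch_stable(fetch, max_attempts=4):
    """Call fetch() until two consecutive well-formed copies are identical.

    `fetch` is a hypothetical zero-argument callable returning bytes.
    """
    previous_digest = None
    for _ in range(max_attempts):
        data = fetch()
        if not is_well_formed(data):
            previous_digest = None  # discard and start the comparison over
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest == previous_digest:
            return data
        previous_digest = digest
    raise RuntimeError("no stable copy after %d attempts" % max_attempts)
```

On the client you might pass something like lambda: urllib.request.urlopen(url).read() as fetch. Two identical reads don't strictly prove the copy is good, and since the file changes frequently, two clean reads can legitimately differ, so expect the occasional extra download.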