On Tue, 2007-07-17 at 00:47 +0000, John Nagle wrote:
> Miles wrote:
> > On Jul 16, 1:00 am, John Nagle <[EMAIL PROTECTED]> wrote:
> >
> >> I'm reading the PhishTank XML file of active phishing sites,
> >> at "http://data.phishtank.com/data/online-valid/". This changes
> >> frequently, and it's big (about 10MB right now) and on a busy server.
> >> So once in a while I get a bogus copy of the file because the file
> >> was rewritten while being sent by the server.
> >>
> >> Any good way to deal with this, short of reading it twice
> >> and comparing?
> >>
> >> John Nagle
> >
> > Sounds like that's the host's problem--they should be using atomic
> > writes, which is usually done by renaming the new file on top of the
> > old one. How "bogus" are the bad files? If it's just incomplete,
> > then since it's XML, it'll be missing the "</output>" and you should
> > get a parse error if you're using a suitably strict parser. If it's
> > mixed old data and new data, but still manages to be well-formed XML,
> > then yes, you'll probably have to read it twice.
>
> The files don't change much from update to update; typically they
> contain about 10,000 entries, and about 5-10 change every hour. So
> the odds of getting a seemingly valid XML file with incorrect data
> are reasonably good.
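For what it's worth, the "read it twice and compare" fallback can stay
fairly small. The sketch below is only illustrative (the helper names,
the md5 comparison, and the cElementTree parse are my own choices, and
it assumes the same Python 2-era urllib used elsewhere in this thread):
it re-downloads until two consecutive copies hash the same, then checks
that the result is well-formed XML so a truncated copy is rejected.

import hashlib
import urllib
import xml.etree.cElementTree as ET

the_url = "http://data.phishtank.com/data/online-valid/"

def fetch(url):
    # One complete read of the document body.
    u = urllib.urlopen(url)
    try:
        return u.read()
    finally:
        u.close()

def get_stable_copy(url, max_tries=5):
    # Download repeatedly until two consecutive copies are identical
    # (compared by digest), then make sure the stable copy parses as
    # well-formed XML so a truncated download is rejected.
    prev_digest = None
    for attempt in range(max_tries):
        data = fetch(url)
        digest = hashlib.md5(data).hexdigest()
        if digest == prev_digest:
            ET.fromstring(data)  # raises if the XML is malformed
            return data
        prev_digest = digest
    raise IOError("no stable copy of %s after %d tries" % (url, max_tries))

Comparing digests rather than the raw strings just keeps the loop tidy;
the parse step catches the truncation case Miles describes, but not a
copy that mixes old and new rows yet still parses, which is where a
timestamp check helps.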
Does the server return a reliable last-modified timestamp? If yes, you
can do something like this:

import urllib

prev_last_mod = None
while True:
    u = urllib.urlopen(theUrl)
    if prev_last_mod == u.headers['last-modified']:
        # The timestamp didn't change since the last full read,
        # so `contents` is a consistent snapshot.
        u.close()
        break
    prev_last_mod = u.headers['last-modified']
    contents = u.read()
    u.close()

That way, you only have to re-read the file if it actually changed
according to the timestamp, rather than having to re-read in any case
just to check whether it changed.

HTH,

--
Carsten Haese
http://informixdb.sourceforge.net
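If the server honours conditional requests, the same timestamp can also
be handed back to the server instead of compared locally. This is just
a sketch (it assumes Python 2's urllib2 and that the feed actually
sends Last-Modified and answers 304 Not Modified, which I haven't
verified for PhishTank): send If-Modified-Since and skip the 10MB
download entirely when nothing changed.

import urllib2

the_url = "http://data.phishtank.com/data/online-valid/"
prev_last_mod = None   # remembered from the last successful fetch
contents = None

req = urllib2.Request(the_url)
if prev_last_mod is not None:
    # Ask the server to skip the body if nothing changed since then.
    req.add_header('If-Modified-Since', prev_last_mod)
try:
    u = urllib2.urlopen(req)
    contents = u.read()
    prev_last_mod = u.headers.get('last-modified')
    u.close()
except urllib2.HTTPError, e:
    if e.code != 304:
        raise
    # 304 Not Modified: the copy we already have is still current.

That only saves bandwidth, though; it doesn't guarantee that the copy
you do download is internally consistent, so the re-read and parse
checks above are still worth keeping.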