John Nagle wrote:
> Miles wrote:
>> On Jul 16, 1:00 am, John Nagle <[EMAIL PROTECTED]> wrote:
>>
>>> I'm reading the PhishTank XML file of active phishing sites,
>>> at "http://data.phishtank.com/data/online-valid/"  This changes
>>> frequently, and it's big (about 10MB right now) and on a busy server.
>>> So once in a while I get a bogus copy of the file because the file
>>> was rewritten while being sent by the server.
>>>
>>> Any good way to deal with this, short of reading it twice
>>> and comparing?
>>>
>>>                 John Nagle
>>
>> Sounds like that's the host's problem--they should be using atomic
>> writes, which is usually done by renaming the new file on top of the
>> old one.  How "bogus" are the bad files?  If it's just incomplete,
>> then since it's XML, it'll be missing the "</output>" and you should
>> get a parse error if you're using a suitably strict parser.  If it's
>> mixed old data and new data, but still manages to be well-formed XML,
>> then yes, you'll probably have to read it twice.
>
> The files don't change much from update to update; typically they
> contain about 10,000 entries, and about 5-10 change every hour.  So
> the odds of getting a seemingly valid XML file with incorrect data
> are reasonably good.
>
I'm still left wondering what the hell kind of server process will start
serving one copy of a file and complete the request from another. Oh, well.
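
For what it's worth, the two ideas quoted above (renaming a new file on
top of the old one, and reading twice and comparing) can be sketched
roughly as below. This is only an illustration under assumptions: the
names write_atomically and fetch_consistent are made up here, fetch
stands for whatever callable downloads the file and returns its bytes,
and nothing is known about how PhishTank actually writes its file.

```python
import hashlib
import os
import tempfile
import xml.etree.ElementTree as ET


def write_atomically(path, data):
    """Server side (hypothetical helper): write a temp file, then rename it
    over the old one, so a reader opens either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on
    # one filesystem, which is what makes it atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def fetch_consistent(fetch, max_attempts=5):
    """Client side (hypothetical helper): call fetch() until two consecutive
    downloads are well-formed XML and byte-for-byte identical."""
    previous = None
    for _ in range(max_attempts):
        data = fetch()
        try:
            ET.fromstring(data)  # rejects truncated or malformed copies
        except ET.ParseError:
            previous = None      # a bad copy never counts as a match
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest == previous:
            return data
        previous = digest
    raise RuntimeError("no two consecutive downloads matched")
```

The parse check only catches the truncated case; a copy that mixes old
and new data but is still well-formed gets past it, which is why the
comparison is there as well. The cost is at least two full downloads of
a 10MB file from a busy server each time.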
regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden