Re: Fetching a clean copy of a changing web page

Miles Sun, 15 Jul 2007 23:13:25 -0700

On Jul 16, 1:00 am, John Nagle <[EMAIL PROTECTED]> wrote:
>     I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/".  This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
>
>     Any good way to deal with this, short of reading it twice
> and comparing?
>
>                                 John Nagle


Sounds like that's the host's problem--they should be using atomic
writes, which is usually done by renaming the new file on top of the
old one.  How "bogus" are the bad files?  If they're just incomplete,
then since it's XML, they'll be missing the closing "</output>" tag and
you should get a parse error if you're using a suitably strict parser.
If a file is mixed old and new data, but still manages to be
well-formed XML, then yes, you'll probably have to read it twice.
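
Here's a rough sketch of all three ideas, in case it helps.  The
function names (atomic_write, is_well_formed, fetch_clean) are made up
for illustration, and "fetch" is assumed to be any zero-argument
callable that returns the downloaded bytes.  I'm also assuming a
Python with os.replace; on older versions os.rename does the same job
on POSIX.  Treat it as a starting point, not a tested solution for
PhishTank specifically.

```python
import hashlib
import os
import tempfile
import xml.etree.ElementTree as ET


def atomic_write(path, data):
    """Host side: write a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename is not atomic.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def is_well_formed(data):
    """Client side: a truncated XML document fails to parse."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def fetch_clean(fetch, max_tries=5):
    """Fetch until two consecutive well-formed copies are identical."""
    previous = None
    for _ in range(max_tries):
        data = fetch()
        if not is_well_formed(data):
            previous = None  # bad copy: start the comparison over
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest == previous:
            return data
        previous = digest
    raise RuntimeError("no stable, well-formed copy in %d tries" % max_tries)
```

With urllib you could pass something like
lambda: urllib.request.urlopen(url).read() as "fetch".  Comparing
digests instead of the raw bytes means you only hold one 10MB copy in
memory at a time.  Note that two identical fetches only make a mixed
file unlikely; they can't rule it out.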

-Miles

-- 
http://mail.python.org/mailman/listinfo/python-list
