Again, this is more for a user@ list.... Sorry.

I want to confirm I understand refetching correctly.

When the crawler goes to refetch a page, it adds the If-Modified-Since
and the If-None-Match (if an etag exists) headers. If the host
respects those, it will return a 200 and new content if something has
changed; otherwise it will return a non-200 (typically a 304 Not Modified).
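
To make the mechanics concrete, here is a rough sketch of how I picture the conditional request. This is not the crawler's actual code; the function names `conditional_headers` and `interpret_status` and their parameters are made up for illustration.

```python
from email.utils import formatdate


def conditional_headers(last_fetch_epoch, etag=None):
    """Build conditional headers for a refetch (hypothetical helper).

    last_fetch_epoch: time of the previous fetch, in seconds since the epoch.
    etag: the ETag value stored from the previous response, if there was one.
    """
    # HTTP dates must be expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    headers = {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}
    if etag:
        headers["If-None-Match"] = etag
    return headers


def interpret_status(status):
    """Classify the response to a conditional refetch (hypothetical helper)."""
    if status == 304:
        return "not modified"
    if status == 200:
        # Either the page changed, or the host ignored the conditional headers.
        return "content returned"
    return "other"
```

The point of the second function is that a 200 alone cannot tell us which of the two cases we are in.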

If the host doesn't respect those headers and returns exactly the same
bytes as were originally fetched with a 200, that content is returned
and written to a bolt.

In short, if we're writing to WARCs, and we refetch a page that
returns a 200 and the contents are the same as we originally fetched,
we'll have two copies of the same content?
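
For what it's worth, the only way I can see to catch that case is to compare a digest of the new bytes against one stored from the first fetch. A minimal sketch, assuming a SHA-256 digest was kept alongside the URL; `is_duplicate` and its parameters are hypothetical, and I don't know whether the crawler does anything like this:

```python
import hashlib


def is_duplicate(previous_digest, status, body):
    """Return True when a refetch came back 200 with unchanged bytes.

    previous_digest: hex SHA-256 of the originally fetched content (assumed stored).
    status: HTTP status code of the refetch.
    body: raw bytes returned by the refetch.
    """
    if status != 200:
        # A 304 or an error carries no new content to compare.
        return False
    return hashlib.sha256(body).hexdigest() == previous_digest
```

Without a check along these lines, I'd expect both responses to be written out in full.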

Thank you!

Best,

         Tim
