I don't think there is any additional logic inside the WARC processing.
So if you re-fetch a page and get the old content back, you will end up
with an additional copy.
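
If you wanted to avoid that, the extra logic would have to compare the
payload with what was stored before. A minimal sketch in Python, assuming
you keep a digest per URL somewhere; the class and method names here are
hypothetical and not part of the crawler's API:

```python
import hashlib


class DigestDeduplicator:
    """Hypothetical helper: remembers the digest of the last payload stored per URL."""

    def __init__(self):
        # Maps URL -> hex digest of the most recently stored payload.
        self._last_digest = {}

    def should_store(self, url: str, payload: bytes) -> bool:
        """Return True if the payload differs from the last one stored for this URL."""
        digest = hashlib.sha1(payload).hexdigest()
        if self._last_digest.get(url) == digest:
            # Same bytes as last time: skip writing a second full copy.
            return False
        self._last_digest[url] = digest
        return True
```

The WARC format itself defines a "revisit" record type for this case, so
a writer could also emit a short revisit record instead of skipping the
page entirely. Whether that is worth it depends on how you use the WARCs.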

(Disclaimer: I am not using WARC myself.)

> On 07.02.2025 at 12:57, Tim Allison <talli...@apache.org> wrote:
> 
> Again, this is more for a user@ list.... Sorry.
> 
> I want to confirm I understand refetching correctly.
> 
> When the crawler goes to refetch a page, it adds the If-Modified-Since
> and the If-None-Match (if an etag exists) headers. If the host
> respects those, it will return a 200 and new content if something has
> changed, otherwise it will return a non-200.
> 
> If the host doesn't respect those headers and returns exactly the same
> bytes as were originally fetched with a 200, that content is returned
> and written to a bolt.
> 
> In short, if we're writing to warcs, and we refetch a page that
> returns a 200 and the contents are the same as we originally fetched,
> we'll have two copies of the same content?
> 
> Thank you!
> 
> Best,
> 
>         Tim
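
For reference, the conditional refetch described in the quoted message
can be sketched roughly as follows. This is only an illustration of the
HTTP mechanism, not the crawler's actual code; the function names are
made up. A server that honours the headers normally answers 304 (Not
Modified) with no body when nothing has changed.

```python
from typing import Dict, Optional


def conditional_headers(last_modified: Optional[str], etag: Optional[str]) -> Dict[str, str]:
    """Build the validator headers for a refetch from what the previous fetch recorded."""
    headers = {}
    if last_modified:
        # HTTP date string taken from the earlier response's Last-Modified header.
        headers["If-Modified-Since"] = last_modified
    if etag:
        # Only sent when the earlier response carried an ETag.
        headers["If-None-Match"] = etag
    return headers


def is_unchanged(status: int) -> bool:
    """304 Not Modified means the server says the stored copy is still current."""
    return status == 304
```

A server that ignores these headers simply answers 200 with the full
body, which is the case where a second identical copy ends up stored.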
