I don't think there is any additional logic inside the WARC processing. So if you re-fetch a page and get the old content back, you will end up with an additional copy.
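If the duplicates matter, one option might be to compare a digest of the fetched bytes with the digest from the previous fetch before writing anything. A minimal sketch in plain Python, not actual crawler code: `content_digest`, `should_store` and the in-memory `seen` dict are hypothetical names, and in a real crawl the previous digest would have to be kept somewhere persistent rather than in memory.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return a hex SHA-256 digest of the fetched bytes."""
    return hashlib.sha256(body).hexdigest()


def should_store(url: str, body: bytes, seen: dict) -> bool:
    """Return True if the body differs from what was last stored for this URL.

    `seen` maps URL -> digest of the last stored body (hypothetical,
    in-memory stand-in for a persistent store).
    """
    digest = content_digest(body)
    if seen.get(url) == digest:
        # Same bytes as last time: skip writing a second copy.
        return False
    seen[url] = digest
    return True


seen = {}
first = should_store("https://example.org/", b"<html>same</html>", seen)
again = should_store("https://example.org/", b"<html>same</html>", seen)
changed = should_store("https://example.org/", b"<html>new</html>", seen)
```

Here the first fetch and the changed content would be stored, while the identical refetch would be skipped.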
(Disclaimer: not using WARC)

> On 07.02.2025 at 12:57, Tim Allison <talli...@apache.org> wrote:
>
> Again, this is more for a user@ list.... Sorry.
>
> I want to confirm I understand refetching correctly.
>
> When the crawler goes to refetch a page, it adds the If-Modified-Since
> and the If-None-Match (if an etag exists) headers. If the host
> respects those, it will return a 200 and new content if something has
> changed, otherwise it will return a non-200.
>
> If the host doesn't respect those headers and returns exactly the same
> bytes as were originally fetched with a 200, that content is returned
> and written to a bolt.
>
> In short, if we're writing to warcs, and we refetch a page that
> returns a 200 and the contents are the same as we originally fetched,
> we'll have two copies of the same content?
>
> Thank you!
>
> Best,
>
> Tim