Re: [CODE4LIB] [EXTERNAL] Re: [CODE4LIB] WARC --> static HTML?

Demian Katz Thu, 05 Mar 2020 06:01:24 -0800

Thank you, Stuart, and to everyone else who answered both on- and off-list. I 
now have a few different ideas I can try! It may take me a little while to find 
time to try them all, but I'll report back with a solution once I've found 
something that meets my needs, in case it's helpful to others in future. I 
greatly appreciate all of your support. 😊

- Demian

-----Original Message-----
From: Code for Libraries <[email protected]> On Behalf Of Stuart A. Yeates
Sent: Wednesday, March 4, 2020 4:36 PM
To: [email protected]
Subject: [EXTERNAL] Re: [CODE4LIB] WARC --> static HTML?

WARC is not an access format.

WARC is entirely optimised for crawling, and it is the gold standard for 
archiving because it's close to the 'on the wire' web experience.

BUT

There is no file index: you access every file using a linear search from the 
start of the archive.
There is no guarantee that related files are stored together: an HTML page and 
its CSS, images and embedded streaming video may be scattered across the 
archive.
There is no guarantee that related pages are stored together.

If you're using WARC for access, you need something that overcomes these 
limitations, and the obvious choice is CDX indexes. For an explanation of how 
CDX files index WARC files, see the diagram on
https://support.archive-it.org/hc/en-us/articles/115001790023-Access-Archive-It-s-Wayback-index-with-the-CDX-C-API
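
To illustrate the idea behind such an index (this is not the real CDX format, 
which carries more fields), here is a minimal Python sketch. It assumes an 
uncompressed WARC, whereas real archives are usually gzipped per record, and 
the function names build_index and read_record are made up for this example. 
It scans the file once, remembers where each response record starts, and then 
seeks straight to a record instead of searching linearly.

```python
import io


def build_index(stream):
    """Scan an uncompressed WARC once; map target URI -> (offset, length).

    Simplified illustration only: real CDX indexes also store timestamps,
    MIME types, status codes and digests, and handle gzipped records.
    """
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected data at offset {offset}")

        # Read the WARC header block up to the empty line that ends it.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()

        # Skip the payload without reading it into memory.
        stream.seek(int(headers[b"content-length"]), io.SEEK_CUR)

        uri = headers.get(b"warc-target-uri")
        if uri is not None and headers.get(b"warc-type") == b"response":
            # Later captures of the same URI overwrite earlier ones.
            index[uri.decode()] = (offset, stream.tell() - offset)
    return index


def read_record(stream, index, uri):
    """Jump directly to one record using the index (no linear search)."""
    offset, length = index[uri]
    stream.seek(offset)
    return stream.read(length)
```

With an index like this, a lookup costs one seek and one read, however large 
the archive is. That is the problem CDX files solve for replay tools.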

---

Alternatively, run wget with the --convert-links option against your WARC / 
pywb solution to crawl the replayed site into static files. This should be 
faster than 40 mins per page on average, since CSS and branding images should 
only have to be retrieved once (assuming sane site design).
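
As a rough sketch of what that wget run might look like, the Python snippet 
below only assembles the command line. The pywb address (localhost:8080), the 
collection name "my-collection", the old site's URL and the output directory 
are all placeholders I've assumed, not details from this thread. The flags 
themselves are standard wget options.

```python
import shlex


def wget_mirror_command(start_url, output_dir):
    """Build a wget command that crawls a replayed site into static files."""
    return [
        "wget",
        "--recursive",          # follow links from the start page
        "--level=inf",          # no depth limit on recursion
        "--page-requisites",    # also fetch CSS, images, scripts
        "--convert-links",      # rewrite links to work from local files
        "--adjust-extension",   # save HTML pages with an .html suffix
        f"--directory-prefix={output_dir}",
        start_url,
    ]


if __name__ == "__main__":
    # Hypothetical pywb replay URL and output directory.
    cmd = wget_mirror_command(
        "http://localhost:8080/my-collection/http://old.example.edu/",
        "static-site",
    )
    print(shlex.join(cmd))
    # To actually run the crawl: subprocess.run(cmd, check=True)
```

Whether the result browses cleanly will depend on how pywb rewrites URLs 
during replay, so treat this as a starting point to experiment with.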

cheers
stuart
--
...let us be heard from red core to black sky

On Thu, 5 Mar 2020 at 04:37, Demian Katz <[email protected]> wrote:

> Hello, everyone –
>
> I’ve been struggling with a use case that feels like it can’t be 
> unique to my situation. Wondering if anyone else has solved this!
>
> We’ve decommissioned an old dynamic site, and we still want to make 
> the content available in a static form. It was a large and complex 
> site with a lot of pages, and after trying a variety of solutions, we 
> ended up harvesting it all into a WARC file. This is great for 
> archival purposes, but we’re struggling with presentation.
>
> The problem with serving content from a WARC is that it seems to be 
> unbearably slow in every solution we try. (And when I say unbearably, 
> I mean “40 minutes to load one page using pywb” – not kidding).
>
> I assume that this slowness has to do with dynamically navigating 
> around in a multi-gigabyte file to retrieve things… but really all we 
> want to do is serve up static content.
>
> Is there some tool that can simply unpack a WARC into a directory of 
> static files that can be navigated quickly? It seems like this should 
> be possible, but I’m coming up empty in searching.
>
> And just to be clear: I understand that unpacking a WARC probably 
> won’t retain all of the richness of detail that dynamic retrieval from 
> the WARC can provide, and I certainly don’t plan to throw away the 
> WARC… but for people who just want to quickly navigate content from 
> the most recently-crawled version of the old site, I want a solution 
> that will perform acceptably, and I haven’t found it yet.
>
> Thanks for any and all advice! 😊
>
> - Demian
>
