Is your current archive fetching things from servers that only do SSLv2? Or is this a theoretical concern?
We do indeed occasionally come across old servers that only support insecure ciphers or ancient protocols. Since those are the websites at high risk of being completely forgotten and just vanishing some day because someone unplugged an old server that had been sitting in a corner for years, we unfortunately have to keep support for any protocol that's still observed in the wild, no matter how ancient, insecure, or rare it is, and regardless of whether it is accessible with an unmodified typical web browser. Just to give perspective, we still regularly deal with HTTP/1.0 and even HTTP/0.9 servers as well, both protocols that were already outdated 20 years ago. So not supporting SSLv2 is sadly not an option for us.
But that is incomplete. It doesn’t tell you the IP address, v4 or v6. Given that your first message said you were concerned about the kind of response you got, I would expect that knowing the exact IP address you reached would be important. Saying “archive.org” will give you what the DNS system (and its complicated interaction of resolvers and DNS-based filtering) thinks it is **now**. It does not tell you what it was at the time of the archive fetch. Of course, IP addresses move as well, so that’s not perfect either. I don’t know what would be.
The IP address we actually connected to is already recorded in the WARC-IP-Address header of each record (see the example). I suppose the hostname looked up over DNS, the one sent in SNI, and the one sent in the HTTP Host header could all be different in theory. Using the IP address in the URI might be preferable in that case. (Recording the DNS query and reply is something we're also looking into, but that is unrelated to TLS of course.)
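To illustrate the case where the three names disagree, here is a minimal Python sketch of that fallback. The `FetchNames` type, its field names, and the `uri_authority` helper are all hypothetical and not part of any WARC tooling; the sketch also assumes the Host header value carries no port.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class FetchNames:
    """Names involved in one fetch. All field names are hypothetical."""
    dns_name: str    # hostname that was resolved over DNS
    sni_name: str    # server_name sent in the TLS ClientHello
    http_host: str   # value of the HTTP Host header (assumed to have no port)
    ip_address: str  # address actually connected to, as in WARC-IP-Address


def uri_authority(names: FetchNames, port: int) -> str:
    """Return host:port for a record URI.

    If DNS name, SNI and Host all agree, keep the hostname; otherwise
    fall back to the IP literal, since no single name describes the fetch.
    """
    distinct = {
        names.dns_name.lower(),
        names.sni_name.lower(),
        names.http_host.lower(),
    }
    if len(distinct) == 1:
        host = names.dns_name
    else:
        ip = ipaddress.ip_address(names.ip_address)
        # IPv6 literals must be bracketed inside a URI authority (RFC 3986).
        host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{host}:{port}"
```

Whether an archive would really want the IP literal in the URI is a policy question; the sketch only shows that the decision can be made mechanically once all three names are recorded.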
Your proposal also doesn’t address which protocol was used to do the fetching. Maybe that information is stored in another part of the WARC file, but your description quoted above is still incomplete. What version of HTTP are you using? Or is it gopher? RealPlayer audio? H3? You cannot intuit that just from the “443” and if you are concerned about SSLv2, presumably you also want dead formats like the first two.
Yes, this information is recorded in a WARC header on the HTTP records, though only since very recently and not yet as part of the WARC specification. For example, an 'HTTPS' retrieval that was actually HTTP/1.0 with SSLv2 would have 'WARC-Protocol: ssl/2' and 'WARC-Protocol: http/1.0' headers (though the request record might say 'http/1.1' instead, since we potentially don't know the server is HTTP/1.0 at request time; a similar thing would apply when recording a backwards-compatible ClientHello where the server replies with an older TLS version). Other protocols like Gopher or RTSP (also not yet part of the WARC spec) would get a similar WARC-Protocol value.
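As a rough illustration of the header layout described above, the following Python sketch emits one WARC-Protocol line per protocol layer next to WARC-IP-Address. The function name `warc_protocol_headers` is hypothetical, the address is a documentation placeholder, and WARC-Protocol itself is not yet part of the WARC specification, so the exact syntax may change.

```python
def warc_protocol_headers(ip_address: str, protocols: list[str]) -> str:
    """Build the connection-related WARC header lines for one record.

    `protocols` lists the observed layers from the outside in,
    e.g. ["ssl/2", "http/1.0"]; each gets its own WARC-Protocol line.
    """
    lines = [f"WARC-IP-Address: {ip_address}"]
    lines += [f"WARC-Protocol: {proto}" for proto in protocols]
    # WARC header lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


# Response record for an 'HTTPS' fetch that was really HTTP/1.0 over SSLv2.
print(warc_protocol_headers("192.0.2.1", ["ssl/2", "http/1.0"]), end="")
```

The matching request record could be built the same way with `["ssl/2", "http/1.1"]` if the client sent an HTTP/1.1 request before learning the server's version.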
_______________________________________________ TLS mailing list TLS@ietf.org https://www.ietf.org/mailman/listinfo/tls