Using wget and http to download an entire site means numerous TCP opens and HTTP GET requests. The entire point of rsync is that it knows there are numerous downloads. It does ONE open. This allows TCP slow start to ramp up once, instead of starting over for every file.
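To put rough numbers on the one-open-versus-many argument, here is a back-of-the-envelope sketch in Python. It is an idealized model, not a measurement: it assumes no packet loss, a congestion window that doubles every round trip up to a cap, and one round trip per TCP handshake. The function names and the default values (MSS of 1460 bytes, initial window of 4 segments, cap of 64 segments) are illustrative assumptions of mine, and the model ignores request/response round trips entirely.

```python
def slow_start_rtts(total_bytes, mss=1460, init_cwnd=4, max_cwnd=64):
    """Round trips needed to send total_bytes under idealized slow start.

    Assumption: cwnd starts at init_cwnd segments and doubles each round
    trip until it reaches max_cwnd; no loss, no delayed ACK effects.
    """
    segments = -(-total_bytes // mss)  # ceiling division
    cwnd = init_cwnd
    rtts = 0
    while segments > 0:
        segments -= cwnd               # one window's worth sent per round trip
        cwnd = min(cwnd * 2, max_cwnd)
        rtts += 1
    return rtts


def per_file_connections(file_sizes, handshake_rtts=1, **kwargs):
    """Total round trips when every file gets its own fresh TCP connection."""
    return sum(handshake_rtts + slow_start_rtts(size, **kwargs)
               for size in file_sizes)


def single_connection(file_sizes, handshake_rtts=1, **kwargs):
    """Total round trips when all files share one already-warm connection."""
    return handshake_rtts + slow_start_rtts(sum(file_sizes), **kwargs)


if __name__ == "__main__":
    files = [50_000] * 100  # hypothetical mirror: 100 files of 50 kB each
    print("one connection per file:", per_file_connections(files), "round trips")
    print("single connection:      ", single_connection(files), "round trips")
```

Under these assumptions the per-file approach pays the handshake and the small initial window again for every file, while the single connection pays them once and then runs at the capped window for the rest of the transfer.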
A multi-download session with ftp is also efficient. Clients like ncftp have batch transfer built in. If setting up an initial mirror you might do better with ftp, but maintaining it is where rsync rules.

I haven't looked closely, but I have the impression from watching wget work that wget using HTTP::Date opens two TCP connections per file: it opens a socket and issues a request for the timestamp, then closes it, then opens a socket to issue an HTTP GET if it wants the file. Then it closes that socket and the process repeats for the next file. It keeps hoping for the timestamp even if the server doesn't support HTTP::Date.

Rsync and ftp are stateful; http is not. For getting just one file, http is better, since you skip the whole login thing and setting up data and control sockets. So a CPAN client session will do better with an http mirror: it gets a tar.gz, opens it up, processes it, and then goes back, many seconds after the original request, for the first dependency. Repeat until the entire dependency tree is completed.

-----Original Message-----
From: Nicholas Clark <n...@ccl4.org>
Date: Sun, 28 Mar 2010 17:20:34
To: Arthur Corliss<corl...@digitalmages.com>
Cc: Elaine Ashton<eash...@mac.com>; <cpan-work...@perl.org>; <module-authors@perl.org>
Subject: Re: Trimming the CPAN - "Automatic Purging"

On Sat, Mar 27, 2010 at 08:52:22PM -0800, Arthur Corliss wrote:
> On Sat, 27 Mar 2010, Elaine Ashton wrote:
>
> >Actually, I thought I was merely offering my opinion both as the sysadmin
> >for the canonical CPAN mothership and as an end-user. If that makes me a
> >prick, well, I suppose I should go out and buy one :)
>
> :-) You'll have to pardon my indiscriminate epithets. The barbs are coming
> from multiple directions. My point still stands, however. Your experience,
> however worthy, has zero bearing on whether or not my experience is
> just as worthy.
> Even moreso when you guys have zero clue who you're talking

Are you running a large public mirror site, where you don't even have knowledge
of who is mirroring from you? (Not even knowledge, let alone channels of
communication with, let alone control over)

Because (as I see it, not having done any of this) the logistics of that is
going to have as much bearing on trying to change protocols as the actual
technical merits of the protocol itself. Most of the cost of rsync is an
externality to the clients. If one has an existing mirror, one is using rsync
to keep it up to date, what's the incentive to change?

> Sounds like you may be hamstrung by your own bureaucracy, but that's rarely
> the case in most the places I've worked. Not to mention that between
> passive mode FTP or even using an HTTP proxy (most of which support FTP
> requests) what I'm proposing is relatively painless, simple, and easy to
> secure. This concern I suspect is a non-issue for most mirror operators.
> Even if it was, allow them to pull it via HTTP for all I care. Either one
> is significantly more efficient than rsync.

I'm missing something here, I suspect. How can HTTP be more efficient than
rsync? The only obvious method to me of mirroring a CPAN site by HTTP is to
instruct a client (such as wget) to get it all. In which case, in the course
of doing this the client is going to recurse over the entire directory tree
of the server, which, I thought, was functionally equivalent to the behaviour
of the rsync server.

Nicholas Clark