Using wget over http to download an entire site means numerous TCP opens and 
HTTP GET requests. The entire point of rsync is that it knows there are 
numerous downloads, so it does ONE open. This allows TCP slow start to ramp up.

A multi-download session with ftp is also efficient. Clients like ncftp have 
batch transfer built in. If you are setting up an initial mirror you might do 
better with ftp, but maintaining it is where rsync rules. 

I haven't looked closely, but I have the impression from watching wget work that 
wget using HTTP::Date opens two TCP connections per file: it opens a socket, 
issues a request for the timestamp, and closes it, then opens another socket to 
issue an http GET if it wants the file. Then it closes that socket and the 
process repeats for the next file. It keeps asking for the timestamp even if the 
server doesn't support HTTP::Date.
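
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. 
It is only a model under assumptions of my own: every TCP open costs one round 
trip, every request costs one more, and slow start, transfer time and teardown 
are ignored. The function names and the example figures are made up for 
illustration and are not measurements of wget or rsync.

```python
def round_trips_separate(num_files, connections_per_file=1):
    """Round trips when every request gets its own TCP connection.

    Assumed cost per connection: 1 round trip for the TCP handshake
    plus 1 round trip for the request and the start of the response.
    """
    return num_files * connections_per_file * 2


def round_trips_shared(num_files):
    """Round trips when all files share ONE TCP connection.

    Assumed cost: 1 handshake in total, then 1 round trip per file.
    This is pessimistic for protocols that pipeline their requests.
    """
    return 1 + num_files


def latency_seconds(round_trips, rtt_ms):
    """Time spent purely waiting on the network for those round trips."""
    return round_trips * rtt_ms / 1000.0


if __name__ == "__main__":
    files = 10_000   # hypothetical mirror size
    rtt_ms = 100     # hypothetical round-trip time
    for label, trips in [
        ("two connections per file", round_trips_separate(files, 2)),
        ("one connection per file", round_trips_separate(files, 1)),
        ("one connection in total", round_trips_shared(files)),
    ]:
        print(f"{label}: {trips} round trips, "
              f"{latency_seconds(trips, rtt_ms):.0f} s of latency")
```

Even in this crude model the fixed per-file overhead dominates once the file 
count is large, which is the point about doing ONE open.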

Rsync and ftp are stateful; http is not. For getting just one file, http is 
better since you skip the whole login step and the setup of data and control 
sockets. 
So a CPAN client session will do better with an http mirror: it gets a tar.gz, 
opens it up, processes it, and then goes back, many seconds after the original 
request, for the first dependency. Repeat until the entire dependency tree is 
completed.
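
The single-file case can be sketched by listing the exchanges each protocol 
needs before the file arrives. The step lists below are my assumption of a 
typical anonymous passive-mode ftp session and a plain http GET; real clients 
and servers differ in the details, so treat the counts as illustrative only.

```python
# Assumed exchanges for fetching ONE file. Real sessions vary by
# client and server; these lists are illustrative, not a trace.
FTP_SINGLE_FILE = [
    "TCP handshake on the control connection, wait for 220 greeting",
    "USER anonymous -> 331",
    "PASS <email> -> 230",
    "TYPE I -> 200",
    "PASV -> 227",
    "TCP handshake on the data connection",
    "RETR <file> -> 150, data, 226",
]

HTTP_SINGLE_FILE = [
    "TCP handshake",
    "GET <file> -> 200 and the body",
]


def exchange_count(steps):
    """Number of request/response exchanges in a step list."""
    return len(steps)


def setup_overhead(steps):
    """Exchanges spent before the one that actually carries the file."""
    return exchange_count(steps) - 1


if __name__ == "__main__":
    for name, steps in [("ftp", FTP_SINGLE_FILE), ("http", HTTP_SINGLE_FILE)]:
        print(f"{name}: {exchange_count(steps)} exchanges, "
              f"{setup_overhead(steps)} of them setup")
```

With many seconds between a CPAN client's requests there is little to gain 
from holding a session open, so the short http exchange wins for that usage.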


-----Original Message-----
From: Nicholas Clark <n...@ccl4.org>
Date: Sun, 28 Mar 2010 17:20:34 
To: Arthur Corliss<corl...@digitalmages.com>
Cc: Elaine Ashton<eash...@mac.com>; <cpan-work...@perl.org>; 
<module-authors@perl.org>
Subject: Re: Trimming the CPAN - "Automatic Purging"

On Sat, Mar 27, 2010 at 08:52:22PM -0800, Arthur Corliss wrote:
> On Sat, 27 Mar 2010, Elaine Ashton wrote:
> 
> >Actually, I thought I was merely offering my opinion both as the sysadmin 
> >for the canonical CPAN mothership and as an end-user. If that makes me a 
> >prick, well, I suppose I should go out and buy one :)
> 
> :-) You'll have to pardon my indiscriminate epithets.  The barbs are coming
> from multiple directions.  My point still stands, however.  Your experience,
> however worthy, has zero bearing on whether or not my experience is
> just as worthy.  Even moreso when you guys have zero clue who you're talking

Are you running a large public mirror site, where you don't even have
knowledge of who is mirroring from you?

(Not even knowledge, let alone channels of communication with, let alone
control over)

Because (as I see it, not having done any of this) the logistics of that is
going to have as much bearing on trying to change protocols as the actual
technical merits of the protocol itself.

Most of the cost of rsync is an externality to the clients. If one has an
existing mirror, one is using rsync to keep it up to date, what's the
incentive to change?

> Sounds like you may be hamstrung by your own bureaucracy, but that's rarely
> the case in most of the places I've worked.  Not to mention that between
> passive mode FTP or even using an HTTP proxy (most of which support FTP
> requests) what I'm proposing is relatively painless, simple, and easy to
> secure.  This concern I suspect is a non-issue for most mirror operators.
> Even if it was, allow them to pull it via HTTP for all I care.  Either one
> is significantly more efficient than rsync.

I'm missing something here, I suspect. How can HTTP be more efficient than
rsync? The only obvious method to me of mirroring a CPAN site by HTTP is to
instruct a client (such as wget) to get it all. In which case, in the course
of doing this the client is going to recurse over the entire directory tree
of the server, which, I thought, was functionally equivalent to the behaviour
of the rsync server.

Nicholas Clark
