On Sun, 28 Mar 2010, Dana Hudes wrote:

Use of wget and http to download an entire site means numerous TCP opens and 
HTTP GET requests. The entire point of rsync is that it knows there are 
numerous downloads. It does ONE open. This allows TCP slow start to ramp up.

That wasn't exactly what I was suggesting.  And we'll ignore HTTP's
Keep-Alive support for the time being, which negates your TCP open issue.  If
you're fetching transaction logs from which you can determine beforehand
precisely what files to retrieve, HTTP or FTP will beat the pants off of
letting rsync work out what you need to retrieve and then deliver it.
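
To make the transaction-log idea concrete, here is a minimal sketch of how a
client might pick its downloads from such a log. The log format (one
"<sequence> <path>" entry per line), the function name and the sample paths
are all assumptions made for illustration; nothing in this thread specifies
them.

```python
# Hypothetical log format: one "<sequence> <path>" entry per line.
def files_since(log_lines, last_seen):
    """Return paths logged after sequence number last_seen.

    Order is preserved and a path that appears more than once is
    returned only once, so the client fetches each file a single time.
    """
    wanted = []
    seen = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        seq_text, path = line.split(None, 1)
        if int(seq_text) > last_seen and path not in seen:
            seen.add(path)
            wanted.append(path)
    return wanted


# Illustrative data only; these are not real archive paths.
sample_log = [
    "101 dists/Foo-1.0.tar.gz",
    "102 dists/Bar-0.3.tar.gz",
    "103 dists/Foo-1.0.tar.gz",
    "104 dists/Baz-2.1.tar.gz",
]
to_fetch = files_since(sample_log, 101)
```

The client only needs to remember the last sequence number it processed; the
resulting list can then be handed to any HTTP or FTP batch fetcher.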

A multi-download session with ftp is also efficient. Clients like ncftp have 
batch transfer built in. If setting up an initial mirror, you might do better 
with ftp, but maintaining it is where rsync rules.

I haven't looked closely, but I have the impression from watching wget work that 
wget using HTTP::Date opens two TCP connections per file: it opens a socket and 
issues a request for the timestamp, then closes it, then opens a socket to issue 
an HTTP GET if it wants the file. Then it closes that socket and the process 
repeats for the next file. It keeps hoping for the timestamp even if the server 
doesn't support HTTP::Date.
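
Whatever wget actually does here, HTTP itself lets the timestamp check and
the fetch be folded into a single request with an If-Modified-Since header,
and several such requests can share one connection. A rough sketch using
Python's standard http.client follows; the helper names and the mirror host
in the usage comment are hypothetical.

```python
import http.client
from email.utils import formatdate


def conditional_headers(local_mtime):
    """Build the header asking the server to send the file only if newer."""
    # formatdate(..., usegmt=True) yields an HTTP-style date ending in "GMT".
    return {"If-Modified-Since": formatdate(local_mtime, usegmt=True)}


def fetch_if_newer(conn, path, local_mtime):
    """Issue one conditional GET on an already-open connection.

    Returns the body if the server sent a newer copy (200),
    or None if the local copy is still current (304).
    """
    conn.request("GET", path, headers=conditional_headers(local_mtime))
    resp = conn.getresponse()
    body = resp.read()  # drain the response so the connection can be reused
    if resp.status == 304:
        return None
    if resp.status == 200:
        return body
    raise RuntimeError("unexpected status %d for %s" % (resp.status, path))


# Usage sketch (hypothetical host; not executed here):
#   conn = http.client.HTTPConnection("mirror.example.org")
#   for path, mtime in local_files:
#       data = fetch_if_newer(conn, path, mtime)
#   conn.close()
```

This assumes the server honours If-Modified-Since and keeps the connection
open between requests; a server that does neither simply answers every
request with a full 200 response.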

Rsync and ftp are stateful; http is not. For getting just one file, http is 
better, since you skip the whole login thing and setting up data and control 
sockets.
So a CPAN client session will do better with an http mirror: it gets a tar.gz, 
opens it up, processes it, and then goes back many seconds after the original 
request for the first dependency. Repeat until the entire dependency tree is 
completed.

Dude, you definitely don't understand what we're discussing.  And neither
rsync, ftp, nor http is stateful -- that's the problem.  Rsync has to
build a picture of the repository's state *per* request, even the old files
that haven't been touched in years.  It then uses that information to select
and deliver the new files you need.  Maintaining state means that you
maintain knowledge of state over time, across multiple requests.  And rsync
doesn't do that, it simulates that.  Quite cleverly, but in a very
expensive way which is borne by the server.
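
As a rough illustration of that cost (this is not rsync's actual algorithm,
just a sketch of what any scan that keeps no state between requests has to
do), the function below stats every file under a tree on each call, however
few of them have changed. The function name and the cutoff parameter are
made up for the example.

```python
import os


def changed_since(root, cutoff):
    """Stat every file under root on each call.

    Returns (files_checked, changed_paths). The work done grows with the
    total number of files in the tree, not with the number that changed.
    """
    checked = 0
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            checked += 1
            if os.stat(full).st_mtime > cutoff:
                changed.append(os.path.relpath(full, root))
    return checked, sorted(changed)
```

With a maintained change log, the same question is answered by reading only
the entries added since the client's last visit.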

        --Arthur Corliss
          Live Free or Die
