On Sun, 28 Mar 2010, Dana Hudes wrote:
Use of wget and HTTP to download an entire site means numerous TCP opens and HTTP GET requests. The entire point of rsync is that it knows there are numerous downloads. It does ONE open. This allows TCP slow start to ramp up.
That wasn't exactly what I was suggesting. And we'll ignore HTTP's Keep-Alive support for the time being, which negates your TCP open issue. If you're fetching transaction logs from which you can determine beforehand precisely what files to retrieve, HTTP or FTP will beat the pants off of letting rsync work out what you need to retrieve and then deliver it.
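To make the transaction-log idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration only: the log format (one tab-separated `sequence, operation, path` record per line), the operation names, and the helper names `plan_fetches` and `fetch_all`. No existing mirror is known to publish a log in this form. The client remembers the last sequence number it processed, reads only the newer records, and then pulls the changed files over one HTTP connection that is reused for as long as the server honours keep-alive.

```python
import http.client
from typing import Dict, List, Tuple


def plan_fetches(log_text: str, last_seen: int) -> Tuple[List[str], List[str], int]:
    """Return (paths to fetch, paths to delete, newest sequence number).

    Assumed (hypothetical) log format, one record per line:
        <sequence>\t<add|update|delete>\t<path>
    """
    latest_op: Dict[str, str] = {}  # path -> most recent operation after last_seen
    newest = last_seen
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        seq_str, op, path = line.split("\t", 2)
        seq = int(seq_str)
        if seq <= last_seen:
            continue  # already handled in an earlier run
        latest_op[path] = op
        newest = max(newest, seq)
    to_fetch = [p for p, op in latest_op.items() if op != "delete"]
    to_delete = [p for p, op in latest_op.items() if op == "delete"]
    return to_fetch, to_delete, newest


def fetch_all(host: str, paths: List[str]) -> Dict[str, bytes]:
    """GET every path over a single connection, reused under keep-alive."""
    conn = http.client.HTTPConnection(host, timeout=30)
    bodies: Dict[str, bytes] = {}
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()  # drain the response before sending the next request
            if resp.status == 200:
                bodies[path] = body
    finally:
        conn.close()
    return bodies
```

The client would persist the returned sequence number between runs and pass it back as `last_seen` next time. The server never has to walk its whole tree; it only appends a line to the log when a file changes.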
A multi-download session with FTP is also efficient. Clients like ncftp have batch transfer built in. If you are setting up an initial mirror you might do better with FTP, but maintaining it is where rsync rules. I haven't looked closely, but I have the impression from watching wget work that wget using HTTP::Date opens two TCP connections per file: it opens a socket, issues a request for the timestamp, and closes it, then opens another socket to issue an HTTP GET if it wants the file. Then it closes that socket and the process repeats for the next file. It keeps hoping for the timestamp even if the server doesn't support HTTP::Date. Rsync and FTP are stateful; HTTP is not. For getting just one file, HTTP is better, since you skip the whole login step and the setup of separate data and control sockets. So a CPAN client session will do better with an HTTP mirror: it gets a tar.gz, opens it up, processes it, and then goes back, many seconds after the original request, for the first dependency. Repeat until the entire dependency tree is completed.
Dude, you definitely don't understand what we're discussing. And neither rsync, FTP, nor HTTP is stateful -- that's the problem. Rsync has to build a picture of the repository's state *per* request, even for the old files that haven't been touched in years. It then uses that information to select and deliver the new files you need. Maintaining state means that you maintain knowledge of state over time, across multiple requests. And rsync doesn't do that; it simulates that. Quite cleverly, but in a very expensive way, and the cost is borne by the server.

	--Arthur Corliss
	  Live Free or Die