Hi!

Now the key, as I see it, is that unlike all the other use cases where rsync is used, large mirrors are likely to have their directories transferred directly from another mirror. So the client that pulled the tree update down could store a list of changed files, and the server could then use that list to determine which files need to be synced to the downstream mirror. (Sure, the origin site has to generate the list in the first place, but if the files are uploaded through a tool like PAUSE, that shouldn't be hard to do.)

Agreed, but I'm not sure we've gotten past the stat storm on the server,
though.

Ok, this might be a completely wacky idea, but couldn't we use some kind of version control system?

Before you kick my backside, hear me out: this is of course very theoretical at the moment, and there are probably quite a number of pitfalls and kinks to work out...

Currently, there's CPAN and Backpan, with Backpan playing the archive.

Suppose, just suppose, we see that as a kind of old-style, simplistic version control system: CPAN is a checkout of the latest version of all files, and Backpan holds the older versions.

Now, suppose we were to put all files into Mercurial, Git or the like, renaming the files so they don't have version numbers in their names, and storing them sequentially as commits so that new versions update old ones.

Now, a new mirror would (once) ask for the latest version without the history of all the files, meaning it has to make a complete "checkout" of the latest version. No way around that, really. Call that version FOO.
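Git can already do this initial step: a shallow clone transfers only the latest snapshot, not the history. A toy sketch with a local stand-in repository (all names and contents are made up):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Toy stand-in for the master archive: three sequential "releases"
# of one module, committed under a version-free filename.
git init -q archive && cd archive
git config user.email demo@example.org && git config user.name Demo
for v in 1 2 3; do
  echo "release $v" > Some-Module.tar.gz
  git add Some-Module.tar.gz && git commit -qm "release $v"
done
cd ..

# A new mirror "checks out" only the latest state (--depth 1): this is
# the one full copy it cannot avoid, but without the whole history.
git clone -q --depth 1 "file://$tmp/archive" mirror

git -C mirror rev-list --count HEAD   # only 1 commit was transferred, not 3
cat mirror/Some-Module.tar.gz         # yet the files are the latest release
```

The commit id that `HEAD` points at in the mirror is exactly the "version FOO" marker the scheme needs.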

But suppose 100 modules get updated on the main server, so the server stores 100 changesets, which in many version control systems are stored sequentially in a single file. Call that version BAR.

Now the mirror wants to update again, calls the server and says, "I have version FOO, give me all updates". So the server looks up version FOO in the file (via some shorter index list), opens the main file, seeks to the indicated position and basically dumps the rest of the file over the network to the mirror. The mirror then applies the changesets, taking each chunk as a patch and applying it to the corresponding file(s).
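This incremental step is essentially what a git fetch/pull does today: the mirror announces what it has, and only the missing changesets cross the wire and get applied to the working files. A self-contained sketch (two commits stand in for the 100 updated modules; names are illustrative):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Toy archive with one release; the mirror clones it, so it holds "FOO".
git init -q archive && cd archive
git config user.email demo@example.org && git config user.name Demo
echo "release 1" > Some-Module.tar.gz
git add . && git commit -qm "FOO: initial release"
cd ..
git clone -q "file://$tmp/archive" mirror

# Modules get updated upstream; two commits stand in for the 100.
cd archive
for v in 2 3; do
  echo "release $v" > Some-Module.tar.gz
  git add . && git commit -qm "release $v"
done
cd ..

# The mirror says "I have FOO, give me everything newer": git transfers
# only the missing changesets and fast-forwards the working files.
git -C mirror pull -q --ff-only

git -C mirror rev-list --count HEAD   # now 3 commits
cat mirror/Some-Module.tar.gz         # updated to the latest release
```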

For fast mirroring and legacy clients, the main server would still keep a full directory checkout, allowing the old-style sync. Compressed, slurpable tarballs could also be autogenerated, say once a month.

This could also solve some long-standing problems, like having modules available for legacy production environments. A user would still be able to check out a specific version of CPAN depending on his/her needs, like "give me CPAN as it was on 23rd December 2007".
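With the full history available, a date-based checkout is a two-liner in git: find the last commit before the date, then check it out. A sketch against a toy repository with back-dated commits (dates and filenames are fabricated for the demo):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Toy archive with back-dated commits, so we can ask for a past state.
git init -q archive && cd archive
git config user.email demo@example.org && git config user.name Demo
for d in 2007-01-01 2007-06-01 2008-01-01; do
  echo "state as of $d" > Some-Module.tar.gz
  git add . && GIT_AUTHOR_DATE="$d 12:00" GIT_COMMITTER_DATE="$d 12:00" \
    git commit -qm "release of $d"
done

# "Give me CPAN as it was on 23 December 2007": take the newest commit
# before that date and materialise its tree.
rev=$(git rev-list -n 1 --before="2007-12-23" HEAD)
git checkout -q "$rev"

cat Some-Module.tar.gz   # the mid-2007 state, not the 2008 one
```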


This could work like any modern, distributed version control system. That way, the user would also be able to apply local patches and/or decide which changesets to pull in from the main server. Or keep a complete local mirror alongside one for the production systems, into which he/she pulls changes only after they have been reviewed.
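One possible shape of that "local patches plus reviewed pulls" workflow, sketched with git (branch and file names are illustrative): the site keeps its patches on a branch, inspects what is incoming, and re-applies its patches on top.

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Upstream archive with one release.
git init -q archive
git -C archive config user.email demo@example.org
git -C archive config user.name Demo
echo "release 1" > archive/Some-Module.tar.gz
git -C archive add . && git -C archive commit -qm "release 1"

# A site clones a mirror and keeps local patches on their own branch.
git clone -q "file://$tmp/archive" mirror
git -C mirror config user.email demo@example.org
git -C mirror config user.name Demo
git -C mirror checkout -q -b local-patches
echo "site-specific fix" > mirror/site.patch
git -C mirror add . && git -C mirror commit -qm "local patch"

# Upstream publishes release 2 in the meantime.
echo "release 2" > archive/Some-Module.tar.gz
git -C archive commit -qam "release 2"

# The site fetches, reviews the incoming changesets, and only then
# re-applies its local patch on top of the reviewed upstream state.
git -C mirror fetch -q origin
git -C mirror log --oneline HEAD..origin/HEAD   # the review step
git -C mirror rebase -q origin/HEAD
```

After the rebase the mirror carries the new upstream release plus the local patch, which is exactly the "pull in changes after they have been reviewed" setup.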


NOW it's time to kick my butt, if you want to.

Best regards,
Rene
