On Sun, 21 Jul 2002, jw schultz wrote:
> What i am seeing is a Multi-stage pipeline.
This is quite an interesting design idea. Let me comment on a few things that I've been mulling over since first reading it:

One thing you don't discuss in your data flow is auxiliary data. For instance, error messages need to go somewhere (perhaps mixed into the main data flow), and they need to get back to the side where the user resides. This can add an extra network transfer after the update stage (6) to send errors back to the user (if the user is not on the same side as stage 6).

Another open issue is what we do when a file changes while we're transferring it. Rsync sends a "redo" request to the generator process, which reruns all the changed files at the end of the run. If such a thing is desirable in this utility (instead of just warning the user that the file could not be updated), then this "redo" data flow also needs to be mapped out. If this protocol remains more batch-oriented, then it probably won't need to redo files -- just warn the user.

One of the really nice features of your design is that it is easy to interrupt the flow of data at any point and continue it later. This is useful if the cached information remains valid, since it saves us time/resources either on the next run or when updating multiple destination systems.

One downside to your protocol is that it requires several socket connections between systems. This either mandates using multiple rsh/ssh connections (possibly with multiple password prompts for a single transfer) OR using some kind of socket-forwarding protocol (such as the one provided by ssh). When I proposed adding extra sockets to the rsync protocol a while back, at least one fellow mentioned that a requirement of using ssh would not be an acceptable solution for him, so this area could be a little controversial (depending on what kind of solution we can come up with).

Another question is whether we need to support bi-directional transfer of files over a single connection. My rZync test app supports sending files in both directions simply because it was so easy to add -- having a message-based protocol makes this a breeze.

Your first protocol (the one without any backchannels) looks like it would be a snap to set up using separate processes. It does, as you note, add quite a bit of extra data transmission (such as an extra 2x hit in filename transfer alone).

The backchannels add some complicating factors to the file I/O that will need to be carefully designed to avoid deadlocks. Since the data is strictly ordered, with one chunk for pipe-A and one chunk for pipe-B for each file, the code should be fairly straightforward, though, so hopefully this won't be a big problem.

Caching off data from the backchannel version might be pretty complex, though -- think about interrupting the stream after step 3: you'd need to buffer off the backchannel data from step 1 plus the main output and backchannel data from step 3, and then restart things at steps 4 and 5 with the appropriate main-stream input and backchannel flows. That would be much harder than saving off the one single output flow from step 3 and starting up step 4 later using it, so either the backchannel algorithm may not be very useful in a batch scenario, or we'd need to have a helper script that can figure out how to interrupt and restart the chain of processes at any point.
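To make the "strictly ordered" point concrete, here's a rough sketch in C of the read loop I have in mind. The chunk framing (a 4-byte length prefix, zero meaning end of stream) and all the names are my own invention, not anything from your proposal, so treat it as illustration only:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read exactly len bytes from fd, or die. */
static void read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) {
            fprintf(stderr, "short read on fd %d\n", fd);
            exit(1);
        }
        p += n;
        len -= (size_t)n;
    }
}

/* One length-prefixed chunk: returns malloc'd data, sets *lenp.
 * A zero length means "end of stream".  (Assumes both ends share
 * a byte order; a real protocol would pick one.) */
static char *read_chunk(int fd, uint32_t *lenp)
{
    uint32_t len;
    read_all(fd, &len, sizeof len);
    *lenp = len;
    if (len == 0)
        return NULL;
    char *data = malloc(len);
    if (!data)
        exit(1);
    read_all(fd, data, len);
    return data;
}

/* The ordering rule is what keeps us deadlock-free: for each file
 * we consume exactly one chunk from the main stream (pipe A), then
 * exactly one chunk from the backchannel (pipe B), and only then
 * emit our own output.  Neither pipe is ever read ahead of the
 * other, so no stage can fill a pipe that its peer isn't draining. */
static void process_files(int main_fd, int back_fd, int out_fd)
{
    for (;;) {
        uint32_t alen, blen;
        char *a = read_chunk(main_fd, &alen);   /* pipe-A chunk */
        if (!a)
            break;                              /* end of stream */
        char *b = read_chunk(back_fd, &blen);   /* pipe-B chunk */
        /* ... combine a and b into this stage's output ... */
        write(out_fd, a, alen);                 /* placeholder */
        free(a);
        free(b);
    }
}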
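And for the batch case, the helper I'm imagining is something like a little "backtee" program (entirely hypothetical -- name, interface, and all) that either spools a backchannel off to a file or replays a spool file later, so the chain can be stopped after one stage and restarted from the saved data:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

/* Copy in -> out1, and also in -> out2 unless out2 is negative. */
static int copy_fd(int in, int out1, int out2)
{
    char buf[8192];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out1, buf, n) != n)
            return -1;
        if (out2 >= 0 && write(out2, buf, n) != n)
            return -1;
    }
    return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
    if (argc != 3 || (strcmp(argv[1], "capture") && strcmp(argv[1], "replay"))) {
        fprintf(stderr, "usage: %s capture|replay SPOOLFILE\n", argv[0]);
        return 1;
    }
    if (!strcmp(argv[1], "capture")) {
        /* stdin -> stdout plus a copy into the spool file */
        int spool = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (spool < 0 || copy_fd(0, 1, spool) < 0)
            return 1;
    } else {
        /* replay: spool file -> stdout, as if the stage were live */
        int spool = open(argv[2], O_RDONLY);
        if (spool < 0 || copy_fd(spool, 1, -1) < 0)
            return 1;
    }
    return 0;
}

The helper script would then wire each backchannel through a "backtee capture" process on the interrupted run and a "backtee replay" process on the resumed one, which at least keeps the spooling logic out of the stages themselves.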
I find your idea of allowing the first 4 steps of the scan/compare/checksum sequence to be reversed intriguing. At first I thought that it would be too fragile, since the server's data tends to be updating constantly (and this protocol needs the server data to remain constant from the moment the checksum blocks are created until the client(s) have all fetched the updated data). However, I can see that this may well be a really nice way to update an archive and let multiple (non-identical) clients request updates. It will require an extension to librsync to allow a reversed rolling-checksum diff option, plus an option to separate the diff and transmit stages (which are currently done at the same time), so this idea carries a bigger overhead than the rest of the tool as far as the rsync protocol is concerned.

The most efficient multi-server duplication process would be to save off the output of the transmit phase and send it to multiple systems for just the final update phase. This does require that the destination machines all have identical file trees for the updating to work, though, so it only works on tightly-controlled mirrors. The advantage is that the server expends no resources beyond getting the update stream transmitted to the clients (who can duplicate the stream without the server's help).

Since your proposed protocol seems to fit so well with batch-oriented scenarios while potentially having problems in the more interactive ones, I'm wondering if this should be a separate utility set from a more interactive program (which I think should use a message-oriented protocol over a single 2-way socket/pipe). The alternative is to add batch-output code to an interactive program (like what was done with rsync), which would probably be harder to maintain and less flexible than a set of batch-oriented utilities.

What do you think?

..wayne..