I agree that rsync is probably preferable for updating, and it seems like for that purpose it would also parallelize well: most of the time is spent computing checksums, so the process is not constrained by the total I/O capacity of the master. However, it is a problem for the initial replication from the master to the slaves. If you are running on EC2, the dollar overhead of launching is quadratic in the number of slaves, because every slave is billed for the whole time the one-slave-at-a-time copy takes. At roughly one minute per slave, launching a 100-machine cluster means waiting about 100 minutes but paying for 10,000 machine-minutes, or about 167 machine-hours, before anything useful starts to happen.
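To make the arithmetic concrete, here is a rough sketch of the cost model. It assumes the roughly one minute per slave I saw in my launches, that the copy is effectively sequential, and that every slave is already running (and billed) from the start of the copy. The function name and parameters are just for illustration.

```python
def launch_overhead(num_slaves, minutes_per_slave=1.0):
    """Return (wall_clock_minutes, billed_machine_minutes) for a sequential copy.

    Assumption: each slave adds `minutes_per_slave` to the launch, and all
    slaves are billed for the entire launch, so the bill grows as n * n.
    """
    wall_clock = num_slaves * minutes_per_slave
    billed = num_slaves * wall_clock
    return wall_clock, billed


if __name__ == "__main__":
    for n in (10, 100):
        wall, billed = launch_overhead(n)
        print(f"{n} slaves: wait {wall:.0f} min, "
              f"pay for {billed:.0f} machine-minutes "
              f"(~{billed / 60:.0f} machine-hours)")
```

The master's own billed time is ignored here; it only adds a linear term.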
On Mon, May 19, 2014 at 1:32 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> One issue with using Spark itself is that this rsync is required to get
> Spark to work...
>
> Also note that a similar strategy is used for *updating* the spark
> cluster on ec2, where the "diff" aspect is much more important, as you
> might only make a small change on the driver node (recompile or
> reconfigure) and can get a fast sync.
>
>
> On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury <
> mosharafka...@gmail.com> wrote:
>
>> What twitter calls murder, unless it has changed since then, is just a
>> BitTornado wrapper. In 2011, We did some comparison on the performance of
>> murder and the TorrentBroadcast we have right now for Spark's own broadcast
>> (Section 7.1 in
>> http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
>> Spark's implementation was 4.5X faster than murder.
>>
>> The only issue with using TorrentBroadcast to deploy code/VM is writing a
>> wrapper around it to read from disk, but it shouldn't be too complicated.
>> If someone picks it up, I can give some pointers on how to proceed (I've
>> thought about doing it myself forever, but never ended up actually taking
>> the time; right now I don't have enough free cycles either)
>>
>> Otherwise, murder/BitTornado would be better than the current strategy we
>> have.
>>
>> A third option would be to use rsync; but instead of rsync-ing to every
>> slave from the master, one can simply rsync from the master first to one
>> slave; then use the two sources (master and the first slave) to rsync to
>> two more; then four and so on. Might be a simpler solution without many
>> changes.
>>
>> --
>> Mosharaf Chowdhury
>> http://www.mosharaf.com/
>>
>>
>> On Sun, May 18, 2014 at 11:07 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> My first thought would be to use libtorrent for this setup, and it turns
>>> out that both Twitter and Facebook do code deploys with a bittorrent setup.
>>> Twitter even released their code as open source:
>>>
>>> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>>>
>>> http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/
>>>
>>>
>>> On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>
>>>> I am not an expert in this space either. I thought the initial rsync
>>>> during launch is really just a straight copy that did not need the tree
>>>> diff. So it seemed like having the slaves do the copying among each
>>>> other would be better than having the master copy to everyone directly.
>>>> That made me think of bittorrent, though there may well be other systems
>>>> that do this.
>>>> From the launches I did today it seems that it is taking around 1
>>>> minute per slave to launch a cluster, which can be a problem for clusters
>>>> with 10s or 100s of slaves, particularly since on ec2 that time has to be
>>>> paid for.
>>>>
>>>>
>>>> On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>
>>>>> Out of curiosity, do you have a library in mind that would make it
>>>>> easy to set up a bittorrent network and distribute files in an rsync
>>>>> (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with
>>>>> this space, but we do want to minimize the complexity of our standard
>>>>> ec2 launch scripts to reduce the chance of something breaking.
>>>>>
>>>>>
>>>>> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>
>>>>>> I am launching a rather large cluster on ec2.
>>>>>> It seems like the launch is taking forever on
>>>>>> ....
>>>>>> Setting up spark
>>>>>> RSYNC'ing /root/spark to slaves...
>>>>>> ...
>>>>>>
>>>>>> It seems that bittorrent might be a faster way to replicate
>>>>>> the sizeable spark directory to the slaves
>>>>>> particularly if there are a lot of not very powerful slaves.
>>>>>>
>>>>>> Just a thought ...
>>>>>>
>>>>>> cheers
>>>>>> Daniel
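
For what it's worth, the doubling rsync scheme Mosharaf describes in the quoted thread could be planned roughly as below. This is only a sketch of the schedule, not anything the ec2 scripts currently do: the function name and host names are hypothetical, and it assumes every host that already has the directory can serve exactly one new slave per round, with the copies in a round running in parallel.

```python
def doubling_schedule(master, slaves):
    """Plan rounds of (source, target) copies where the source set doubles.

    Round 1 copies master -> 1 slave, round 2 uses both hosts as sources
    for 2 more slaves, then 4, and so on until every slave is covered.
    """
    sources = [master]
    pending = list(slaves)
    rounds = []
    while pending:
        # Each current source serves at most one new target this round.
        batch = pending[:len(sources)]
        pending = pending[len(sources):]
        rounds.append(list(zip(sources, batch)))
        # Targets that just received the data become sources next round.
        sources.extend(batch)
    return rounds


if __name__ == "__main__":
    hosts = [f"slave{i}" for i in range(100)]  # hypothetical host names
    plan = doubling_schedule("master", hosts)
    for number, pairs in enumerate(plan, start=1):
        print(f"round {number}: {len(pairs)} parallel copies")
```

Each (source, target) pair would correspond to one rsync run from the source host to the target. The number of rounds grows with the logarithm of the cluster size instead of linearly, which is the point of the scheme.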