For starters, thanks for the awesome product! When creating EC2 clusters of 20-40 nodes, things work great. When we create larger clusters with the provided spark-ec2 script, though, it takes hours: a 200-node cluster takes about 2 1/2 hours, and a 500-node cluster takes over 5 hours. One other problem we are having is that some nodes don't come up at the same time as the others, and the process seems to just move on, skipping the rsync and any installs on those nodes.
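To make the second problem concrete, here is a minimal sketch of the kind of readiness check we have in mind. It is not taken from spark-ec2: the function names (`port_open`, `wait_for_nodes`), the choice of the SSH port, and the timeout values are all assumptions for illustration. The idea is simply to poll each node until it accepts a TCP connection, and to report the stragglers rather than skip them.

```python
import socket
import time


def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_for_nodes(hosts, port=22, deadline_s=600, poll_s=10):
    """Poll until every host accepts connections or the deadline passes.

    Returns (ready, not_ready) so the caller can retry or report the
    stragglers instead of silently skipping them.
    """
    pending = set(hosts)
    ready = set()
    end = time.monotonic() + deadline_s
    while pending:
        for host in list(pending):
            if port_open(host, port):
                pending.discard(host)
                ready.add(host)
        if not pending or time.monotonic() >= end:
            break
        time.sleep(poll_s)
    return sorted(ready), sorted(pending)
```

With something like this, the setup step could run only against the `ready` list and then retry or flag whatever is left in `not_ready`. An open SSH port does not guarantee the node is fully booted, so this is only a first-pass check.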
My guess is that setting up a large cluster takes so long because of the use of rsync. What if, instead of using rsync, you synced to S3 and then used pdsh to pull everything down on all of the machines in parallel? This is a big deal for us, and if we can come up with a good plan, we might be able to help out with the required changes. Also, are there any suggestions on how to deal with some of the nodes not being ready when the process starts?

Thanks for your time,
Christian