I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree.
I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help maintain the resulting repository. Here is my justification for why we ought to do this split.

User Facing:
- The spark-ec2 launcher doesn't use anything in the parent spark repository.
- The spark-ec2 version is disjoint from the parent repo. I consider it confusing that the spark-ec2 script doesn't launch the version of spark it is checked out with.
- Someone interested in setting up spark-ec2 with anything but the default configuration will have to clone at least 2 repositories at present, and probably fork and push changes to 1.
- spark-ec2 has mismatched dependencies with respect to spark itself. This includes a confusing shim in the spark-ec2 script to install boto, which frankly should just be a dependency of the script.

Developer Facing:
- Support across 2 repos will be worse than across 1. It's unclear where to file issues/PRs, and it requires extra communication for even fairly trivial stuff.
- spark-ec2 also depends on a number of binary blobs being in the right place. Currently the responsibility for these is decentralized, and likely prone to various flavors of dumb.
- The current flow of booting a spark-ec2 cluster is _complicated_. I spent the better part of a couple of days figuring out how to integrate our custom tools into this stack. This is very hard to fix when commits/PRs need to span groups/repositories/buckets-o-binary. I am sure there are several other problems that are languishing under similar roadblocks.
- The split makes testing possible. The spark-ec2 script is a great case for CI given the number of permutations of launch criteria there are. I suspect AWS would be happy to foot the bill on spark-ec2 testing (probably ~20 bucks a month based on some envelope sketches), as it is a piece of software that directly impacts other people giving them money.
I have some contacts there, and I am pretty sure this would be an easy conversation, particularly if the repo is directly concerned with ec2. Think also of being able to assemble the binary blobs into an s3 bucket dedicated to spark-ec2.

Any other thoughts/voices appreciated here. spark-ec2 is a super-power tool and deserves a fair bit of attention!

--Matthew Goodman
=====================
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch