Here is a story about how shared storage simplifies things: at Douban, we use MooseFS [1] instead of HDFS as the distributed file system. It is POSIX compatible and can be mounted just like NFS.
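Because the mount is POSIX compatible, ordinary file APIs work on it without any special client library. Here is a minimal sketch of peeking at the first lines of a compressed log from Python. The helper name head_bz2 and the example path under /mfs are hypothetical, just for illustration, and not something from our real setup:

```python
import bz2
import sys
from itertools import islice


def head_bz2(path, n=10):
    """Return the first n lines of a bzip2-compressed text file."""
    # bz2.open streams the file, so only the beginning is decompressed.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python head_bz2.py /mfs/logs/access.log.bz2
    for line in head_bz2(sys.argv[1]):
        print(line)
```

The same script runs unchanged on any machine that has the file system mounted, since the path is just a normal path.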
We put all the data, tools, and code on it, so we can access them easily from all the machines, just like local disks. You can modify a file from anywhere and get the modified version from anywhere.

For example, say you want to know the fields in your compressed log files:

    $ bunzip2 -c path | head

Then you need to modify your code to deal with these fields in the logs:

    $ vim myjob.py

You have a bunch of libraries or modules in the project, but you do not need to worry about them when running distributed jobs. You just do:

    $ python myjob.py

If something goes wrong, you can modify myjob.py to save some RDDs to disk, then check the results:

    $ head path_to_result_of_rdd

Maybe the problem is in your library. Then fix it and run again:

    $ python myjob.py

Dump the result as a CSV file, then load it into MySQL:

    $ mysql xxx < path_of_the_result

In summary, shared storage can help a lot in a distributed environment, and a simple solution (such as NFS) is a natural way to solve these problems. Set up once, benefit forever.

PS: I'm also a contributor to MooseFS and have a fork at github.com/davies/moosefs/

PPS: I'm sorry for my poor English if any of the above sounds rude to you.

Davies

[1] http://moosefs.org/

On Fri, Sep 5, 2014 at 11:22 AM, Dimension Data, LLC. <subscripti...@didata.us> wrote:
>
> I'd have to agree with Marcelo and Andrew here...
>
> Favoring a simple Build-and-Run/Submit wrapper-script that leverages
> '--py-files file.zip' over adding another layer of complexity -- even if
> seemingly 'trivial' like NFS -- is probably a good approach (... b/c more
> technology is never 'trivial' over time). =:).
> Less is more.
>
> On 09/05/2014 01:58 PM, Marcelo Vanzin wrote:
> > On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu <dav...@databricks.com> wrote:
> > > In daily development, it's common to modify your projects and re-run
> > > the jobs. If using zip or egg to package your code, you need to do
> > > this every time after modification, I think it will be boring.
> >
> > That's why shell scripts were invented. :-)
> >
> > Probably a lot easier than setting up and maintaining shared storage
> > in a large cluster.