Here is a story about how shared storage simplifies things:

At Douban, we use MooseFS [1] instead of HDFS as the distributed file system;
it's POSIX-compatible and can be mounted just like NFS.

We put all the data, tools, and code on it, so we can access them easily on
all the machines, just like a local disk. You can modify a file from anywhere
and read the modified version from anywhere.
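
To make the examples below concrete, assume a layout like this on the
shared mount (these paths are made up for illustration, not how Douban
actually arranges things):

# Hypothetical layout on the MooseFS mount, identical on every machine:
#   /mfs/logs/access/2014-09-05.log.bz2        <- raw logs
#   /mfs/davies/projects/myproject/myjob.py    <- job script
#   /mfs/davies/projects/myproject/mylib.py    <- project library
# Because the mount point is the same everywhere, plain local file
# access works the same way on any box:
with open("/mfs/davies/projects/myproject/myjob.py") as f:
    print(f.read())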

For example, suppose you want to know the fields in your compressed log files:

$ bunzip2 -kc path | head

Then you modify your code to deal with those fields:

$ vim myjob.py
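
To make this concrete, here is a rough sketch of what such a myjob.py
could look like; the paths and the field layout are made up for
illustration (Spark's textFile can read the bzip2 logs from the shared
mount directly):

# myjob.py -- minimal PySpark sketch; paths and fields are hypothetical
from pyspark import SparkContext

sc = SparkContext(appName="myjob")

# same path as on your workstation, thanks to the shared mount
logs = sc.textFile("file:///mfs/logs/access/2014-09-05.log.bz2")

# suppose `head` showed tab-separated fields: timestamp, user_id, url, status
def parse(line):
    ts, uid, url, status = line.split("\t")[:4]
    return (url, 1)

counts = logs.map(parse).reduceByKey(lambda a, b: a + b)
print(counts.take(10))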

You may have a bunch of libraries or modules in the project, but you don't
need to worry about packaging them when running distributed jobs; you just do:

$ python myjob.py
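
One way this can work without any packaging (an assumption about the
setup, not necessarily how Douban wires it up): the project directory
lives on the shared mount at the same path on every machine, so the
executors' PYTHONPATH can simply point at it and plain imports resolve
on the workers too:

# sketch: using project modules from the shared mount without zip/egg
# (the path and the module name `mylib` are hypothetical)
import sys
from pyspark import SparkConf, SparkContext

PROJECT = "/mfs/davies/projects/myproject"
sys.path.insert(0, PROJECT)                 # driver side

conf = (SparkConf()
        .setAppName("myjob")
        # workers see the same mount, so the same path works there
        .set("spark.executorEnv.PYTHONPATH", PROJECT))
sc = SparkContext(conf=conf)

import mylib                                # no --py-files needed
rdd = sc.textFile("file:///mfs/logs/access/2014-09-05.log.bz2")
print(rdd.map(mylib.parse).take(5))         # mylib.parse: assumed log parser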

If something goes wrong, you can modify myjob.py to save some RDDs to
disk, then check the results:

$ head path_to_result_of_rdd
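
For example, continuing the sketch above (the output path is
hypothetical), an intermediate RDD can be written straight onto the
shared mount and inspected from any machine:

# sketch: dump an intermediate RDD onto the shared mount for inspection
parsed = logs.map(parse)        # `logs` and `parse` as in myjob.py above
parsed.saveAsTextFile("file:///mfs/davies/tmp/parsed_debug")
# then, from any box:
#   $ head /mfs/davies/tmp/parsed_debug/part-00000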

Maybe the problem is in one of your libraries; fix it and run again:

$ python myjob.py

Dump the result as a CSV file, then load it into MySQL:

$ mysql xxx < path_of_the_result
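
Continuing the sketch (table name, columns, and paths are all made up),
the result RDD can be written as CSV onto the shared mount, where the
part file is directly visible to the machine running the mysql client:

# sketch: dump `counts` from the myjob.py sketch above as CSV
csv = counts.map(lambda kv: "%s,%d" % (kv[0], kv[1]))
csv.coalesce(1).saveAsTextFile("file:///mfs/davies/tmp/url_counts_csv")
# one way to load the part file (assumed table `url_counts`):
#   LOAD DATA LOCAL INFILE '/mfs/davies/tmp/url_counts_csv/part-00000'
#   INTO TABLE url_counts FIELDS TERMINATED BY ',';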

In summary, shared storage can help a lot in a distributed environment, and a
simple solution (such as NFS) is a natural way to solve these problems.
Set it up once, benefit forever.

PS: I'm also a contributor to MooseFS, and have a fork at github.com/davies/moosefs/

PPS: I'm sorry for my poor English if the above sounds rude to you.

Davies

[1] http://moosefs.org/

On Fri, Sep 5, 2014 at 11:22 AM, Dimension Data, LLC.
<subscripti...@didata.us> wrote:
>
> I'd have to agree with Marcelo and Andrew here...
>
> Favoring a simple Build-and-Run/Submit wrapper-script that leverages 
> '--py-files file.zip'
> over adding another layer of complexity -- even if seemingly 'trivial' like 
> NFS -- is
> probably a good approach (... b/c more technology is never 'trivial' over 
> time). =:).
> Less is more.
>
>
>
> On 09/05/2014 01:58 PM, Marcelo Vanzin wrote:
>
> On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu <dav...@databricks.com> wrote:
>
> In daily development, it's common to modify your projects and re-run
> the jobs. If using zip or egg to package your code, you need to do
> this every time after modification, I think it will be boring.
>
> That's why shell scripts were invented. :-)
>
> Probably a lot easier than setting up and maintaining shared storage
> in a large cluster.
>
>
> --
>
> Sincerely yours,
> Team Dimension Data
> ________________________________
> Dimension Data, LLC. | www.didata.us
> P: 212.882.1276 |  subscripti...@didata.us
>
>
>
> Data Analytics you can literally count on.
>
