For third party libraries the simplest way is to use Puppet [1] or Chef [2]
or any similar automation tool to install packages (either from PIP [3] or
from the distribution's repository). It's easy because if you manage your
cluster's software you are most probably already using one of these
automation tools, and thus only need to write one more recipe to keep your
Python packages healthy.

For your own libraries it may be inconvenient to publish them to PyPI just
to deploy them to your servers. Instead, you can "attach" your Python lib
to the SparkContext via the "pyFiles" option or the "addPyFile" method. For
example, if you want to attach a single Python file, you can do the
following:

    conf = SparkConf(...) ...
    sc = SparkContext(...)
    sc.addPyFile("/path/to/yourmodule.py")
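
The "pyFiles" option mentioned above does the same thing at context
creation time. Just a sketch, reusing the "conf" object from the snippet:

    sc = SparkContext(conf=conf, pyFiles=["/path/to/yourmodule.py"])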

And for entire packages (in Python, a package is any directory with an
"__init__.py" and possibly several more "*.py" files) you can use a trick
and pack them into a zip archive. I used the following code for my library:

    import os
    import zipfile

    def ziplib():
        libpath = os.path.dirname(__file__)   # this should point to your package's directory
        # rand_str(6) is just a small helper of mine; any random filename
        # in a writable directory will do
        zippath = '/tmp/mylib-' + rand_str(6) + '.zip'
        zf = zipfile.PyZipFile(zippath, mode='w')
        try:
            zf.debug = 3                      # make it verbose, good for debugging
            zf.writepy(libpath)
            return zippath                    # return path to the generated zip archive
        finally:
            zf.close()

    ...
    zip_path = ziplib()       # generate zip archive containing your lib
    sc.addPyFile(zip_path)    # add the entire archive to SparkContext
    ...
    os.remove(zip_path)       # don't forget to remove the temporary file,
                              # preferably in a "finally" clause

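To be strict about the cleanup mentioned in the last comment, the removal
can go into a "finally" clause. A minimal sketch, reusing ziplib() from
above:

    zip_path = ziplib()
    try:
        sc.addPyFile(zip_path)
        # ... define and run your jobs here ...
    finally:
        os.remove(zip_path)   # temporary archive is removed even if a job fails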

Python has one nice feature: it can import code not only from modules
(simple "*.py" files) and packages, but also from a variety of other
formats, including zip archives. So when your distributed code says
something like "import mylib", Python finds the "mylib" package inside the
zip archive attached to the SparkContext and imports the required modules
from it.
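
For example, if the package directory you zipped is called "mylib", the
archive written by PyZipFile.writepy() roughly contains the compiled
package files, and you can import the package inside functions shipped to
the executors. The names below ("mylib", "utils", "normalize") are only
placeholders:

    # Rough layout of the generated archive:
    #   mylib-XXXXXX.zip
    #       mylib/__init__.pyc
    #       mylib/utils.pyc

    def clean(line):
        from mylib import utils        # resolved from the attached zip on each executor
        return utils.normalize(line)   # hypothetical function inside your package

    cleaned = sc.parallelize(["Some Text", "More Text"]).map(clean).collect()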

HTH,
Andrei

[1]: http://puppetlabs.com/
[2]: http://www.getchef.com/chef/
[3]: https://pypi.python.org/pypi/pip ; if you program in Python and still
don't use PIP, you should definitely give it a try


On Thu, Jun 5, 2014 at 5:29 PM, mrm <ma...@skimlinks.com> wrote:

> Hi,
>
> I am new to Spark (and almost-new in python!). How can I download and
> install a Python library in my cluster so I can just import it later?
>
> Any help would be much appreciated.
>
> Thanks!
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-tp7059.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
