For third-party libraries, the simplest way is to use Puppet [1], Chef [2], or any similar automation tool to install packages (either from PIP [3] or from your distribution's repository). This is easy because if you manage your cluster's software, you are most probably already using one of these automation tools, and thus only need to write one more recipe to keep your Python packages healthy.
For your own libraries it may be inconvenient to publish them to PyPI [3] just to deploy them to your server. Instead, you can "attach" your Python lib to the SparkContext via the "pyFiles" option or the "addPyFile" method. For example, to attach a single Python file:

    conf = SparkConf(...)
    ...
    sc = SparkContext(...)
    sc.addPyFile("/path/to/yourmodule.py")

For entire packages (in Python, a package is any directory with an "__init__.py" and possibly several more "*.py" files) you can use a trick and pack them into a zip archive. I used the following code for my library:

    import os
    import zipfile

    def ziplib():
        libpath = os.path.dirname(__file__)              # this should point to your package's directory
        zippath = '/tmp/mylib-' + rand_str(6) + '.zip'   # rand_str() is a helper of mine; any random
                                                         # filename in a writable directory will do
        zf = zipfile.PyZipFile(zippath, mode='w')
        try:
            zf.debug = 3          # making it verbose, good for debugging
            zf.writepy(libpath)
            return zippath        # return path to the generated zip archive
        finally:
            zf.close()

    ...
    zip_path = ziplib()       # generate a zip archive containing your lib
    sc.addPyFile(zip_path)    # add the entire archive to SparkContext
    ...
    os.remove(zip_path)       # don't forget to remove the temporary file,
                              # preferably in a "finally" clause

Python has one nice feature: it can import code not only from modules (simple "*.py" files) and packages, but also from a variety of other formats, including zip archives. So when your distributed code says something like "import mylib", Python finds the attached "mylib.zip" on its path and imports the required modules from it.

HTH,
Andrei

[1]: http://puppetlabs.com/
[2]: http://www.getchef.com/chef/
[3]: https://pypi.python.org/pypi/pip ; if you program in Python and still don't use PIP, you should definitely give it a try

On Thu, Jun 5, 2014 at 5:29 PM, mrm <ma...@skimlinks.com> wrote:

> Hi,
>
> I am new to Spark (and almost-new in python!). How can I download and
> install a Python library in my cluster so I can just import it later?
>
> Any help would be much appreciated.
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-tp7059.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
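P.S. If you want to convince yourself that the zip trick works before wiring it into Spark, here is a minimal, self-contained sketch in plain Python (no Spark needed; the package name "mylib" and its "greet" function are made up for illustration). Putting a zip built with PyZipFile on sys.path is essentially what addPyFile arranges on each worker, and a plain "import" then resolves into the archive:

    import os
    import sys
    import tempfile
    import zipfile

    # Build a tiny throwaway package just to demonstrate the mechanism.
    tmp = tempfile.mkdtemp()
    pkg = os.path.join(tmp, 'mylib')
    os.makedirs(pkg)
    with open(os.path.join(pkg, '__init__.py'), 'w') as f:
        f.write('def greet():\n    return "hello from zip"\n')

    # Pack the package into a zip archive, same as in ziplib() above.
    zippath = os.path.join(tmp, 'mylib.zip')
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.writepy(pkg)          # stores the compiled package inside the archive
    finally:
        zf.close()

    # This is roughly what sc.addPyFile() does for you on every worker:
    sys.path.insert(0, zippath)

    import mylib                 # resolved from inside mylib.zip
    print(mylib.greet())

Note that PyZipFile.writepy() requires the directory to actually be a package (i.e. contain "__init__.py"), otherwise it raises an error.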