You can add one more function call to install the libraries you want. If you look at the spark-ec2 script, there's a function named setup_cluster(..) which does all of the setup: <https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L625>. If you want to install a Python library (assuming pip is already installed on the nodes), you can add one more line to that function, like:

ssh(master, opts, "pip install pandas")

This will install it on the master node. The slave_nodes variable holds the details of all the slave machines, so you can iterate over it and run the same command on each slave, roughly as in the sketch below.
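Something along these lines should work inside setup_cluster(..). I'm assuming the script's own ssh(host, opts, command) helper, that pip is already on the AMI, and that each slave entry exposes its address via public_dns_name; double-check against how the rest of spark_ec2.py addresses the slaves:

# Rough sketch of an addition inside setup_cluster(..) in spark_ec2.py.
# Assumes pip is already installed on the AMI; the public_dns_name
# attribute on the slave entries is an assumption, check the script.
libs_to_install = ["pandas"]  # whatever libraries you need
pip_cmd = "pip install " + " ".join(libs_to_install)

# Install on the master first.
ssh(master, opts, pip_cmd)

# Then repeat the same command on every slave.
for slave in slave_nodes:
    ssh(slave.public_dns_name, opts, pip_cmd)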
Thanks
Best Regards

On Sun, Feb 8, 2015 at 2:16 PM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
> I want to install a couple of Python libraries (pip install
> python_library) which I want to use on a pyspark cluster that was built
> using the ec2 scripts.
> Is there a way to specify these libraries when I am building those ec2
> clusters?
> What's the best way to install these libraries on each ec2 node?
> Thanks