Hi Ashish,

Julian's approach is probably better, but a few observations:

1) Your SPARK_HOME should be C:\spark-1.3.0 (not C:\spark-1.3.0\bin).

2) If you have Anaconda Python installed (I saw that you had set this up in
a separate thread), py4j should be part of the package - at least I think
so. To test this, try the following in your Python REPL:
>>> from py4j.java_gateway import JavaGateway
If the import succeeds, you already have it.

3) In case Py4J is not installed, the best way to install a new package is
using easy_install or pip. Make sure your path is set up so that when you
call python you are calling the Anaconda version (in case you have multiple
Python versions installed), then run "easy_install py4j" or "pip install
py4j" - this will install py4j correctly without any messing around on your
part. Install instructions for py4j are available on their site:
http://py4j.sourceforge.net/install.html
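
A quick way to check from a script or notebook cell (just a sketch - the
except branch only prints a hint, it does not install anything for you):

try:
    from py4j.java_gateway import JavaGateway
    print("py4j is available")
except ImportError:
    print("py4j not found - run: pip install py4j (or easy_install py4j)")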

4) You should replace "python2" in your 00-pyspark-setup.py script with
"python", so it points to the $SPARK_HOME/python directory
(C:\spark-1.3.0\python).
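
Putting points 1 and 4 together, the setup block from my earlier mail
adapted for your Windows paths would look something like this (just a
sketch assuming Spark lives at C:\spark-1.3.0 - adjust if your install
differs):

import os
import sys

# Configure the environment - point SPARK_HOME at the root of the Spark
# install, not at its bin subdirectory
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = "C:/spark-1.3.0"

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

# Add the PySpark/py4j bindings to the Python path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))

With that in place, "from pyspark import SparkContext" should work in your
notebook.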

-sujit


On Thu, Jul 9, 2015 at 8:26 PM, Ashish Dutt <ashish.du...@gmail.com> wrote:

> Hello Sujit,
> Many thanks for your response.
> To answer your questions;
> Q1) Do you have SPARK_HOME set up in your environment?- Yes, I do. It is
> SPARK_HOME="C:/spark-1.3.0/bin"
> Q2) Is there a python2 or python subdirectory under the root of your
> Spark installation? - Yes, I do have that too. It is called python. To fix
> this problem, this is what I did:
> I downloaded py4j-0.8.2.1-src from here
> <https://pypi.python.org/pypi/py4j> which was not there initially when I
> downloaded the spark package from the official repository. I then put it in
> the lib directory as C:\spark-1.3.0\python\lib. Note I did not extract the
> zip file; I put it in as it is, in the pyspark folder of the spark-1.3.0
> root folder. What I did next was copy this file and put it on the
> PYTHONPATH, so my Python path now reads as PYTHONPATH="C:/Python27/"
>
> I then rebooted the computer and said a silent prayer :-) Then I opened the
> command prompt, invoked the command pyspark from the bin directory of
> spark and EUREKA, it worked :-) Attached is the screenshot for the same.
> Now, the problem is with IPython notebook. I cannot get it to work with
> pySpark.
> I have a cluster with 4 nodes using CDH5.4
>
> I was able to resolve the problem. Now the next challenge was to configure
> it with IPython. I followed the steps as documented in the blog, and I get
> errors; attached is the screenshot.
>
> @Julian, I tried your method too. Attached is the screenshot of the error
> message 7.png
>
> Hope you can help me out to fix this problem.
> Thank you for your time.
>
> Sincerely,
> Ashish Dutt
> PhD Candidate
> Department of Information Systems
> University of Malaya, Lembah Pantai,
> 50603 Kuala Lumpur, Malaysia
>
> On Fri, Jul 10, 2015 at 12:02 AM, Sujit Pal <sujitatgt...@gmail.com>
> wrote:
>
>> Hi Ashish,
>>
>> Your 00-pyspark-setup file looks very different from mine (and from the
>> one described in the blog post). Questions:
>>
>> 1) Do you have SPARK_HOME set up in your environment? Because if not, it
>> sets it to None in your code. You should provide the path to your Spark
>> installation. In my case I have spark-1.3.1 installed under $HOME/Software
>> and the code block under "# Configure the environment" (or yellow highlight
>> in the code below) reflects that.
>> 2) Is there a python2 or python subdirectory under the root of your Spark
>> installation? In my case it's "python", not "python2". This contains the
>> Python bindings for spark, so the block under "# Add the PySpark/py4j to
>> the Python path" (or green highlight in the code below) adds it to the
>> Python sys.path so things like pyspark.SparkContext are accessible in your
>> Python environment.
>>
>> import os
>> import sys
>>
>> # Configure the environment
>> if 'SPARK_HOME' not in os.environ:
>>     os.environ['SPARK_HOME'] = "/Users/palsujit/Software/spark-1.3.1"
>>
>> # Create a variable for our root path
>> SPARK_HOME = os.environ['SPARK_HOME']
>>
>> # Add the PySpark/py4j to the Python Path
>> sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
>> sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
>>
>> Hope this fixes things for you.
>>
>> -sujit
>>
>>
>> On Wed, Jul 8, 2015 at 9:52 PM, Ashish Dutt <ashish.du...@gmail.com>
>> wrote:
>>
>>> Hi Sujit,
>>> Thanks for your response.
>>>
>>> So I opened a new notebook using the command ipython notebook --profile
>>> spark and tried the sequence of commands, but I am getting errors. Attached
>>> is the screenshot of the same.
>>> Also I am attaching the 00-pyspark-setup.py for your reference. Looks
>>> like I have written something wrong here, but I cannot seem to figure out
>>> what it is.
>>>
>>> Thank you for your help
>>>
>>>
>>> Sincerely,
>>> Ashish Dutt
>>>
>>> On Thu, Jul 9, 2015 at 11:53 AM, Sujit Pal <sujitatgt...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> >> Nice post.
>>>> Agreed, kudos to the author of the post, Benjamin Benfort of District
>>>> Labs.
>>>>
>>>> >> Following your post, I get this problem;
>>>> Again, not my post.
>>>>
>>>> I did try setting up IPython with the Spark profile for the edX Intro
>>>> to Spark course (because I didn't want to use the Vagrant container) and it
>>>> worked flawlessly with the instructions provided (on OSX). I haven't used
>>>> the IPython/PySpark environment beyond very basic tasks since then though,
>>>> because my employer has a Databricks license which we were already using
>>>> for other stuff and we ended up doing the labs on Databricks.
>>>>
>>>> Looking at your screenshot though, I don't see why you think it's
>>>> picking up the default profile. One simple way of checking to see if things
>>>> are working is to open a new notebook and try this sequence of commands:
>>>>
>>>> from pyspark import SparkContext
>>>> sc = SparkContext("local", "pyspark")
>>>> sc
>>>>
>>>> You should see something like this after a little while:
>>>> <pyspark.context.SparkContext at 0x1093c9b10>
>>>>
>>>> While the context is being instantiated, you should also see lots of
>>>> log lines scroll by on the terminal where you started the "ipython notebook
>>>> --profile spark" command - these log lines are from Spark.
>>>>
>>>> Hope this helps,
>>>> Sujit
>>>>
>>>>
>>>> On Wed, Jul 8, 2015 at 6:04 PM, Ashish Dutt <ashish.du...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Sujit,
>>>>> Nice post. Exactly what I had been looking for.
>>>>> I am relatively new to Spark and real-time data processing.
>>>>> We have a server with CDH5.4 and 4 nodes. The Spark version on our
>>>>> server is 1.3.0.
>>>>> On my laptop I have Spark 1.3.0 too, in a Windows 7
>>>>> environment. As per point 5 of your post, I am able to invoke pyspark
>>>>> locally in standalone mode.
>>>>>
>>>>> Following your post, I get this problem;
>>>>>
>>>>> 1. In the section "Using IPython notebook with Spark" I cannot understand
>>>>> why it is picking up the default profile and not the pyspark profile. I am
>>>>> sure it is because of the path variables. Attached is the screenshot. Can
>>>>> you suggest how to solve this?
>>>>>
>>>>> Currently the path variables on my laptop are:
>>>>> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
>>>>> FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", 
>>>>> M2_HOME="D:\MAVEN\BIN",
>>>>> MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"
>>>>>
>>>>>
>>>>> Sincerely,
>>>>> Ashish Dutt
>>>>> PhD Candidate
>>>>> Department of Information Systems
>>>>> University of Malaya, Lembah Pantai,
>>>>> 50603 Kuala Lumpur, Malaysia
>>>>>
>>>>> On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal <sujitatgt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You are welcome, Davies. Just to clarify, I didn't write the post (not
>>>>>> sure if my earlier post gave that impression - apologies if so), although I
>>>>>> agree it's great :-).
>>>>>>
>>>>>> -sujit
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu <dav...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Great post, thanks for sharing with us!
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal <sujitatgt...@gmail.com>
>>>>>>> wrote:
>>>>>>> > Hi Julian,
>>>>>>> >
>>>>>>> > I recently built a Python+Spark application to do search relevance
>>>>>>> > analytics. I use spark-submit to submit PySpark jobs to a Spark
>>>>>>> cluster on
>>>>>>> > EC2 (so I don't use the PySpark shell, hopefully that's what you
>>>>>>> are looking
>>>>>>> > for). Can't share the code, but the basic approach is covered in
>>>>>>> this blog
>>>>>>> > post - scroll down to the section "Writing a Spark Application".
>>>>>>> >
>>>>>>> >
>>>>>>> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>>>>>>> >
>>>>>>> > Hope this helps,
>>>>>>> >
>>>>>>> > -sujit
>>>>>>> >
>>>>>>> >
>>>>>>> > On Wed, Jul 8, 2015 at 7:46 AM, Julian <julian+sp...@magnetic.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> Hey.
>>>>>>> >>
>>>>>>> >> Is there a resource that has written up what the necessary steps
>>>>>>> are for
>>>>>>> >> running PySpark without using the PySpark shell?
>>>>>>> >>
>>>>>>> >> I can reverse engineer (by following the tracebacks and reading
>>>>>>> the shell
>>>>>>> >> source) what the relevant Java imports needed are, but I would
>>>>>>> assume
>>>>>>> >> someone has attempted this before and just published something I
>>>>>>> can
>>>>>>> >> either
>>>>>>> >> follow or install? If not, I have something that pretty much
>>>>>>> works and can
>>>>>>> >> publish it, but I'm not a heavy Spark user, so there may be some
>>>>>>> things
>>>>>>> >> I've
>>>>>>> >> left out that I haven't hit because of how little of pyspark I'm
>>>>>>> playing
>>>>>>> >> with.
>>>>>>> >>
>>>>>>> >> Thanks,
>>>>>>> >> Julian
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> View this message in context:
>>>>>>> >>
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
>>>>>>> >> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>> >>
>>>>>>> >>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>> >>
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
