Hi Alton:

Thanks for the reply. I just wanted to build/use it from scratch to get a better intuition of what's happening.

Btw, using the binaries provided by Cloudera/CDH5 yielded the same issue as my compiled version (i.e. it, too,
tried to access the HDFS Name Node, with the exact same error).

However, a small breakthrough. Just now I tinkered some more and found that this variation works:

   REPLACE THIS: >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
   WITH THIS:    >>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')

That is, use 'file:///'.

I don't know if that is the correct way of specifying the URI for local files, or whether this just *happens to work*.
The documents that I've read thus far haven't shown it specified that way, but I still have more to read.   =:)

Thank you,
~NMV


On 04/10/2014 04:20 PM, Alton Alexander wrote:
I am doing the exact same thing for the purpose of learning. I also
don't have a hadoop cluster and plan to scale on ec2 as soon as I get
it working locally.

I am having good success just using the binaries and not compiling from source...
Is there a reason why you aren't just using the binaries?

On Thu, Apr 10, 2014 at 1:30 PM, DiData <subscripti...@didata.us> wrote:
Hello friends:

I recently compiled and installed Spark v0.9 from the Apache distribution.

Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the entire big-data suite
from CDH is installed), but for the moment I'm using my manually built Apache Spark for 'ground-up'
learning purposes.

Now, prior to compilation (i.e. 'sbt/sbt clean compile') I specified the
following:

       export SPARK_YARN=true
       export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0

The resulting examples ran fine locally as well as on YARN.

I'm not interested in YARN here; I'm just mentioning it for completeness in case it matters to my
upcoming question. Here is my issue / question:

I start pyspark locally -- on one machine, for API learning purposes -- as shown below, and attempt to
interact with a local text file (not in HDFS). Unfortunately, the SparkContext (sc) tries to connect to
an HDFS Name Node (which I don't currently have enabled because I don't need it).

The SparkContext cleverly inspects the configurations in my '/etc/hadoop/conf/' directory to learn
where my Name Node is; however, I don't want it to do that in this case. I just want to run a
one-machine, local version of 'pyspark'.
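
(Presumably it's picking that up from something like the fs.defaultFS entry in
/etc/hadoop/conf/core-site.xml -- illustrative snippet only, though the host/port do match the error below:

       <property>
         <name>fs.defaultFS</name>
         <value>hdfs://namenode:8020</value>
       </property>

so a bare path like '/home/user/...' gets resolved against HDFS.)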

Did I miss something in my invocation/use of 'pyspark' below? Do I need to
add something else?

(Btw: I searched but could not find any solutions, and the documentation,
while good, doesn't
quite get me there).

See below, and thank you all in advance!


user$ export PYSPARK_PYTHON=/usr/bin/bpython
user$ export MASTER=local[8]
user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
   # ===========================================================================================
   >>> sc
   <pyspark.context.SparkContext object at 0x24f0f50>
   >>>
   >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
   >>> distData.count()
   [ ... snip ... ]
   Py4JJavaError: An error occurred while calling o21.collect.
   : java.net.ConnectException: Call From server01/192.168.0.15 to namenode:8020 failed on connection exception:
     java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
   [ ... snip ... ]
   >>>
   >>>
   # ===========================================================================================

--
Sincerely,
DiData
