RE: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

David Newberger Tue, 31 May 2016 10:41:16 -0700

Is https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt the build.sbt you are using?


David Newberger
QA Analyst
WAND  -  The Future of Restaurant Technology
(W)  www.wandcorp.com
(E)   [email protected]
(P)   952.361.6200

From: Alonso [mailto:[email protected]]
Sent: Tuesday, May 31, 2016 11:11 AM
To: [email protected]
Subject: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image


I have a VMware Cloudera image, CDH 5.7, running CentOS 6.8. I use OS X as my 
development machine and the CDH image to run the code, which I upload to the 
image using git. I have modified the /etc/hosts file in the CDH image with a 
line like this:

127.0.0.1       quickstart.cloudera     quickstart      localhost       localhost.domain

192.168.30.138       quickstart.cloudera     quickstart      localhost       localhost.domain

The Cloudera version that I am running is:

[cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties

# Autogenerated build properties
version=2.6.0-cdh5.7.0
git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
cloudera.base-branch=cdh5-base-2.6.0
cloudera.build-branch=cdh5-2.6.0_5.7.0
cloudera.pkg.version=2.6.0+cdh5.7.0+1280
cloudera.pkg.release=1.cdh5.7.0.p0.92
cloudera.cdh.release=cdh5.7.0
cloudera.build.time=2016.03.23-18:30:29GMT

I can list the file with ls from within the VMware machine:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv

-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv

I can read its content:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l

568454

The code is quite simple; it just tries to map the file's content:

val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

case class AmazonRating(userId: String, productId: String, rating: Double)

val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20

println("Using this ratingFile: " + ratingFile)

// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}

// only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()

println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")

When I run this programmatically, I get this message:

Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454

However, if I run the exact same code within the spark-shell, I get this message:

Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454

Why does it work fine within the spark-shell but not when run 
programmatically in the VMware image?
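
One way to narrow this down is to exercise the parsing and filtering logic on a plain Scala collection, with no Spark involved, so that the logic itself can be ruled in or out. The following is only a sketch under assumptions: the object name RatingFilterSketch and the helpers parseLine and keepActiveUsers are hypothetical, the sample lines are made up, and malformed lines are skipped rather than raising a MatchError as the original pattern match would.

```scala
import scala.util.Try

object RatingFilterSketch {
  case class AmazonRating(userId: String, productId: String, rating: Double)

  // Hypothetical helper: returns None for malformed lines instead of throwing.
  def parseLine(line: String): Option[AmazonRating] =
    line.split(",") match {
      case Array(userId, productId, scoreStr) =>
        Try(scoreStr.trim.toDouble).toOption.map(AmazonRating(userId, productId, _))
      case _ => None
    }

  // Keeps the ratings of users whose rating count lies in [min, max),
  // the same bounds as the filter in the original code.
  def keepActiveUsers(ratings: Seq[AmazonRating], min: Int, max: Int): Seq[AmazonRating] =
    ratings.groupBy(_.userId).values
      .filter(rs => min <= rs.size && rs.size < max)
      .flatten
      .toSeq

  def main(args: Array[String]): Unit = {
    // Made-up sample data, not taken from ratings.csv.
    val lines = Seq("u1,p1,5.0", "u1,p2,4.0", "u2,p1,3.0", "not a rating")
    val parsed = lines.flatMap(parseLine)
    val kept = keepActiveUsers(parsed, min = 2, max = 3)
    println(s"Parsed ${parsed.size} of ${lines.size} lines, kept ${kept.size} ratings")
  }
}
```

If the same functions give the expected counts on a local sample of ratings.csv, the difference between the spark-shell and the packaged application is more likely to lie in how the job is set up than in the filter itself.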

I am running the code using the sbt-pack plugin to generate Unix commands, 
which I then run within the VMware image that hosts the Spark pseudo-cluster.

This is the code I use to instantiate the SparkConf:

val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
  .setMaster("local[4]").set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sparkConf, Seconds(2))

//this checkpointdir should be in a conf file, for now it is hardcoded!
val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
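
Passing a SparkConf to the StreamingContext constructor makes Spark create a second SparkContext, which is presumably why spark.driver.allowMultipleContexts is set above. A minimal sketch of the alternative, assuming Spark 1.6 with spark-core, spark-sql and spark-streaming on the classpath, builds the StreamingContext from the existing SparkContext instead. The object name SingleContextSketch is hypothetical, and whether this has any bearing on the "Kept 0" result is not established here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SingleContextSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("AmazonKafkaConnector")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // Reuse the existing SparkContext rather than passing sparkConf again,
    // so only one SparkContext lives in the JVM.
    val ssc = new StreamingContext(sc, Seconds(2))

    println(s"Streaming shares the batch context: ${ssc.sparkContext eq sc}")
    sc.stop()
  }
}
```

With a single context, the allowMultipleContexts setting should no longer be needed; checkpointing can be configured on ssc exactly as before.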

I have also tried setting the Spark master with the fully qualified domain 
name, but that raises an exception. I suspect this is symptomatic of my 
problem:

.setMaster("spark://quickstart.cloudera:7077")

The exception is:



java.io.IOException: Failed to connect to quickstart.cloudera/127.0.0.1:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: quickstart.cloudera/127.0.0.1:7077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)

I can ping quickstart.cloudera from the Cloudera terminal, so why can't I use 
.setMaster("spark://quickstart.cloudera:7077") instead of 
.setMaster("local[*]")?

[cloudera@quickstart bin]$ ping quickstart.cloudera
PING quickstart.cloudera (127.0.0.1) 56(84) bytes of data.
64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=1 ttl=64 time=0.019 ms
64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=2 ttl=64 time=0.026 ms
64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=4 ttl=64 time=0.028 ms
64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=5 ttl=64 time=0.026 ms
64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=6 ttl=64 time=0.020 ms

And port 7077 is listening for incoming connections:

[cloudera@quickstart bin]$ netstat -nap | grep 7077
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 192.168.30.138:7077         0.0.0.0:*                   LISTEN





[cloudera@quickstart bin]$ ping 192.168.30.138
PING 192.168.30.138 (192.168.30.138) 56(84) bytes of data.
64 bytes from 192.168.30.138: icmp_seq=1 ttl=64 time=0.023 ms
64 bytes from 192.168.30.138: icmp_seq=2 ttl=64 time=0.026 ms
64 bytes from 192.168.30.138: icmp_seq=3 ttl=64 time=0.028 ms
^C
--- 192.168.30.138 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2810ms
rtt min/avg/max/mdev = 0.023/0.025/0.028/0.006 ms

[cloudera@quickstart bin]$ ifconfig
eth2      Link encap:Ethernet  HWaddr 00:0C:29:6F:80:D2
          inet addr:192.168.30.138  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8612 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8493 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2917515 (2.7 MiB)  TX bytes:849750 (829.8 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:57534 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57534 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:44440656 (42.3 MiB)  TX bytes:44440656 (42.3 MiB)

I think this must be a misconfiguration in one of the Cloudera configuration 
files, but which one?
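
The stack trace shows quickstart.cloudera resolving to 127.0.0.1, while the netstat output shows port 7077 listening on 192.168.30.138 only. A small JVM-level check along these lines can confirm what the driver actually sees. It is a sketch, not a diagnosis: the object name MasterReachabilitySketch and the helpers resolve and canConnect are hypothetical, and the two addresses are simply the ones that appear in the output above.

```scala
import java.net.{InetAddress, InetSocketAddress, Socket}
import scala.util.Try

object MasterReachabilitySketch {
  // Resolves a host name through the JVM, as the Spark driver would.
  def resolve(host: String): Option[String] =
    Try(InetAddress.getByName(host).getHostAddress).toOption

  // Returns true if a TCP connection to address:port succeeds within timeoutMs.
  def canConnect(address: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(address, port), timeoutMs)
      true
    } catch {
      case _: java.io.IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val host = "quickstart.cloudera" // host name from the master URL above
    val port = 7077
    println(s"$host resolves to ${resolve(host).getOrElse("<unresolved>")}")
    // Addresses taken from the ping and netstat output above.
    for (address <- Seq("127.0.0.1", "192.168.30.138"))
      println(s"$address:$port reachable: ${canConnect(address, port)}")
  }
}
```

If the name resolves to 127.0.0.1 and only 192.168.30.138:7077 is reachable, the connection refusal would be expected. Two things to try in that case are pointing the quickstart.cloudera entry in /etc/hosts at 192.168.30.138, or using the IP address directly in the master URL.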

Thank you very much for reading this far.
Alonso Isidoro Roman
about.me/alonso.isidoro.roman


