Looks like just the worker and master processes are running:

[hivedata@hivecluster2 ~]$ jps
10425 Jps

[hivedata@hivecluster2 ~]$ ps aux|grep spark
hivedata 10424 0.0 0.0 103248 820 pts/3 S+ 10:05 0:00 grep spark
root 10918 0.5 1.4 4752880 230512 ? Sl May27 41:43 java -cp :/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/conf:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/core/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/repl/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/examples/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/bagel/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/mllib/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/streaming/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-mapreduce/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/scala-library.jar:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/scala-compiler.jar:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/jline.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path=/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip hivecluster2 --port 7077 --webui-port 18080
root 12715 0.0 0.0 148028 656 ? S May27 0:00 sudo /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://hivecluster2:7077
root 12716 0.3 1.1 4155884 191340 ? Sl May27 30:21 java -cp :/opt/cloudera/parcels/SPARK/lib/spark/conf:/opt/cloudera/parcels/SPARK/lib/spark/core/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/repl/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/examples/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/bagel/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/mllib/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/streaming/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-mapreduce/*:/opt/cloudera/parcels/SPARK/lib/spark/lib/scala-library.jar:/opt/cloudera/parcels/SPARK/lib/spark/lib/scala-compiler.jar:/opt/cloudera/parcels/SPARK/lib/spark/lib/jline.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path=/opt/cloudera/parcels/SPARK/lib/spark/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://hivecluster2:7077

On Sun, Jun 1, 2014 at 7:41 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Sounds like you have two shells running, and the first one is taking all your resources. Do a "jps" and kill the other guy, then try again.
>
> By the way, you can look at http://localhost:8080 (replace localhost with the server your Spark Master is running on) to see what applications are currently started, and what resource allocations they have.
>
> On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
>>
>> This time I get a port already in use exception on 4040, but it isn't fatal.
>> Then when I run rdd.first, I get this over and over:
>>
>> 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>
>> On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> You can avoid that by using the constructor that takes a SparkConf, a la
>>>
>>> val conf = new SparkConf()
>>> conf.setJars("avro.jar", ...)
>>> val sc = new SparkContext(conf)
>>>
>>> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>
>>>> Followup question: the docs to make a new SparkContext require that I know where $SPARK_HOME is. However, I have no idea. Any idea where that might be?
>>>>
>>>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>
>>>>> Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext with all necessary jars passed in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is a "necessary jar", but it's possible your application also needs to distribute other ones to the cluster.
>>>>>
>>>>> An easy way to make sure all your dependencies get shipped to the cluster is to create an assembly jar of your application, and then you just need to tell Spark about that jar, which includes all your application's transitive dependencies. Maven and sbt both have pretty straightforward ways of producing assembly jars.
>>>>>
>>>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the fast reply.
>>>>>>
>>>>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode.
>>>>>>
>>>>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>>>
>>>>>>> First issue was because your cluster was configured incorrectly. You could probably read 1 file because that was done on the driver node, but when it tried to run a job on the cluster, it failed.
>>>>>>>
>>>>>>> Second issue, it seems that the jar containing avro is not getting propagated to the Executors. What version of Spark are you running on? What deployment mode (YARN, standalone, Mesos)?
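For reference, a minimal sketch of the SparkConf-based setup Aaron describes above, written against the Spark 0.9 API: the master URL comes from the ps output at the top of this message, while the app name, jar paths, and resource caps are placeholders to adapt. Capping cores and executor memory is one way to keep a shell from starving other applications (the "Initial job has not accepted any resources" warning), and note that setJars in 0.9 appears to take a Seq of jar paths rather than bare strings.

import org.apache.spark.{SparkConf, SparkContext}

// Standalone master taken from the worker/master processes shown earlier in this thread.
// The jar paths and resource settings below are placeholders, not values from the thread.
val conf = new SparkConf()
  .setMaster("spark://hivecluster2:7077")
  .setAppName("avro-rdd-test")
  .setJars(Seq(
    "/path/to/avro-1.7.4.jar",            // the Avro classes the executors are missing
    "/path/to/avro-mapred-1.7.4.jar",
    "/path/to/my-app-assembly.jar"))      // or a single assembly jar with everything
  .set("spark.executor.memory", "512m")   // keep the footprint small so a second shell
  .set("spark.cores.max", "4")            // can still get cores from the standalone master

val sc = new SparkContext(conf)

With cores and executor memory capped, a second shell or job can register with the standalone master alongside an existing one instead of waiting indefinitely for resources.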
>>>>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>>>
>>>>>>> Now I get this:
>>>>>>>
>>>>>>> scala> rdd.first
>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
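A footnote on the assembly-jar suggestion further up the thread: below is a rough sbt sketch of what such a build might look like for an application like this one. It assumes sbt 0.13 with the sbt-assembly plugin; the plugin and library versions, project name, and dependency list are illustrative guesses, not details from the thread.

// project/plugins.sbt -- pull in the sbt-assembly plugin (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt -- produce one fat jar that bundles Avro with the application code
import AssemblyKeys._

assemblySettings

name := "spark-avro-app"

version := "0.1"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  // Spark itself is already on the cluster via the CDH parcel, so mark it provided
  "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided",
  // Avro classes that need to reach the executors
  "org.apache.avro" % "avro" % "1.7.4",
  "org.apache.avro" % "avro-mapred" % "1.7.4"
)

jarName in assembly := "spark-avro-app-assembly.jar"

Running sbt assembly should then leave a single jar under target/scala-2.10/ that can be handed to setJars(), so the executors see Avro and the application classes in one artifact.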