Sounds like you have two shells running, and the first one is taking all your resources. Do a "jps" and kill the other guy, then try again.
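If you actually want two applications up at the same time, the other option is to cap what each one asks for so the second can still get executors. Rough sketch for a driver program, assuming Spark 0.9 in standalone mode -- the master URL, app name, and values are placeholders for your cluster:

  import org.apache.spark.{SparkConf, SparkContext}

  // Cap this application's share of the standalone cluster so another
  // application can still register executors. Values are placeholders.
  val conf = new SparkConf()
    .setMaster("spark://your-master:7077")   // placeholder master URL
    .setAppName("second-app")                // placeholder name
    .set("spark.cores.max", "4")             // total cores this app may take across the cluster
    .set("spark.executor.memory", "2g")      // per-executor memory, leaving room for the other app
  val sc = new SparkContext(conf)

(In the 0.9 shell, I believe the same two properties can be passed as Java system properties, e.g. via SPARK_JAVA_OPTS, before it starts.)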
By the way, you can look at http://localhost:8080 (replace localhost with the server your Spark Master is running on) to see which applications are currently running and what resource allocations they have. (I've also put a consolidated sketch of the SparkConf + setJars approach discussed below at the end of this mail, after the quoted thread.)

On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney <russell.jur...@gmail.com> wrote:

> Thanks again. Run results here:
> https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
>
> This time I get a port already in use exception on 4040, but it isn't
> fatal. Then when I run rdd.first, I get this over and over:
>
> 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
>
>
> On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> You can avoid that by using the constructor that takes a SparkConf, a la
>>
>> val conf = new SparkConf()
>> conf.setJars(Seq("avro.jar", ...))
>> val sc = new SparkContext(conf)
>>
>>
>> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Followup question: the docs to make a new SparkContext require that I
>>> know where $SPARK_HOME is. However, I have no idea. Any idea where that
>>> might be?
>>>
>>>
>>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>
>>>> Gotcha. The easiest way to get your dependencies to your Executors
>>>> would probably be to construct your SparkContext with all necessary jars
>>>> passed in (as the "jars" parameter), or inside a SparkConf with setJars().
>>>> Avro is a "necessary jar", but it's possible your application also needs to
>>>> distribute other ones to the cluster.
>>>>
>>>> An easy way to make sure all your dependencies get shipped to the
>>>> cluster is to create an assembly jar of your application, and then you just
>>>> need to tell Spark about that jar, which includes all your application's
>>>> transitive dependencies. Maven and sbt both have pretty straightforward
>>>> ways of producing assembly jars.
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>
>>>>> Thanks for the fast reply.
>>>>>
>>>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
>>>>> standalone mode.
>>>>>
>>>>>
>>>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>>
>>>>>> First issue was because your cluster was configured incorrectly. You
>>>>>> could probably read 1 file because that was done on the driver node, but
>>>>>> when it tried to run a job on the cluster, it failed.
>>>>>>
>>>>>> Second issue: it seems that the jar containing avro is not getting
>>>>>> propagated to the Executors. What version of Spark are you running? What
>>>>>> deployment mode (YARN, standalone, Mesos)?
>>>>>>
>>>>>>
>>>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>>
>>>>>> Now I get this:
>>>>>>
>>>>>> scala> rdd.first
>>>>>>
>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>>>>
>>>>> --
>>>>> Russell Jurney  twitter.com/rjurney  russell.jur...@gmail.com  datasyndrome.com
>>>
>>> --
>>> Russell Jurney  twitter.com/rjurney  russell.jur...@gmail.com  datasyndrome.com
>
> --
> Russell Jurney  twitter.com/rjurney  russell.jur...@gmail.com  datasyndrome.com
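P.S. Putting the suggestions from the thread above together (per the note at the top of this mail), something like the following should get the avro jar out to the executors. This is only a sketch -- the master URL, app name, and jar path are placeholders, and the assembly ("fat") jar is whatever "sbt assembly" or Maven's shade/assembly plugin produces for your application:

  import org.apache.spark.{SparkConf, SparkContext}

  // Ship your application's assembly jar, which bundles avro and the rest of
  // your transitive dependencies, to every executor.
  val conf = new SparkConf()
    .setMaster("spark://your-master:7077")            // placeholder master URL
    .setAppName("avro-read-test")                     // placeholder name
    .setJars(Seq("/path/to/your-app-assembly.jar"))   // placeholder path
  val sc = new SparkContext(conf)

  // Equivalent older-style constructor, if you'd rather pass the jars directly:
  // val sc = new SparkContext("spark://your-master:7077", "avro-read-test",
  //                           "/path/to/spark/home", Seq("/path/to/your-app-assembly.jar"))

Setting the jars on the SparkConf means any context you build from that conf ships the same jars, without repeating them in every constructor call.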