Gotcha. The easiest way to get your dependencies onto your executors is
probably to construct your SparkContext with all the necessary jars passed
in (the "jars" parameter), or to set them on a SparkConf with setJars().
Avro is one such "necessary jar", but your application may well need to
distribute others to the cluster too.
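
In Spark 0.9 both routes look roughly like this (a minimal sketch; the
master URL, Spark home, app name, and jar paths are placeholders, and
avro-mapred is only a guess at what your job might need):

    import org.apache.spark.{SparkConf, SparkContext}

    // Jars to ship to every executor (paths are hypothetical).
    val jars = Seq(
      "/path/to/avro-1.7.4.jar",
      "/path/to/avro-mapred-1.7.4.jar",
      "/path/to/your-app.jar"
    )

    // Option 1: pass the jars straight to the SparkContext constructor.
    val sc = new SparkContext(
      "spark://master:7077", // standalone master URL (placeholder)
      "AvroReader",          // app name (placeholder)
      "/opt/spark",          // Spark home on the workers (placeholder)
      jars
    )

    // Option 2: set them on a SparkConf instead.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("AvroReader")
      .setJars(jars)
    val sc2 = new SparkContext(conf)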

An easy way to make sure all your dependencies get shipped to the cluster
is to build an assembly jar of your application: since it bundles all of
your application's transitive dependencies, you only have to tell Spark
about that one jar. Both Maven and sbt have straightforward ways of
producing assembly jars.
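
With sbt-assembly, for instance, it's just a plugin line plus the default
settings (the versions below are assumptions; Spark is marked "provided"
because the cluster already supplies it, so it stays out of the assembly):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt
    import AssemblyKeys._

    assemblySettings

    name := "avro-reader"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "0.9.0-incubating" % "provided",
      "org.apache.avro"  %  "avro"        % "1.7.4",
      "org.apache.avro"  %  "avro-mapred" % "1.7.4"
    )

Running "sbt assembly" then produces a single fat jar under target/ that
you can hand to setJars() (or, for the shell, to the ADD_JARS environment
variable).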


On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com>
wrote:

> Thanks for the fast reply.
>
> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
> standalone mode.
>
>
> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> The first issue was that your cluster was configured incorrectly. You
>> could probably read one file because that read happened on the driver
>> node, but the job failed as soon as it actually ran on the cluster.
>>
>> For the second issue, it seems the jar containing Avro is not getting
>> propagated to the executors. What version of Spark are you running, and
>> in what deployment mode (YARN, standalone, Mesos)?
>>
>>
>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>> Now I get this:
>>
>> scala> rdd.first
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>> <console>:41
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
>> <console>:41) with 1 output partitions (allowLocal=true)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
>> (first at <console>:41)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
>> partition locally
>>
>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
>> hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
>> <console>:41, took 0.037371256 s
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>> <console>:41
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
>> <console>:41) with 16 output partitions (allowLocal=true)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
>> (first at <console>:41)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
>> (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
>> tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
>> with 16 tasks
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
>> TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as
>> 1294 bytes in 1 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
>> TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
>> TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as
>> 1294 bytes in 1 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
>> TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
>> TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as
>> TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as
>> TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as
>> TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as
>> TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as
>> TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10
>> as 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as
>> TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14
>> as 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as
>> TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as
>> TID 104 on executor 4: hivecluster4 (N
>>
>>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>
