Spark Application Runtime Measurement

2016-07-09 Thread Fei Hu
Dear all, I have a question about how to measure the runtime for a Spark application. Here is an example: - On the Spark UI: the total duration is 2.0 minutes = 120 seconds, as shown in the following screenshot [image: Screen Shot 2016-07-09 at 11.45.44 PM.png] - However, when I check the jobs launched by
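
A minimal sketch (not from the thread) of one way to compare these numbers programmatically: time the driver code directly and register a SparkListener to record per-job durations. The UI's total duration typically reflects how long the SparkContext has been up, including driver-side work and idle time between jobs, so it is normally larger than the sum of the individual job times. The application name, RDD, and variable names below are made up for illustration.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

    object RuntimeMeasurement {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("runtime-measurement"))

        // Record per-job durations (milliseconds) as the scheduler reports them.
        val jobStarts = scala.collection.mutable.Map[Int, Long]()
        sc.addSparkListener(new SparkListener {
          override def onJobStart(jobStart: SparkListenerJobStart): Unit =
            jobStarts(jobStart.jobId) = jobStart.time
          override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
            println(s"Job ${jobEnd.jobId} took ${jobEnd.time - jobStarts(jobEnd.jobId)} ms")
        })

        // Wall-clock time of the whole driver program, which also includes
        // scheduling gaps and driver-side work that no single job accounts for.
        val t0 = System.nanoTime()
        val count = sc.parallelize(1 to 1000000).map(_ * 2).count()
        println(s"count = $count, wall clock = ${(System.nanoTime() - t0) / 1e9} s")

        sc.stop()
      }
    }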

[no subject]

2016-10-10 Thread Fei Hu
Hi All, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer, and in the dependencies I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3 When I wrot

Kryo on Zeppelin

2016-10-10 Thread Fei Hu
Hi All, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer, and in the dependencies I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3 When I wro
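
For reference, a minimal sketch of the plain SparkConf way to turn on Kryo (the same spark.serializer property the Zeppelin interpreter setting controls); the MyRecord class is a made-up placeholder for whatever classes are actually serialized. Note that Spark 1.x bundles its own Kryo (through Twitter Chill), so adding a different kryo version such as 3.0.3 as an interpreter dependency can lead to classpath conflicts.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical record type standing in for the user's own classes.
    case class MyRecord(id: Int, value: String)

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional but recommended: register the classes that will be shuffled or cached.
      .registerKryoClasses(Array(classOf[MyRecord]))

    val sc = new SparkContext(conf)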

RDD Location

2016-12-29 Thread Fei Hu
Dear all, Is there any way to change the host location for a certain partition of an RDD? "protected def getPreferredLocations(split: Partition)" can be used to initialize the location, but how can it be changed after the initialization? Thanks, Fei
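
The replies below suggest overriding getPreferredLocations(); here is a minimal sketch (names made up, not from the thread) of a custom RDD whose preferred hosts are looked up at scheduling time, so the driver can change them between jobs. Preferred locations are only a scheduling hint, so the scheduler may still place a task elsewhere.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // A simple partition that only carries its index.
    class SimplePartition(override val index: Int) extends Partition

    // Preferred hosts are read from a mutable map each time a job is scheduled,
    // so updating `hostFor` on the driver "moves" a partition for later jobs.
    class LocatedRDD(sc: SparkContext,
                     numParts: Int,
                     var hostFor: Map[Int, Seq[String]])
      extends RDD[Int](sc, Nil) {

      override def getPartitions: Array[Partition] =
        (0 until numParts).map(i => new SimplePartition(i): Partition).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)

      // Consulted each time a job over this RDD is scheduled.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        hostFor.getOrElse(split.index, Nil)
    }

    // Hypothetical usage: re-point partition 0 before launching the next job.
    // locatedRdd.hostFor = locatedRdd.hostFor.updated(0, Seq("worker-02"))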

Re: RDD Location

2016-12-30 Thread Fei Hu
de the > getPreferredLocations() to implement the logic of dynamic changing of the > locations. > > On Dec 30, 2016, at 12:06, Fei Hu wrote: > > > > Dear all, > > > > Is there any way to change the host location for a certain partition of > RDD? > > > > "

context.runJob() was suspended in getPreferredLocations() function

2016-12-30 Thread Fei Hu
Dear all, I tried to customize my own RDD. In the getPreferredLocations() function, I used the following code to query another RDD, which was used as an input to initialize this customized RDD: * val results: Array[Array[DataChunkPartition]] = context.runJob(partitionsRDD, (con
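
One plausible explanation (not confirmed in the thread) is that getPreferredLocations() is called by the scheduler while it is preparing a job, so a context.runJob() issued from inside it ends up waiting on the very scheduler that is blocked calling it. The usual pattern is to run the helper job once, eagerly on the driver, and let getPreferredLocations() do a plain lookup. A sketch, with a stand-in for the DataChunkPartition type mentioned above:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical stand-in for the thread's DataChunkPartition type.
    case class DataChunkPartition(id: Int, host: String)

    // Run the helper job once, before the custom RDD is constructed; pass the
    // result into the RDD so getPreferredLocations() never has to submit a job.
    def precomputeChunks(sc: SparkContext,
                         partitionsRDD: RDD[DataChunkPartition]): Array[Array[DataChunkPartition]] =
      sc.runJob(partitionsRDD, (iter: Iterator[DataChunkPartition]) => iter.toArray)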

Re: RDD Location

2016-12-30 Thread Fei Hu
Job inside getPreferredLocations(). > You can take a look at the source code of HadoopRDD to help you implement > getPreferredLocations() > appropriately. > > On Dec 31, 2016, at 09:48, Fei Hu wrote: > > That is a good idea. > > I tried adding the following code to get ge

Equally split an RDD partition into two partitions at the same node

2017-01-14 Thread Fei Hu
Dear all, I want to equally divide an RDD partition into two partitions: the first half of the elements in the partition will form one new partition, and the second half will form another new partition. But the two new partitions are required to be at the same node.
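
A minimal sketch (not from the thread; all names made up) of the kind of custom RDD the later replies discuss: twice as many partitions as the parent, a compute() that emits one half of the parent partition, and preferred locations taken from the parent partition so both halves stay on the same node.

    import org.apache.spark.{NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    import scala.reflect.ClassTag

    // Each output partition points at one parent partition and says which half it takes.
    class HalfPartition(override val index: Int,
                        val parentIndex: Int,
                        val firstHalf: Boolean) extends Partition

    class SplitInTwoRDD[T: ClassTag](parent: RDD[T])
      extends RDD[T](parent.context, Seq(new NarrowDependency[T](parent) {
        // Each output partition depends on exactly one parent partition.
        override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
      })) {

      override def getPartitions: Array[Partition] =
        parent.partitions.flatMap { p =>
          Seq[Partition](new HalfPartition(2 * p.index, p.index, firstHalf = true),
                         new HalfPartition(2 * p.index + 1, p.index, firstHalf = false))
        }

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val hp = split.asInstanceOf[HalfPartition]
        // Buffers the parent partition; acceptable for a sketch, not for huge partitions.
        val all = parent.iterator(parent.partitions(hp.parentIndex), context).toArray
        val mid = all.length / 2
        if (hp.firstHalf) all.iterator.take(mid) else all.iterator.drop(mid)
      }

      // Reuse the parent partition's hosts so both halves prefer the same node.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        parent.preferredLocations(
          parent.partitions(split.asInstanceOf[HalfPartition].parentIndex))
    }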

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
locality. Thanks, Fei On Sun, Jan 15, 2017 at 2:33 AM, Rishi Yadav wrote: > Can you provide some more details: > 1. How many partitions does RDD have > 2. How big is the cluster > On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote: > >> Dear all, >> >> I want to equall

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
iting to HDFS, but it might still be a narrow dependency (satisfying your > requirements) if you increase the # of partitions. > > Best, > Anastasios > > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu wrote: > >> Dear all, >> >> I want to equally divide a RDD partition
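
A quick way to see whether a given repartitioning step stayed narrow or introduced a shuffle (sketch; `rdd` is assumed to exist already): print the lineage, where each shuffle boundary starts a new indented block.

    // repartition() always shuffles; coalesce(n, shuffle = false) stays narrow.
    val doubled = rdd.repartition(rdd.partitions.length * 2)
    println(doubled.toDebugString)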

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
CoalescedRDD code to implement your > requirement. > > Good luck! > Cheers, > Anastasios > > > On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu wrote: > >> Hi Anastasios, >> >> Thanks for your reply. If I just increase the numPartitions to be twice >> larger

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
l be a narrow dependency (satisfying > > your > > requirements) if you increase the # of partitions. > > > > Best, > > Anastasios > > > > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu < > > > hufei68@ > > > > wrote: > > > >

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
don’t think > RDD number of partitions will be increased. > > > > Thanks, > > Jasbir > > > > *From:* Fei Hu [mailto:hufe...@gmail.com] > *Sent:* Sunday, January 15, 2017 10:10 PM > *To:* zouz...@cs.toronto.edu > *Cc:* user @spark ; dev@spark.apache.org > *Su
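
A small check of the point made here (sketch, spark-shell style): coalesce() without a shuffle can only merge partitions, never create more, so the partition count does not increase unless shuffle = true.

    val rdd = sc.parallelize(1 to 100, 4)
    println(rdd.coalesce(8, shuffle = false).partitions.length)  // still 4
    println(rdd.coalesce(8, shuffle = true).partitions.length)   // 8, at the cost of a shuffle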

Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Fei Hu
need to add some logic in compute() to > decide which half of the parent partition should be output. And you need > to get the correct preferred locations for the partitions sharing the same > parent partition. > > > Fei Hu wrote > > Hi Liang-Chi, > > > > Yes, y

Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Fei Hu
, 2017 at 2:07 PM, Pradeep Gollakota wrote: > Usually this kind of thing can be done at a lower level in the InputFormat > usually by specifying the max split size. Have you looked into that > possibility with your InputFormat? > > On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote:
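
A sketch of the max-split-size suggestion for a FileInputFormat-based source (the path and the 64 MB cap are made up; `sc` is the usual SparkContext): capping the split size makes the InputFormat produce more, smaller splits, and hence more input partitions, without writing a custom RDD.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Standard Hadoop property read by FileInputFormat when computing splits.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (64 * 1024 * 1024).toString)  // cap each split at 64 MB

    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
    println(lines.partitions.length)  // typically more partitions than with the default split size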