of this task is not much larger than the fast finished tasks, is that
normal?

I am also interested in this case, as from statistics on the UI, how it
indicates the task could have skew data?

Yong
Date: Mon, 13 Apr 2015 12:58:12 -0400
Subject: Re: Equi Join is taking for ever. 1 Task is Running while other 199
are complete
From: jcove...@gmail.com
To: deepuj...@gmail.com
CC: user@spark.apache.org
I can promise you that this is also a problem in the pig world :) not sure
why it's not a problem for this data set, though... are you sure that the
two are running the exact same code?
you should inspect your source data. Make a histogram for each and see what
the data distribution looks like. If t
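A minimal local sketch of the histogram check suggested here. On a real RDD you would start from something like `lstgItem.countByKey()`; below, a plain `Map` with made-up counts stands in for that result so the inspection logic itself is easy to verify. The sample data and the 50% threshold are illustrative assumptions, not values from the thread.

```scala
// Assumed sample output of countByKey(); the counts are made up.
val itemCounts: Map[Long, Long] = Map(
  0L  -> 900000L, // a suspicious catch-all key dominating the data
  42L -> 1200L,
  77L -> 950L
)

// Sort keys by frequency so the heaviest (potentially skewed) keys surface first.
val byFrequency: Seq[(Long, Long)] = itemCounts.toSeq.sortBy { case (_, n) => -n }

// Flag any key holding a disproportionate share (here, over half) of all records.
val total: Long = itemCounts.values.sum
val skewedKeys: Seq[Long] = byFrequency.collect {
  case (k, n) if n.toDouble / total > 0.5 => k
}
```

If `skewedKeys` is non-empty, the join on those keys will pile onto a handful of tasks, which matches the "1 task running while 199 are complete" symptom.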
You mean there is a tuple in either RDD that has itemID = 0 or null?
And what is a catch-all?
Does that imply it is a good idea to run a filter on each RDD first? We do
not do this using Pig on M/R. Is it required in the Spark world?
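Spark does not require a pre-filter any more than Pig does, but dropping known catch-all keys before the join is a common workaround once skew is confirmed. A local sketch, assuming the catch-all convention is itemID = 0 or null (as raised above); `filterCatchAll` and the sample pairs are hypothetical, and on real RDDs the same predicate would go into `rdd.filter`:

```scala
// Sketch: drop catch-all join keys (assumed here to be 0 or null) before joining.
// java.lang.Long is used so the key can actually be null, as asked in the thread.
def filterCatchAll[V](pairs: Seq[(java.lang.Long, V)]): Seq[(java.lang.Long, V)] =
  pairs.filter { case (id, _) => id != null && id != 0L }

// Made-up sample data standing in for one side of the join.
val lstgItemSample: Seq[(java.lang.Long, String)] = Seq(
  (0L: java.lang.Long, "catchAll"),
  (42L: java.lang.Long, "listing42"),
  (null, "noId")
)

val cleaned = filterCatchAll(lstgItemSample)
```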
On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney
wrote:
My guess would be data skew. Do you know if there is some item id that is a
catch all? can it be null? item id 0? lots of data sets have this sort of
value and it always kills joins
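To see why one catch-all value kills a join: with hash partitioning, every row sharing a key lands in the same partition, so one task does nearly all the work. A minimal local sketch, where the partition count and the synthetic key distribution are illustrative assumptions (the modulo step mimics how a hash partitioner assigns keys, not Spark's exact internals):

```scala
val numPartitions = 4

// 1000 rows with the catch-all key 0, plus a handful of well-behaved keys.
val keys: Seq[Long] = Seq.fill(1000)(0L) ++ Seq(1L, 2L, 3L, 4L, 5L, 6L)

// Assign each key to a partition by hash, keeping the result non-negative.
def partitionOf(key: Long): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h
}

// Count rows per partition: the partition holding key 0 dwarfs the others.
val sizes: Map[Int, Int] =
  keys.groupBy(partitionOf).map { case (p, ks) => p -> ks.size }
```

The partition that receives key 0 ends up with over a thousand rows while the rest get one or two, which is exactly the one-straggler-task pattern described in the subject line.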
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
Code:

val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] =
  lstgItem.join(viEvents).map {
    case (itemId, (listing, viDetail)) =>
      val viSummary = new VISummary
      viSummary.leafCategoryId = listing.getLeafCategId().toInt
      viSummary.itemSiteId = listi