Hey Ted,

The Event table looks like this: UserID, EventType, EventKey, TimeStamp,
MetaData.  I just parse it from JSON and save it as Parquet; I did not
change the partitioning.

Annoyingly, each day's incoming Event data contains duplicates of the
other days' data.  The same event could show up in Day 1 and Day 2, and
probably Day 3 as well.

I only want to keep a single Event table, but each day comes with so many
duplicates.

Is there a way I could just insert into Parquet and, if a duplicate is
found, ignore it?
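
For what it's worth, here is roughly what I am picturing (just a rough
sketch from the spark-shell, not tested; the paths are made up, and I am
assuming UserID + EventType + EventKey + TimeStamp uniquely identifies an
event):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

// sc is the SparkContext from the shell; the paths and key columns below
// are placeholders for this sketch.
val sqlContext = new SQLContext(sc)

val existing = sqlContext.read.parquet("/data/events")        // merged table so far
val incoming = sqlContext.read.parquet("/data/events_day3")   // today's raw events

val keyCols = Seq("UserID", "EventType", "EventKey", "TimeStamp")

// Dedup within the incoming day first.
val deduped = incoming.dropDuplicates(keyCols)

// Keep only rows whose key is not already present: left outer join on the
// key columns and keep the rows that found no match.
val existingKeys = existing.select(keyCols.map(col): _*).distinct()
val joinCond = keyCols.map(k => deduped(k) === existingKeys(k)).reduce(_ && _)

val newOnly = deduped
  .join(existingKeys, joinCond, "left_outer")
  .where(existingKeys(keyCols.head).isNull)
  .select(deduped.columns.map(deduped(_)): _*)

// Append only the new rows.  Writing back into the same path that is being
// read seems risky, so I would land them in a fresh directory (or a new
// date partition) and treat the Event table as the union of those.
newOnly.write.parquet("/data/events_new/day3")

No idea whether that is the right direction, though.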

Thanks,
Gavin

On Fri, Jan 8, 2016 at 2:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Is your Parquet data source partitioned by date?
>
> Can you dedup within partitions?
>
> Cheers
>
> On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> I tried it on three days' data.  The total input is only 980 GB, but the
>> shuffle write data is about 6.2 TB, and the job then failed during the
>> shuffle read step, which should be another 6.2 TB of shuffle read.
>>
>> I think the shuffle cannot be avoided for dedup. Is there anything I
>> could do to stabilize this process?
>>
>> Thanks.
>>
>>
>> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> I get an Event table every day and want to merge them into a single
>>> Event table, but there are so many duplicates in each day's data.
>>>
>>> I use Parquet as the data source.  What I am doing now is:
>>>
>>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet
>>> file")
>>>
>>> Each day's events are stored in their own Parquet file.
>>>
>>> But it failed at stage 2, which keeps losing the connection to one
>>> executor. I guess this is due to a memory issue.
>>>
>>> Any suggestions on how to do this efficiently?
>>>
>>> Thanks,
>>> Gavin
>>>
>>
>>
>
