Can anyone shed some light on this?

On Tue, Mar 17, 2015 at 5:23 PM, Chen Song <chen.song...@gmail.com> wrote:

> I have a MapReduce job that reads from three logs and joins them on a
> key column. The underlying data is protobuf messages stored in sequence
> files. Between mappers and reducers, the raw byte arrays of the
> protobuf messages are shuffled. Roughly, for 1 GB of input from HDFS, the
> map phase outputs 2 GB of data.
>
> I am testing Spark jobs (v1.3.0) on the same input and found that the
> shuffle write is 3-4 times the input size. I tried passing both protobuf
> Message objects and Array[Byte], but neither reduces the shuffle write size.
>
> Is there a recommended practice for shuffling the following?
>
> * protobuf messages
> * raw byte arrays
>
> Chen
>
>


-- 
Chen Song
