Can anyone shed some light on this?

On Tue, Mar 17, 2015 at 5:23 PM, Chen Song <chen.song...@gmail.com> wrote:
> I have a map reduce job that reads from three logs and joins them on some
> key column. The underlying data is protobuf messages in sequence files.
> Between mappers and reducers, the underlying raw byte arrays for the
> protobuf messages are shuffled. Roughly, for 1 GB of input from HDFS,
> there is 2 GB of data output from the map phase.
>
> I am testing Spark jobs (v1.3.0) on the same input. I found that shuffle
> write is 3-4 times the input size. I tried passing protobuf Message
> objects and Array[Byte], but neither gives good shuffle write output.
>
> Is there any good practice on shuffling
>
> * protobuf messages
> * raw byte arrays?
>
> Chen
>
> --
> Chen Song
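
One thing worth ruling out first (a sketch, not a confirmed diagnosis): Spark 1.3 uses Java serialization for shuffle data by default, which adds per-record overhead when shuffling Message objects or Array[Byte] values. Switching the job to Kryo and registering the protobuf classes is the usual first step; the class name below (`com.example.MyProto`) is a placeholder for your own generated message class.

```shell
# Sketch only: submit with Kryo serialization instead of the default Java serializer.
# spark.kryo.classesToRegister is available in Spark 1.2+; the protobuf class name
# shown here is hypothetical and should be replaced with your generated class.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=com.example.MyProto \
  ...
```

It is also worth comparing `spark.shuffle.compress` (on by default) against the compression codec the MapReduce job uses, since a codec difference alone can account for a large gap in reported shuffle write size.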