Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-25 Thread taozhuo
The Dataset is defined as a case class with many fields that have nested structure (Map, List of another case class, etc.). The size of the Dataset is only 1 TB when saved to disk as a Parquet file, but when joining it the shuffle write size grows to as much as 12 TB. Is there a way to cut it down without changing …
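(The thread does not include the original query, but a common mitigation for this situation is to avoid shuffling the wide nested columns at all: project down to the join key plus the fields actually needed before the join, or broadcast the smaller side. Below is a minimal sketch under those assumptions; the case classes, paths, and column names are hypothetical and only illustrate the shape of the problem described above.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical nested schema, loosely matching the description above:
// many fields, including a Map and a List of another case class.
case class Item(sku: String, qty: Int)
case class Order(orderId: String, userId: String,
                 attributes: Map[String, String], items: Seq[Item])

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").getOrCreate()
    import spark.implicits._

    val orders = spark.read.parquet("/data/orders").as[Order]   // large, nested
    val users  = spark.read.parquet("/data/users")               // small dimension table

    // Project only the join key and the fields that are actually needed
    // *before* the join, so the wide nested columns are not shuffled.
    val slim = orders.select($"orderId", $"userId")

    // If the other side is small enough, broadcasting it avoids shuffling
    // the large Dataset entirely.
    val joined = slim.join(broadcast(users), Seq("userId"))

    joined.write.parquet("/data/joined")
    spark.stop()
  }
}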

spark-submit hangs forever after all tasks finish (Spark 2.0.0 stable version on YARN)

2016-07-30 Thread taozhuo
Below are the log messages that seem to repeat indefinitely:

16/07/30 23:25:38 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
16/07/30 23:25:39 DEBUG Client: IPC Client (1735131305) connection to /10.80.1.168:8032 from zhuotao sending #147247
16/07/30 23:25:39 DEBUG Client: IPC Client (173…
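(The snippet above does not show the root cause, but a frequent reason the YARN client keeps polling getApplicationReport after all tasks finish is that the driver JVM never exits, e.g. the SparkContext is never stopped or a non-daemon thread is still running. A minimal sketch of the usual remedy, assuming a simple batch job; the job logic and paths are placeholders.)

import org.apache.spark.sql.SparkSession

object BatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-job").getOrCreate()
    try {
      // ... job logic: read, transform, write ...
      spark.range(100).write.mode("overwrite").parquet("/tmp/out")
    } finally {
      // Stopping the SparkContext lets the YARN application move to FINISHED,
      // so spark-submit can return instead of polling getApplicationReport forever.
      spark.stop()
    }
  }
}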