Hi Reynold,
We use glom() because it makes it easy to adapt the calculation logic already implemented in MR. To be clear, we are still in the POC stage. Since the results show almost no time difference between this glom stage and the MR mapper, glom might not be the issue here. I monitored the network traffic while the repartition was happening: it peaked at about 200-300 MB/s but stayed at around 3-4 MB/s for a long stretch. Do you have any idea what could be causing that?
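To make the comparison concrete, here is a minimal Scala sketch of the two shapes under discussion; Element, mapperLogic, and the method names are illustrative placeholders, not our actual job code:

import org.apache.spark.rdd.RDD

object GlomComparison {
  type Element = String // stand-in for the real record type in our job

  // Placeholder for the calculation logic ported from the MR mapper;
  // the body here is purely illustrative.
  def mapperLogic(rows: Iterator[Element]): Iterator[(Array[Byte], Array[Byte])] =
    rows.map(r => (r.getBytes("UTF-8"), r.getBytes("UTF-8")))

  // Current POC approach: glom() first materializes each partition as an
  // Array[Element] in memory, then the mapper-equivalent logic runs on it.
  def withGlom(input: RDD[Element]): RDD[(Array[Byte], Array[Byte])] =
    input.glom().mapPartitions(_.flatMap(arr => mapperLogic(arr.iterator)))

  // The suggested alternative: consume the partition's Iterator directly,
  // so records stream through without being pinned in memory up front.
  def withoutGlom(input: RDD[Element]): RDD[(Array[Byte], Array[Byte])] =
    input.mapPartitions(mapperLogic)
}

The streaming variant also lets each task start producing output before the whole partition has been read, which the array-based version cannot do.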
On Fri, Oct 23, 2015 at 2:43 AM, Reynold Xin <r...@databricks.com> wrote:
> Why do you do a glom? It seems unnecessarily expensive to materialize each
> partition in memory.
>
> On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 <qhz...@apache.org> wrote:
>> Hi, Spark community,
>> I have an application that I am trying to migrate from MR to Spark.
>> It does some calculations over Hive data and writes the output to HFiles,
>> which are then bulk loaded into an HBase table. Details as follows:
>>
>> RDD<Element> input = getSourceInputFromHive()
>> RDD<Tuple2<byte[], byte[]>> mapSideResult =
>>     input.glom().mapPartitions(/* some calculation, equivalent to the MR mapper */)
>> // PS: the result in each partition has already been sorted in
>> // lexicographical order during the calculation
>> mapSideResult
>>     .repartitionAndSortWithinPartitions(/* partitioned by byte[][] HTable split keys, equivalent to the MR shuffle */)
>>     .map(/* transform Tuple2<byte[], byte[]> to Tuple2<ImmutableBytesWritable, KeyValue>, equivalent to the MR reducer without output */)
>>     .saveAsNewAPIHadoopFile(/* write to HFile */)
>>
>> This all works fine on a small dataset, and Spark outruns MR by
>> about 10%. However, when I apply it to a dataset of 150 million records,
>> MR is about 100% faster than Spark (MR 25 min, Spark 50 min).
>> The application UI shows that the repartitionAndSortWithinPartitions
>> stage is very slow: in its shuffle phase, a 6 GB shuffle takes about
>> 18 min, which seems quite unreasonable.
>> Can anyone help with this issue and give me some advice? It's not
>> iterative processing, but I believe Spark should be at least as fast.
>>
>> Cluster info:
>> vm: 8 nodes * (128 GB mem + 64 cores)
>> hadoop cluster: hdp 2.2.6
>> spark running mode: yarn-client
>> spark version: 1.5.1
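For completeness, here is a compilable Scala sketch of the pipeline from the quoted mail, with the method name corrected to repartitionAndSortWithinPartitions. The RegionSplitPartitioner, the family/qualifier parameters, and the output path are illustrative assumptions standing in for our production code, and in a real job HFileOutputFormat2 would normally also be set up via HFileOutputFormat2.configureIncrementalLoad:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Routes each row key to the HBase region that will host it, playing the
// role the TotalOrderPartitioner plays in the original MR job's shuffle.
class RegionSplitPartitioner(splitKeys: Array[Array[Byte]]) extends Partitioner {
  override def numPartitions: Int = splitKeys.length + 1
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Array[Byte]]
    // Index of the first split key greater than k; keys at or beyond the
    // last split fall into the final partition.
    val idx = splitKeys.indexWhere(split => Bytes.compareTo(k, split) < 0)
    if (idx < 0) splitKeys.length else idx
  }
}

object HFilePipeline {
  // Byte-wise row-key ordering, matching HBase's lexicographic sort.
  implicit val bytesOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
  }

  def run(mapSideResult: RDD[(Array[Byte], Array[Byte])],
          splitKeys: Array[Array[Byte]],
          family: Array[Byte],    // illustrative column family
          qualifier: Array[Byte], // illustrative qualifier
          outputPath: String,
          conf: Configuration): Unit = {
    mapSideResult
      // Equivalent of the MR shuffle: partition by region split keys and
      // sort within each partition by row key.
      .repartitionAndSortWithinPartitions(new RegionSplitPartitioner(splitKeys))
      // Equivalent of the MR reducer: shape records for HFile output.
      .map { case (row, value) =>
        (new ImmutableBytesWritable(row), new KeyValue(row, family, qualifier, value))
      }
      .saveAsNewAPIHadoopFile(
        outputPath,
        classOf[ImmutableBytesWritable],
        classOf[KeyValue],
        classOf[HFileOutputFormat2],
        conf)
  }
}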