Re: Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-26 Thread
I have replaced the default Java serialization with Kryo. It indeed reduced the shuffle size and the performance has improved; however, the shuffle speed remains unchanged. I am quite new to Spark, so does anyone have an idea of which direction I should go to find the root cause? 周千昊 wrote on 2015…
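
For reference, a minimal sketch of the Kryo switch described above, assuming a plain JavaSparkContext setup (the app name is a hypothetical placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class KryoSetup {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("shuffle-poc") // hypothetical app name
                    // replace the default Java serialization with Kryo
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job logic goes here ...
            sc.stop();
        }
    }

Registering the record classes with conf.registerKryoClasses(...) can shrink the shuffle further, since Kryo then avoids writing full class names per record.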

Re: Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-23 Thread
…ta, because we kind of copied the MR implementation > into Spark. > > Let us know if more info is needed. > > On Fri, Oct 23, 2015 at 10:24 AM, 周千昊 wrote: > > > +kylin dev list > > > > 周千昊 wrote on Fri, Oct 23, 2015, 10:20 AM: > > > > > Hi, Reynold > > >

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread
+kylin dev list 周千昊 wrote on Fri, Oct 23, 2015, 10:20 AM: > Hi, Reynold > Using glom() is because it is easy to adapt to the calculation logic > already implemented in MR. And to be clear, we are still in POC. > Since the results show there is almost no difference between this > glo…

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread
…seems unnecessarily expensive to materialize each > partition in memory. > > On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 wrote: >> Hi, Spark community >> I have an application which I am trying to migrate from MR to Spark. >> It will do some calculations from Hive and o…
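
The point being made here is that glom() turns each partition into one in-memory array before any processing starts. A minimal sketch of the streaming alternative with mapPartitions, under the Spark 1.x Java API (the record type and process() method are hypothetical):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class StreamPartitions {
        // hypothetical per-record computation
        static String process(String record) {
            return record.toUpperCase();
        }

        static JavaRDD<String> withoutGlom(JavaRDD<String> input) {
            // mapPartitions consumes the partition's iterator record by
            // record, so the whole partition never has to be resident at once
            return input.mapPartitions((Iterator<String> iter) -> {
                List<String> out = new ArrayList<>();
                while (iter.hasNext()) {
                    out.add(process(iter.next()));
                }
                return out; // Spark 1.x FlatMapFunction returns an Iterable
            });
        }
    }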

repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread
Hi, Spark community, I have an application which I am trying to migrate from MR to Spark. It will do some calculations from Hive and output to HFiles, which will be bulk loaded into an HBase table. Details as follows: Rdd input = getSourceInputFromHive() Rdd> mapSideResult = input.glom().mapP…
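
For context on the operation in the subject line: repartitionAndSortWithinPartitions shuffles and sorts in a single pass (one shuffle), which is why it is the usual building block for HFile output, where keys must arrive sorted within each partition. A minimal sketch with simplified placeholder key/value types (a real HBase bulk load would typically use ImmutableBytesWritable and KeyValue):

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    public class SortedShuffle {
        // shuffle and sort within partitions in one pass, rather than
        // repartition() followed by a separate sortByKey()
        static JavaPairRDD<String, byte[]> prepare(JavaPairRDD<String, byte[]> kv,
                                                   int numPartitions) {
            return kv.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions));
        }
    }

With no explicit Comparator, the sort uses the key type's natural ordering, so the key class must be Comparable.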

Re: avoid creating small objects

2015-08-14 Thread
I am thinking of creating a shared object outside the closure and using this object to hold the byte array. Will this work? 周千昊 wrote on Fri, Aug 14, 2015, 4:02 PM: > Hi, > All I want to do is this: > 1. read from some source > 2. do some calculation to get some byte array > 3.…

avoid creating small objects

2015-08-14 Thread
Hi, All I want to do is this: 1. read from some source, 2. do some calculation to get some byte array, 3. write the byte array to HDFS. In Hadoop, I can share an ImmutableBytesWritable and do some System.arraycopy; this prevents the application from creating a lot of small objects…
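
A minimal sketch of that Hadoop-style buffer reuse in Spark: one scratch buffer allocated per partition and reused for every record (the byte-array input, checksum logic, and buffer size are all hypothetical):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class ReusedBuffer {
        static JavaRDD<Integer> checksums(JavaRDD<byte[]> input) {
            return input.mapPartitions((Iterator<byte[]> iter) -> {
                // one scratch buffer per partition, reused across records,
                // mirroring the Hadoop ImmutableBytesWritable pattern
                byte[] buffer = new byte[64 * 1024]; // hypothetical size
                List<Integer> out = new ArrayList<>();
                while (iter.hasNext()) {
                    byte[] record = iter.next();
                    int n = Math.min(record.length, buffer.length);
                    System.arraycopy(record, 0, buffer, 0, n);
                    int sum = 0;
                    for (int i = 0; i < n; i++) {
                        sum += buffer[i]; // hypothetical calculation
                    }
                    out.add(sum);
                }
                return out; // Spark 1.x FlatMapFunction returns an Iterable
            });
        }
    }

Reuse is only safe as long as the shared buffer itself is never emitted, cached, or otherwise retained downstream, since Spark may hold references to emitted objects across records.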

Re: please help with ClassNotFoundException

2015-08-13 Thread
…ing Spark in production? Spark 1.3 is better than Spark 1.4. > > -- Original Message ------ > *From:* "周千昊";; > *Sent:* Friday, Aug 14, 2015, 11:14 AM > *To:* "Sea"<261810...@qq.com>; "dev@spark.apache.org"< > dev@spark.apache.org>;

Re: please help with ClassNotFoundException

2015-08-13 Thread
Hi Sea, I have updated Spark to 1.4.1; however, the problem still exists. Any idea? Sea <261810...@qq.com> wrote on Fri, Aug 14, 2015, 12:36 AM: > Yes, I guess so. I have seen this bug before. > > > -- Original Message ------ > *From:* "周千昊";; > *Sent:* Aug 13, 2015…

Re: please help with ClassNotFoundException

2015-08-13 Thread
Hi Sea, Is it the same issue as https://issues.apache.org/jira/browse/SPARK-8368? Sea <261810...@qq.com> wrote on Thu, Aug 13, 2015, 6:52 PM: > Are you using 1.4.0? If yes, use 1.4.1. > > > -- Original Message ------ > *From:* "周千昊";; > *Sent:* Aug 13, 2015…

please help with ClassNotFoundException

2015-08-13 Thread
Hi, I am using Spark 1.4 and have run into an issue. I am trying to use the aggregate function: JavaRdd rdd = some rdd; HashMap zeroValue = new HashMap(); // add initial key-value pair for zeroValue rdd.aggregate(zeroValue, new Function2,…
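
The preview cuts off mid-call; for reference, a minimal complete sketch of JavaRDD.aggregate with a HashMap zero value (the word-count logic here is a hypothetical stand-in, not from the original mail):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.api.java.JavaRDD;

    public class AggregateSketch {
        static HashMap<String, Integer> wordCounts(JavaRDD<String> rdd) {
            HashMap<String, Integer> zeroValue = new HashMap<>();
            return rdd.aggregate(
                    zeroValue,
                    // seqOp: fold one element into the per-partition map
                    (map, word) -> {
                        map.merge(word, 1, Integer::sum);
                        return map;
                    },
                    // combOp: merge the maps produced by different partitions
                    (a, b) -> {
                        for (Map.Entry<String, Integer> e : b.entrySet()) {
                            a.merge(e.getKey(), e.getValue(), Integer::sum);
                        }
                        return a;
                    });
        }
    }

Both functions target Spark's serializable Function2 interface; the ClassNotFoundException discussed in this thread was tracked as SPARK-8368 and reportedly fixed in 1.4.1.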