Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
ION_BY_LINE_ID"), "left_outer").filter(deltaDF.col("sys_change_column").isNull) .drop(deltaDF.col("sys_change_column")) val mergedDataDF = syncDataDF.union(deltaDF) I believe, With

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Nicholas Chammas
"left_outer").filter(deltaDF.col("sys_change_column").isNull) >> .drop(deltaDF.col("sys_change_column")) >> >> val mergedDataDF = syncDataDF.union(deltaDF) >> >> I believe, Without doing *union *, only with Join this can

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
.isNull) .drop(deltaDF.col("sys_change_column")) val mergedDataDF = syncDataDF.union(deltaDF) I believe that without doing a *union*, this can be done with a join alone. Please suggest the best approach. As I can't write back *mergedDataDF* to

Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
. As I can't write *mergedDataDF* back to the path of historyDF, because I am only reading from there, what I am doing is writing to a temp path, then reading from there and writing back. This is a bad idea, and I need a suggestion here... mergedDataDF.write.mode(SaveMode.Overwrite).parquet("
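
For context, here is a minimal sketch of the merge pattern quoted in this thread and of the temp-path workaround the poster describes. The DataFrame names, the join key ("primary_key"), and all paths are assumptions for illustration, not the poster's actual schema; historyDF and deltaDF are assumed to share the same schema.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("delta-merge-sketch").getOrCreate()

// Assumed inputs with the same schema (hypothetical paths).
val historyDF: DataFrame = spark.read.parquet("/tmp/history")
val deltaDF: DataFrame   = spark.read.parquet("/tmp/delta")

// Keep only the history rows that have no matching delta row, then append the delta.
// A left_anti join expresses the "left_outer join + isNull filter + drop" steps
// quoted above in a single operation.
val syncDataDF   = historyDF.join(deltaDF, Seq("primary_key"), "left_anti")
val mergedDataDF = syncDataDF.union(deltaDF)

// The workaround described by the poster: Spark cannot overwrite a path it is
// currently reading from, so write to a staging path, re-read, then overwrite.
mergedDataDF.write.mode(SaveMode.Overwrite).parquet("/tmp/history_staging")
spark.read.parquet("/tmp/history_staging")
  .write.mode(SaveMode.Overwrite).parquet("/tmp/history")
```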

Re: [SQL] [Suggestion] Add top() to Dataset

2018-02-02 Thread Yacine Mazari
I see, thanks a lot for the clarifications.

Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-02-01 Thread Mridul Muralidharan
On Wed, Jan 31, 2018 at 1:15 AM, Ruifeng Zheng wrote: HI all: 1, Dataset API supports operation “sortWithinPartitions”, but in RDD API there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, but I don’t want to repartition the RDD), I have to convert R

Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Ruifeng Zheng
Do you mean in-memory processing? It works fine if all partitions are small, but when some partition doesn't fit in memory, it will cause OOM. From: Reynold Xin Date: Thursday, February 1, 2018, 3:14 PM To: Ruifeng Zheng Cc: Subject: Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions

Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Reynold Xin
You can just do that with mapPartitions pretty easily, can't you? On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng wrote: HI all: 1, Dataset API supports operation “sortWithinPartitions”, but in RDD API there is no counterpart (I know there is “repartitionAndSortWithinPartition
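
A minimal sketch of the mapPartitions approach Reynold suggests, sorting each partition in memory without changing the partitioning. As Ruifeng notes in his reply above, this assumes every partition fits in memory; the data and key used here are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sort-within-partitions-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical key/value data, already partitioned the way we want.
val rdd = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")), numSlices = 2)

// Sort each partition independently; preservesPartitioning keeps the partitioner.
val sortedWithinPartitions = rdd.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator,
  preservesPartitioning = true
)
```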

[Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Ruifeng Zheng
Hi all: 1, The Dataset API supports the operation “sortWithinPartitions”, but in the RDD API there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, but I don’t want to repartition the RDD), so I have to convert the RDD to a Dataset to use this function. Would it make sense to add a “s
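
For reference, a short sketch of the two existing APIs the request contrasts: the Dataset-side sortWithinPartitions and the RDD-side repartitionAndSortWithinPartitions (which repartitions as part of the sort). It assumes an existing SparkSession `spark` and SparkContext `sc`; the data is illustrative.

```scala
import org.apache.spark.HashPartitioner
import spark.implicits._

// Dataset API: sort inside each partition, no shuffle.
val ds = Seq((3, "c"), (1, "a"), (2, "b")).toDS()
val sortedDs = ds.sortWithinPartitions("_1")

// RDD API: the closest counterpart today also repartitions by the given partitioner.
val pairRdd = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
val sortedRdd = pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(4))
```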

Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Wenchen Fan
You can use `Dataset.limit`, which returns a new `Dataset` instead of an Array. Then you can transform it and still get the top-k optimization from Spark. On Wed, Jan 31, 2018 at 3:39 PM, Yacine Mazari wrote: Thanks for the quick reply and explanation @rxin. So if one does not want to colle
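
A small sketch of the pattern Wenchen describes: orderBy followed by limit stays a Dataset, so later transformations still benefit from the top-k rewrite. The Dataset `ds` and its columns are hypothetical.

```scala
import org.apache.spark.sql.functions.col

// Top 10 rows by score, still distributed as a Dataset/DataFrame.
val topK = ds.orderBy(col("score").desc).limit(10)

// Further transformations are possible before anything is collected.
val enriched = topK.withColumn("doubled_score", col("score") * 2)
enriched.collect()
```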

Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Yacine Mazari
Thanks for the quick reply and explanation @rxin. So if one does not want to collect()/take() but wants the top k as a Dataset to do further transformations, there is no optimized API; that's why I am suggesting adding this "top()" as a public method. If that sounds like a good idea, I will open a

Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Reynold Xin
For the DataFrame/Dataset API, the optimizer actually rewrites orderBy followed by a take into a priority-queue-based top implementation. On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari wrote: Hi All, Would it make sense to add a "top()" method to the Dataset API? This method would retu

[SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Yacine Mazari
Hi All, Would it make sense to add a "top()" method to the Dataset API? This method would return a Dataset containing the top k elements; the caller may then do further processing on the Dataset or call collect(). This is in contrast with RDD's top(), which returns a collected array. In terms of i
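
To make the contrast concrete, a hedged sketch of what exists today: RDD.top(k) collects an Array on the driver, while on the Dataset side the same effect currently comes from orderBy plus limit (the proposed Dataset top() does not exist). The data and column name are illustrative.

```scala
import org.apache.spark.sql.functions.col

// RDD API: the top k elements are collected to the driver as an Array.
val topArray: Array[Int] = sc.parallelize(1 to 1000).top(5)

// Dataset API today: stays a Dataset, relying on the orderBy + limit rewrite.
val topDs = ds.orderBy(col("score").desc).limit(5)
```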

Re: Reply: Limit Query Performance Suggestion

2017-01-18 Thread Liang-Chi Hsieh
hua -----Original Message----- From: Liang-Chi Hsieh [mailto: viirya@ ] Sent: January 18, 2017 15:48 To: dev@.apache Subject: Re: Limit Query Performance Suggestion Hi Sujith, I saw your updated post. Seems it makes sense to me now.

Reply: Limit Query Performance Suggestion

2017-01-18 Thread wangzhenhua (G)
: January 18, 2017 15:48 To: dev@spark.apache.org Subject: Re: Limit Query Performance Suggestion Hi Sujith, I saw your updated post. Seems it makes sense to me now. If you use a very big limit number, the shuffling before `GlobalLimit` would be a bottleneck for performance, of course, even it can even

Re: Limit Query Performance Suggestion

2017-01-17 Thread Liang-Chi Hsieh
Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-tp20570p20652.html

Re: Limit Query Performance Suggestion

2017-01-17 Thread sujith71955
ort with sample data and also figuring out a solution for this problem. Please let me know if you have any clarifications or suggestions regarding this issue. Regards, Sujith -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggest

Re: Limit Query Performance Suggestion

2017-01-15 Thread Liang-Chi Hsieh
Hi Sujith, Thanks for the suggestion. The code you quoted is from `CollectLimitExec`, which will be in the plan if a logical `Limit` is the final operator in a logical plan. But in the physical plan you showed, there are `GlobalLimit` and `LocalLimit` for the logical `Limit` operation, so the
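
As a rough illustration of the two plan shapes being discussed (operator names and exact layouts vary by Spark version, so treat this as a sketch rather than canonical output):

```scala
// Limit is the final operator of the query: typically planned as CollectLimitExec.
spark.range(1000).limit(10).explain()

// Limit with further operators on top of it: typically planned as a LocalLimit
// on each partition followed by a GlobalLimit after shuffling to one partition.
spark.range(1000).limit(10).filter("id > 5").explain()
```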

Limit Query Performance Suggestion

2017-01-12 Thread sujith chacko
by grouping data from all partitions into a single partition, this can create overhead since every partition will return limit n rows, so while grouping there will be N partitions * limit n rows, which can be very large; in both scenarios mentioned above this logic can be a bottleneck. My suggestion for

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Reynold Xin
GitHub already links to CONTRIBUTING.md -- of course, a lot of people ignore that. One thing we can do is to add an explicit link to the wiki contributing page in the template (but note that even that introduces some overhead for every pull request). Aside from that, I am not sure if the other su

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Reynold Xin
Actually, let's move the discussion to the JIRA ticket, given there is a ticket. On Sun, Oct 9, 2016 at 5:36 PM, Reynold Xin wrote: GitHub already links to CONTRIBUTING.md. -- of course, a lot of people ignore that. One thing we can do is to add an explicit link to the wiki contributing pa

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Felix Cheung
Should we just link to https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Sun, Oct 9, 2016 at 10:09 AM -0700, "Hyukjin Kwon" <gurwls...@gmail.com> wrote: Thanks for confirming this, Sean. I filed this in https://issues.apache.org/jira/browse/SPARK-17840 I wou

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Hyukjin Kwon
Thanks for confirming this, Sean. I filed this in https://issues.apache.org/jira/browse/SPARK-17840 I would appreciate it if anyone with better writing skills than mine tries to fix this. I don't want to make reviewers put in the effort to correct the grammar. On 10 Oct 2016 1:34 a.m., "Sean

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Sean Owen
Yes, it's really CONTRIBUTING.md that's more relevant, because GitHub displays a link to it when opening pull requests. https://github.com/apache/spark/blob/master/CONTRIBUTING.md There is also the pull request template: https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE I

Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Hyukjin Kwon
Hi all, I just noticed that the README.md (https://github.com/apache/spark) does not directly describe the steps, or link to them, for creating a PR or JIRA. I know it is probably sensible to search Google for the contribution guides first before trying to make a PR/JIRA, but I think it seems not en

Re: Suggestion for SPARK-1825

2014-07-25 Thread Patrick Wendell
Yeah I agree reflection is the best solution. Whenever we do reflection we should clearly document in the code which YARN API version corresponds to which code path. I'm guessing since YARN is adding new features... we'll just have to do this over time. - Patrick On Fri, Jul 25, 2014 at 3:35 PM,

Re: Suggestion for SPARK-1825

2014-07-25 Thread Reynold Xin
Actually, reflection is probably a better, lighter-weight approach for this. An extra project brings more overhead for something simple. On Fri, Jul 25, 2014 at 3:09 PM, Colin McCabe wrote: So, I'm leaning more towards using reflection for this. Maven profiles could work, but it's tough s

Re: Suggestion for SPARK-1825

2014-07-25 Thread Colin McCabe
So, I'm leaning more towards using reflection for this. Maven profiles could work, but it's tough since we have new stuff coming in in 2.4, 2.5, etc. and the number of profiles will multiply quickly if we have to do it that way. Reflection is the approach HBase took in a similar situation. best
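
A hedged sketch of the version-guarded reflection approach being discussed: look the method up at runtime and degrade gracefully when it is missing, documenting next to the call which Hadoop/YARN version introduced it. The helper below is generic and illustrative, not the actual code Spark adopted.

```scala
// Invoke target.methodName(longArg) if that method exists in the linked
// Hadoop/YARN version; return false (and do nothing) otherwise.
def callIfAvailable(target: AnyRef, methodName: String, longArg: Long): Boolean = {
  try {
    // classOf[Long] resolves to the primitive long, matching e.g. void m(long).
    val m = target.getClass.getMethod(methodName, classOf[Long])
    m.invoke(target, java.lang.Long.valueOf(longArg))
    true
  } catch {
    case _: NoSuchMethodException => false // older API: feature simply unavailable
  }
}
```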

Re: Suggestion for SPARK-1825

2014-07-25 Thread Colin McCabe
I have a similar issue with SPARK-1767. There are basically three ways to resolve the issue: 1. Use reflection to access classes newer than 0.21 (or whatever the oldest version of Hadoop is that Spark supports) 2. Add a build variant (in Maven this would be a profile) that deals with this. 3. Aut

Suggestion for SPARK-1825

2014-07-22 Thread innowireless TaeYun Kim
(I'm resending this mail since it seems that it was not sent. Sorry if it was already sent.) Hi, A couple of months ago, I made a pull request to fix https://issues.apache.org/jira/browse/SPARK-1825. My pull request is here: https://github.com/apache/spark/pull/899 But that pull request

Suggestion for SPARK-1825

2014-07-21 Thread innowireless TaeYun Kim
Hi, A couple of months ago, I made a pull request to fix https://issues.apache.org/jira/browse/SPARK-1825. My pull request is here: https://github.com/apache/spark/pull/899 But that pull request has problems: - It is Hadoop 2.4.0+ only. It won't compile on versions below it. - The

Re: Suggestion: rdd.compute()

2014-06-10 Thread Ankur Dave
You can achieve an equivalent effect by calling rdd.foreach(x => {}), which is the lightest possible action that forces materialization of the whole RDD. Ankur
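
A usage sketch of the workaround Ankur describes, with an illustrative pipeline (the path and transformations are assumptions):

```scala
val rdd = sc.textFile("/tmp/input").map(_.length).cache()

// Cheapest possible action: forces every partition to be computed (and cached),
// without shipping any data back to the driver.
rdd.foreach(_ => ())

// Later actions now read from the cache instead of recomputing.
println(rdd.count())
```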

Suggestion: rdd.compute()

2014-06-10 Thread innowireless TaeYun Kim
Hi, Regarding the following scenario, would it be nice to have an action method named something like 'compute()' that does nothing but compute/materialize all partitions of an RDD? It can also be useful for profiling. -Original Message- From: innowireless TaeYun Kim [mailto:taeyun...

Suggestion or question: Adding rdd.cancelCache() method

2014-05-29 Thread innowireless TaeYun Kim
What I understand is that rdd.cache() is really rdd.cache_this_rdd_when_it_actually_materializes(). So, a somewhat esoteric problem may occur. The example is as follows: void method1() { JavaRDD<...> rdd = sc.textFile(...) .map(...); rdd.cache(); // since the follo
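
A sketch of the situation in Scala terms: cache() only marks the RDD, and nothing is stored until an action runs, so a caller that changes its mind currently has to fall back on unpersist(). The path, transformation, and the actuallyNeeded flag are hypothetical.

```scala
val rdd = sc.textFile("/tmp/input").map(_.length)
rdd.cache()                          // only a mark; no partition is cached yet

val actuallyNeeded = false           // hypothetical decision made later
if (actuallyNeeded) {
  rdd.count()                        // first action: partitions computed and cached here
} else {
  rdd.unpersist(blocking = false)    // withdraw the mark / free anything already cached
}
```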

RE: Suggestion: RDD cache depth

2014-05-29 Thread innowireless TaeYun Kim
Opened a JIRA issue. (https://issues.apache.org/jira/browse/SPARK-1962) Thanks. -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Thursday, May 29, 2014 3:54 PM To: dev@spark.apache.org Subject: Re: Suggestion: RDD cache depth This is a pretty cool idea

Re: Suggestion: RDD cache depth

2014-05-28 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it. I’

Suggestion: RDD cache depth

2014-05-28 Thread innowireless TaeYun Kim
It would be nice if the RDD cache() method incorporated depth information. That is, void test() { JavaRDD<...> rdd = ...; rdd.cache(); // to depth 1. actual caching happens. rdd.cache(); // to depth 2. Nop as long as the storage level is the same. Else, exception. ... rdd.uncache(); // t
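
A hedged sketch of the reference-counting idea (Matei's framing in the reply above), written as a small wrapper rather than a change to RDD itself; no such API exists in Spark, and the naming is illustrative.

```scala
import org.apache.spark.rdd.RDD

class RefCountedCache[T](rdd: RDD[T]) {
  private var depth = 0

  def acquire(): Unit = synchronized {
    if (depth == 0) rdd.cache()                       // first user: mark for caching
    depth += 1                                        // deeper "cache depth"
  }

  def release(): Unit = synchronized {
    require(depth > 0, "release() without matching acquire()")
    depth -= 1
    if (depth == 0) rdd.unpersist(blocking = false)   // last user: really uncache
  }
}
```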

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-04-12 Thread Nick Pentreath
I think having the option of seeding the factors from HDFS rather than random is a good one (well, actually providing additional optional arguments initialUserFactors and initialItemFactors as RDD[(Int, Array[Double])]) On Mon, Apr 7, 2014 at 8:09 AM, Debasish Das wrote: Sorry not persist...I
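
A minimal sketch of the seeding idea: read previously computed factors from HDFS into an RDD[(Int, Array[Double])] and pass them in instead of random initialization. The path, the text layout (tab-separated id, comma-separated factors), and the parameter name are assumptions.

```scala
import org.apache.spark.rdd.RDD

val initialUserFactors: RDD[(Int, Array[Double])] =
  sc.textFile("hdfs:///models/als/userFactors")          // hypothetical location
    .map { line =>
      val Array(id, factors) = line.split("\t", 2)       // "id<TAB>f1,f2,...,fk"
      (id.toInt, factors.split(",").map(_.toDouble))
    }
```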

Re: Suggestion

2014-04-11 Thread Sandy Ryza
Hi Priya, Here's a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -Sandy On Fri, Apr 11, 2014 at 12:05 PM, priya arora wrote: Hi, May I know how one can contribute in this project http://spark.apache.org/mllib/ or in any other project. I am

Suggestion

2014-04-11 Thread priya arora
Hi, May I know how one can contribute to this project http://spark.apache.org/mllib/ or to any other project? I am very eager to contribute. Do let me know. Thanks & Regards, Priya Arora

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-04-06 Thread Debasish Das
Sorry, not persist...I meant adding a user parameter k which does a checkpoint after every k iterations...out of N ALS iterations...We have HDFS installed, so it's not a big deal...is there an issue with adding this user parameter in ALS.scala? If there is, then I can add it to our internal branch... For me tip

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-04-06 Thread Xiangrui Meng
Btw, explicit ALS doesn't need persist because each intermediate factor is only used once. -Xiangrui On Sun, Apr 6, 2014 at 9:13 PM, Xiangrui Meng wrote: The persist used in implicit ALS doesn't help the StackOverflow problem. Persist doesn't cut lineage. We need to call count() and then checkp

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-04-06 Thread Xiangrui Meng
The persist used in implicit ALS doesn't help the StackOverflow problem. Persist doesn't cut lineage. We need to call count() and then checkpoint() to cut the lineage. Did you try the workaround mentioned in https://issues.apache.org/jira/browse/SPARK-958: "I tune JVM thread stack size to 512k via opt
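
A hedged sketch of the checkpoint-every-k-iterations idea from this thread, applied to a generic iteratively updated factor RDD rather than the real ALS.scala internals. initialFactors, update, and numIterations are placeholders; persisting before checkpointing avoids recomputing the RDD when the checkpoint file is written.

```scala
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")   // hypothetical HDFS path

val checkpointInterval = 5                           // the proposed user parameter k
var factors: RDD[(Int, Array[Double])] = initialFactors

for (iter <- 1 to numIterations) {
  factors = update(factors)                          // one ALS-style update (placeholder)
  if (iter % checkpointInterval == 0) {
    factors.persist()
    factors.checkpoint()                             // mark for checkpointing
    factors.count()                                  // action forces the write; lineage is cut
  }
}
```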

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-04-06 Thread Debasish Das
At the head I see a persist option for implicitPrefs, but given more cases like the ones mentioned above, why don't we use a similar technique and take an input specifying at which iteration we should persist in explicit runs as well? for (iter <- 1 to iterations) { // perform ALS update logInfo("Re-co

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-03-27 Thread Debasish Das
Hi Matei, I am hitting similar problems with 10 ALS iterations...I am running with 24 GB executor memory on 10 nodes for a 20M x 3M matrix with rank = 50. The first iteration of flatMaps runs fine, which means that the memory requirements per iteration are good... If I do checkpointing on the RDD, most

Re: [SUGGESTION] suggest contributors to run sbt scalastyle before run sbt test

2014-03-03 Thread Reynold Xin
Thanks for the suggestion. Just did it. On Mon, Mar 3, 2014 at 7:38 AM, Nan Zhu wrote: Hi, all I noticed this because...my two PRs failed for the style error (lines exceeding the limit by 3-5 chars) yesterday Maybe we can explicitly suggest contributors to run sbt scalastyle

[SUGGESTION] suggest contributors to run sbt scalastyle before run sbt test

2014-03-03 Thread Nan Zhu
Hi, all I noticed this because…my two PRs failed for the style error (lines exceeding the limit by 3-5 chars) yesterday. Maybe we can explicitly suggest that contributors run sbt scalastyle before they run the test cases: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Just add one sente