Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-04-06 Thread Grega Kešpret
Hi! I'd like to get the community's opinion on implementing a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I would find it useful and I haven't found any existing implementation. The plan was basically to wrap t-digest
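
A minimal sketch of what "O(n) with limited memory" can look like for a quantile estimate. This is not t-digest and not a Spark API: it swaps in plain reservoir sampling (Algorithm R) as a simpler stand-in, and the names `approx_quantile` and `reservoir_size` are hypothetical, chosen only for illustration.

```python
import random


def approx_quantile(values, q, reservoir_size=10_000, seed=None):
    """Estimate the q-quantile of an iterable in one pass with bounded memory.

    Assumption: uses reservoir sampling (Algorithm R), not t-digest.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    rng = random.Random(seed)
    reservoir = []
    for n, value in enumerate(values, start=1):
        if len(reservoir) < reservoir_size:
            # Fill phase: keep every element until the reservoir is full.
            reservoir.append(value)
        else:
            # Keep the n-th element with probability reservoir_size / n.
            j = rng.randrange(n)
            if j < reservoir_size:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("cannot take a quantile of an empty input")
    reservoir.sort()
    index = min(int(q * len(reservoir)), len(reservoir) - 1)
    return reservoir[index]
```

When the input fits in the reservoir the result is exact; otherwise the error shrinks roughly with the square root of `reservoir_size`. A digest-style structure such as the t-digest mentioned above is designed to do better in the tails, which is where a uniform sample is weakest.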

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-04-06 Thread Reynold Xin
I think those are great to have. I would put them in the DataFrame API though, since this is applying to structured data. Many of the advanced functions on the PairRDDFunctions should really go into the DataFrame API now we have it. One thing that would be great to understand is what state-of-the-

Re: Wrong initial bias in GraphX SVDPlusPlus?

2015-04-06 Thread Sean Owen
See now: https://issues.apache.org/jira/browse/SPARK-6710 On Mon, Apr 6, 2015 at 4:27 AM, Reynold Xin wrote: > Adding Jianping Wang to the thread, since he contributed the SVDPlusPlus > implementation. > > Jianping, > > Can you take a look at this message? Thanks. > > > On Fri, Apr 3, 2015 at 8:4

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Sean Owen
SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just resolved it for 1.4 anyway. False alarm there. I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick it up if there's another RC, but by itself it is not something that needs a new RC. (I will give the same treatmen

Re: Spark + Kinesis

2015-04-06 Thread Vadim Bichutskiy
Hi all, I am wondering, has anyone on this list been able to successfully implement Spark on top of Kinesis? Best, Vadim On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy wrote: > Hi all, > > Below is the output that I am getting. My Kinesis stream has 1 shard, and > my Spark cluster on E

RE: Stochastic gradient descent performance

2015-04-06 Thread Ulanov, Alexander
Batch size impacts convergence, so bigger batch means more iterations. There are some approaches to deal with it (such as http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf), but they need to be implemented and tested. Nonetheless, could you share your thoughts regarding reducing this overhead

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread York, Brennon
+1 (non-binding) Tested GraphX, build infrastructure, & core test suite on OSX 10.9 w/ Java 1.7/1.8 On 4/6/15, 5:21 AM, "Sean Owen" wrote: >SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just >resolved it for 1.4 anyway. False alarm there. > >I back-ported SPARK-6205 into the 1

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Hari Shreedharan
It does not look like https://issues.apache.org/jira/browse/SPARK-6222 made it. It was targeted towards this release.  Thanks, Hari On Mon, Apr 6, 2015 at 11:04 AM, York, Brennon wrote: > +1 (non-binding) > Tested GraphX, build infrastructure, & core test suite on OSX 10.9 w/ Java > 1.7/1.8

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Mark Hamstra
Is that correct, or is the JIRA just out of sync, since TD's PR was merged? https://github.com/apache/spark/pull/5008 On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan wrote: > It does not look like https://issues.apache.org/jira/browse/SPARK-6222 > made it. It was targeted towards this release.

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Patrick Wendell
I believe TD just forgot to set the fix version on the JIRA. There is a fix for this in 1.3: https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e - Patrick On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra wrote: > Is that correct, or is the JIRA just out of sync, since TD

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Hari Shreedharan
Ah, ok. It was missing in the list of jiras. So +1. Thanks, Hari On Mon, Apr 6, 2015 at 11:36 AM, Patrick Wendell wrote: > I believe TD just forgot to set the fix version on the JIRA. There is > a fix for this in 1.3: > https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f772

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Mark Hamstra
+1 On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.2.2! > > The tag to be voted on is v1.2.2-rc1 (commit 7531b50): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Mark Hamstra
+1 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.3.1! > > The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Reynold Xin
+1 too On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.2.2! > > The tag to be voted on is v1.2.2-rc1 (commit 7531b50): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009c

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Krishna Sankar
+1 On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.2.2! > > The tag to be voted on is v1.2.2-rc1 (commit 7531b50): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Sean McNamara
+1 > On Apr 4, 2015, at 6:11 PM, Patrick Wendell wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.3.1! > > The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc

Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
Cc'ing Chris Fregly, who wrote the Kinesis integration. Maybe he can help. On Mon, Apr 6, 2015 at 9:23 AM, Vadim Bichutskiy wrote: > Hi all, > > I am wondering, has anyone on this list been able to successfully > implement Spark on top of Kinesis? > > Best, > Vadim > > On Sun, Apr 5, 2015 at

Re: Stochastic gradient descent performance

2015-04-06 Thread Xiangrui Meng
The gap sampling is triggered when the sampling probability is small and the directly underlying storage has constant time lookups, in particular, ArrayBuffer. This is a very strict requirement. If rdd is cached in memory, we use ArrayBuffer to store its elements and rdd.sample will trigger gap sam
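
A minimal sketch of the gap-sampling idea described above, assuming a Python list as a stand-in for the constant-time-lookup storage. It is not Spark's implementation; the name `gap_sample` is hypothetical. Instead of drawing one random number per element, it draws the number of elements to skip before the next accepted one from a geometric distribution, so the work is proportional to the sample size rather than to the input size.

```python
import math
import random


def gap_sample(items, fraction, seed=None):
    """Bernoulli-sample a list by jumping over geometrically distributed gaps.

    Assumption: `items` supports O(1) indexing (here, a Python list).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be in (0, 1)")
    rng = random.Random(seed)
    log_reject = math.log(1.0 - fraction)  # log of the per-element rejection probability
    sampled = []
    index = -1
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is always finite
        # Number of rejected elements before the next accepted one.
        gap = int(math.log(u) / log_reject)
        index += 1 + gap
        if index >= len(items):
            return sampled
        sampled.append(items[index])
```

Each element is still kept independently with probability `fraction`, which is why this is only worthwhile when the probability is small and skipping ahead is cheap.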

Re: Stochastic gradient descent performance

2015-04-06 Thread Reynold Xin
Note that we can do this in DataFrames and use Catalyst to push Sample down beneath Projection :) On Mon, Apr 6, 2015 at 12:42 PM, Xiangrui Meng wrote: > The gap sampling is triggered when the sampling probability is small > and the directly underlying storage has constant time lookups, in > par

Re: Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng
This is being discussed in https://issues.apache.org/jira/browse/SPARK-6407. Let's move the discussion there. Thanks for providing references! -Xiangrui On Sun, Apr 5, 2015 at 11:48 PM, Chunnan Yao wrote: > On-line Collaborative Filtering(CF) has been widely used and studied. To > re-train a CF m

Re: Experience using binary packages on various Hadoop distros

2015-04-06 Thread Dean Chen
This would be great for those of us running on HDP. At eBay we recently ran into a few problems using the generic Hadoop lib. Two off the top of my head: * Needed to include our custom Hadoop client due to custom Kerberos integration * Minor difference in HDFS protocol causing the following erro

Zinc now required?

2015-04-06 Thread mjhb
Today I cannot build the 1.2 branch: [INFO] [INFO] Building Spark Project Networking 1.2.3-SNAPSHOT [INFO] [INFO] --- scala-maven-plugin:3.2.0:comp

Re: Zinc now required?

2015-04-06 Thread Sean Owen
I don't think it's required. This looks like zinc is running (it seems to find the process on port 3030), but, something is wrong with zinc then. If you aren't running your own zinc, then it's the copy downloaded by Spark. Maybe try deleting that and shutting down the zinc process, and trying a cle

[mllib] Deprecate static train and use builder instead for Scala/Java

2015-04-06 Thread Yu Ishikawa
Hi all, Joseph proposed an idea about using just builder methods, instead of static train() methods for Scala/Java. I agree with that idea, because we have many duplicated static train() methods. If you have any thoughts on that, please share them with us. [SPARK-6682] Deprecate static train and u

Re: Zinc now required?

2015-04-06 Thread mjhb
Killing zinc resolved the problem building with scala-2.10 - thank you. (adding that to my build script) Having problems building with scala-2.11 - will post separately for that if reproducible.

Re: Zinc now required?

2015-04-06 Thread Marcelo Vanzin
I ran into this recently and I had to upgrade my version of zinc to fix it... On Mon, Apr 6, 2015 at 5:40 PM, mjhb wrote: > Today I cannot build the 1.2 branch: > > [INFO] > > [INFO] Building Spark Project Networking 1.2.3-S

1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
$ dev/change-version-to-2.11.sh $ build/mvn -e -DskipTests clean package [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-ne

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
Similar problem on 1.2 branch: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, or

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
What if you don't run zinc? I.e. just download maven and run that "mvn package...". It might take longer, but I wonder if it will work. On Mon, Apr 6, 2015 at 10:26 PM, mjhb wrote: > Similar problem on 1.2 branch: > > [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve >

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Marty Bower
I'm killing zinc (if it's running) before running each build attempt. Trying to build as "clean" as possible. On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell wrote: > What if you don't run zinc? I.e. just download maven and run that "mvn > package...". It might take longer, but I wonder if it w

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The issue is that if you invoke "build/mvn" it will start zinc again if it sees that it is killed. The absolute most "sterile" thing to do is this: 1. Kill any zinc processes. 2. Clean up spark "git clean -fdx" (WARNING: this will delete any staged changes you have, if you have code modifications

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
One thing that I think can cause issues is if you run build/mvn with Scala 2.10, then try to run it with 2.11, since I think we may store some downloaded jars relating to zinc that will get screwed up. Not sure that's what is happening, just an idea. On Mon, Apr 6, 2015 at 10:54 PM, Patrick Wendel

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
I resorted to deleting the spark directory between each build earlier today (attempting maximum sterility) and then re-cloning from github and switching to the 1.2 or 1.3 branch. Does anything persist outside of the spark directory? Are you able to build either 1.2 or 1.3 w/ Scala-2.11?

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The only thing that can persist outside of Spark is if there is still a live Zinc process. We took care to make sure this was a generally stateless mechanism. Both the 1.2.X and 1.3.X releases are built with Scala 2.11 for packaging purposes. And these have been built as recently as in the last fe

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
I even deleted my local maven repository (.m2) but still stuck when attempting to build w/ Scala-2.11: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following artifacts could not be res

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
Hmm.. Make sure you are building with the right flags. I think you need to pass -Dscala-2.11 to maven. Take a look at the upstream docs - on my phone now so can't easily access. On Apr 7, 2015 1:01 AM, "mjhb" wrote: > I even deleted my local maven repository (.m2) but still stuck when > attempti