Re: Updating docs for running on Mesos

2014-05-13 Thread Tim St Clair
Perhaps linking to a Mesos page, which can then list the various package incantations. Cheers, Tim - Original Message - > From: "Matei Zaharia" > To: dev@spark.apache.org > Sent: Tuesday, May 13, 2014 2:59:42 AM > Subject: Re: Updating docs for running on Mesos > > I’ll ask the Mesos

Re: Updating docs for running on Mesos

2014-05-13 Thread Gerard Maas
Great work! I just left some comments in the PR. In summary, it would be great to have more background on how Spark works on Mesos and how the different elements interact. That will (hopefully) help in understanding the practicalities of the common assembly location (http/hdfs) and how the jobs are d

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust wrote: > > > > It looks like currently the .count() on parquet is handled incredibly > > inefficiently and all the columns are materialized. But if I select

Re: Class-based key in groupByKey?

2014-05-13 Thread Matei Zaharia
Your key needs to implement hashCode in addition to equals. Matei On May 13, 2014, at 3:30 PM, Michael Malak wrote: > Is it permissible to use a custom class (as opposed to e.g. the built-in > String or Int) for the key in groupByKey? It doesn't seem to be working for > me on Spark 0.9.0/Scal

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Madhu
I just built rc5 on Windows 7 and tried to reproduce the problem described in https://issues.apache.org/jira/browse/SPARK-1712 It works on my machine: 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at :17) finished in 4.548 s 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whos

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Anand Avati
On Tue, May 13, 2014 at 8:26 AM, Michael Malak wrote: > Reposting here on dev since I didn't see a response on user: > > I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. > In the Spark Shell, equals() fails when I use the canonical equals() > pattern of match{}, but works

Class-based key in groupByKey?

2014-05-13 Thread Michael Malak
Is it permissible to use a custom class (as opposed to e.g. the built-in String or Int) for the key in groupByKey? It doesn't seem to be working for me on Spark 0.9.0/Scala 2.10.3: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ class C(val s:String) extends Serializ

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
+1, replaced rc3 with rc5, all applications are working fine Best, -- Nan Zhu On Tuesday, May 13, 2014 at 8:03 PM, Madhu wrote: > I built rc5 using sbt/sbt assembly on Linux without any problems. > There used to be an sbt.cmd for Windows build, has that been deprecated? > If so, I can docume

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread witgo
-1 The following bugs should be fixed: https://issues.apache.org/jira/browse/SPARK-1817 https://issues.apache.org/jira/browse/SPARK-1712 -- Original -- From: "Patrick Wendell"; Date: Wed, May 14, 2014 04:07 AM To: "dev@spark.apache.org"; Subject: Re: [VOTE]

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Thank you for your investigation into this! Just for completeness, I've confirmed it's a problem only in the REPL, not in compiled Spark programs. But within the REPL, a direct consequence of classes not being the same after serialization/deserialization is that lookup() doesn't work: scala> class C(val

Re: Class-based key in groupByKey?

2014-05-13 Thread Andrew Ash
In Scala, if you override .equals() you also need to override .hashCode(), just like in Java: http://www.scala-lang.org/api/2.10.3/index.html#scala.AnyRef I suspect if your .hashCode() delegates to just the hashcode of s then you'd be good. On Tue, May 13, 2014 at 3:30 PM, Michael Malak wrote:
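The advice in this thread can be sketched concretely. Below is a minimal key class (the class name and field are illustrative, and this is plain Scala rather than a Spark job): equals() and hashCode() must agree, or keys that compare equal can still land in different hash buckets under hash partitioning.

```scala
// Sketch of a groupByKey-safe key class: hashCode delegates to the
// same field that equals() compares, so equal keys hash identically.
class C(val s: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: C => this.s == that.s
    case _       => false
  }
  // Delegate to the underlying field's hash code, as suggested above.
  override def hashCode: Int = s.hashCode
}

val a = new C("x")
val b = new C("x")
println(a == b)                   // equal by field value
println(a.hashCode == b.hashCode) // same hash, so same bucket/partition
```

With only equals() overridden, `a.hashCode` and `b.hashCode` would fall back to identity hashes and differ, which is consistent with groupByKey appearing "not to work" for such keys.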

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Michael Armbrust
> > It looks like currently the .count() on parquet is handled incredibly > inefficiently and all the columns are materialized. But if I select just > that relevant column and then count, then the column-oriented storage of > Parquet really shines. > > There ought to be a potential optimization he

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Patrick Wendell
Hey all - there were some earlier RC's that were not presented to the dev list because issues were found with them. Also, there seem to be some issues with the reliability of the dev list e-mail. Just a heads up. I'll lead with a +1 for this. On Tue, May 13, 2014 at 8:07 AM, Nan Zhu wrote: > ju

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
Ah, I see, thanks -- Nan Zhu On Tuesday, May 13, 2014 at 12:59 PM, Mark Hamstra wrote: > There were a few early/test RCs this cycle that were never put to a vote. > > > On Tue, May 13, 2014 at 8:07 AM, Nan Zhu (mailto:zhunanmcg...@gmail.com)> wrote: > > > just curious, where is rc4 VOTE?

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Andrew Or
+1 2014-05-13 6:49 GMT-07:00 Sean Owen : > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell > wrote: > > The release files, including signatures, digests, etc. can be found at: > > http://people.apache.org/~pwendell/spark-1.0.0-rc5/ > > Good news is that the sigs, MD5 and SHA are all correct. >

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
just curious, where is rc4 VOTE? I searched my gmail but didn't find that? On Tue, May 13, 2014 at 9:49 AM, Sean Owen wrote: > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell > wrote: > > The release files, including signatures, digests, etc. can be found at: > > http://people.apache.org/~

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Mark Hamstra
There were a few early/test RCs this cycle that were never put to a vote. On Tue, May 13, 2014 at 8:07 AM, Nan Zhu wrote: > just curious, where is rc4 VOTE? > > I searched my gmail but didn't find that? > > > > > On Tue, May 13, 2014 at 9:49 AM, Sean Owen wrote: > > > On Tue, May 13, 2014 at 9

Re: Multinomial Logistic Regression

2014-05-13 Thread DB Tsai
Hi Deb, For K possible outcomes in multinomial logistic regression, we can have K-1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other K-1 outcomes are separately regressed against the pivot outcome. See my presentation for technical deta
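The pivot construction described here corresponds to the standard baseline-class form of multinomial logistic regression (written with class K as the pivot; the notation below is ours, not taken from the presentation):

```latex
% K-1 binary models regressed against the pivot class K:
\Pr(y_i = k \mid x_i) = \frac{\exp(\beta_k \cdot x_i)}{1 + \sum_{j=1}^{K-1} \exp(\beta_j \cdot x_i)},
\qquad k = 1, \dots, K-1,
% and the pivot class itself, so the K probabilities sum to 1:
\Pr(y_i = K \mid x_i) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\beta_j \cdot x_i)}.
```

Each of the K-1 weight vectors \(\beta_k\) is fit against the same pivot, which is why only K-1 independent binary models are needed for K outcomes.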

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark 0.9.0
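For reference, the two equals() styles being compared in this thread look like the sketch below (class names are illustrative). In compiled Scala they behave identically; the divergence reported here appears only inside the Spark REPL after closure serialization/deserialization.

```scala
// Canonical pattern-match style, reported to fail in the Spark REPL:
class WithMatch(val s: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: WithMatch => this.s == that.s
    case _               => false
  }
  override def hashCode: Int = s.hashCode
}

// isInstanceOf style, reported to work in the Spark REPL:
class WithInstanceOf(val s: String) extends Serializable {
  override def equals(other: Any): Boolean =
    other.isInstanceOf[WithInstanceOf] &&
      other.asInstanceOf[WithInstanceOf].s == this.s
  override def hashCode: Int = s.hashCode
}

// In plain (compiled or scala-shell) code both styles agree:
println(new WithMatch("a") == new WithMatch("a"))
println(new WithInstanceOf("a") == new WithInstanceOf("a"))
```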

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Sean Owen
On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell wrote: > The release files, including signatures, digests, etc. can be found at: > http://people.apache.org/~pwendell/spark-1.0.0-rc5/ Good news is that the sigs, MD5 and SHA are all correct. Tiny note: the Maven artifacts use SHA1, while the bina
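Checking the digests mentioned here can be done with nothing but the JDK. A sketch (the payload below is illustrative; a real check would read the downloaded artifact's bytes and compare against the published .md5/.sha files):

```scala
import java.security.MessageDigest

// Compute hex digests the way a release checker would compare them
// against the published checksum files.
def hexDigest(algo: String, bytes: Array[Byte]): String =
  MessageDigest.getInstance(algo).digest(bytes).map("%02x".format(_)).mkString

val payload = "spark-1.0.0-rc5".getBytes("UTF-8") // stand-in for artifact bytes
println("MD5:   " + hexDigest("MD5", payload))    // 32 hex chars
println("SHA-1: " + hexDigest("SHA-1", payload))  // 40 hex chars
```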

[VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0! The tag to be voted on is v1.0.0-rc5 (commit 18f0623): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=18f062303303824139998e8fc8f4158217b0dbc3 The release files, including signatures, digests, etc. can

Multinomial Logistic Regression

2014-05-13 Thread Debasish Das
Hi, Is there a PR for multinomial logistic regression which does one-vs-all and compares it to the other possibilities? @dbtsai in your strata presentation you used one-vs-all? Did you add some constraints on the fact that you penalize if mis-predicted labels are not very far from the true label

Sparse vector toLibSvm API

2014-05-13 Thread Debasish Das
Hi, In the sparse vector the toString API is as follows: override def toString: String = { "(" + size + "," + indices.zip(values).mkString("[", "," ,"]") + ")" } Does it make sense to keep it consistent with libsvm format? What does each line of libsvm format look like? Thanks. De
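To answer the format question raised here: each libsvm line is `<label> <index>:<value> <index>:<value> ...`, with 1-based indices in ascending order and zero entries omitted. A sketch of what such a rendering could look like (the helper name and signature are hypothetical, not part of the Spark API):

```scala
// Hypothetical helper showing a libsvm-style rendering of a sparse
// vector given parallel indices/values arrays (0-based, as in Spark).
def toLibSvmLine(label: Double, indices: Array[Int], values: Array[Double]): String = {
  val features = indices.zip(values)
    .sortBy(_._1)                           // libsvm requires ascending indices
    .map { case (i, v) => s"${i + 1}:$v" }  // libsvm indices are 1-based
    .mkString(" ")
  s"$label $features"
}

println(toLibSvmLine(1.0, Array(0, 3), Array(0.5, 2.0))) // 1.0 1:0.5 4:2.0
```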

Re: Updating docs for running on Mesos

2014-05-13 Thread Gerard Maas
Andrew, Mesosphere has binary releases here: http://mesosphere.io/downloads/ (Anecdote: I actually burned a CPU building Mesos from source. No kidding - there were warning signs, as the laptop was crashing from time to time, but the Mesos build was the one drop too much.) kr, Gerard. On Tue, May 13, 201

Re: Kryo not default?

2014-05-13 Thread Dmitriy Lyubimov
On Mon, May 12, 2014 at 2:47 PM, Anand Avati wrote: > Hi, > Can someone share the reason why Kryo serializer is not the default? why should it be? On top of it, the only way to serialize a closure into the backend (even now) is java serialization (which means java serialization is required of a

Re: Updating docs for running on Mesos

2014-05-13 Thread Andrew Ash
Completely agree about preferring to link to the upstream project rather than a company's -- the only reason I'm using Mesosphere's now is that I see no alternative from mesos.apache.org. I included instructions for both using Mesosphere's packages and building from scratch in the PR: https://githu

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
These numbers were run on git commit 756c96 (a few days after the 1.0.0-rc3 tag). Do you have a link to the patch that avoids scanning all columns for count(*) or count(1)? I'd like to give it a shot. Andrew On Mon, May 12, 2014 at 11:41 PM, Reynold Xin wrote: > Thanks for the experiments an

Re: Updating docs for running on Mesos

2014-05-13 Thread Matei Zaharia
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link only to a company’s builds; but we can perhaps include them in addition to instructions for building Mesos from Apache. Matei On May 12, 2014, at 11:55 PM, Gerard Maas wrote: > Andrew, > > Mesosphere has binary rele

Is this supported? : Spark on Windows, Hadoop YARN on Linux.

2014-05-13 Thread innowireless TaeYun Kim
I'm trying to run spark-shell on Windows that uses Hadoop YARN on Linux. Specifically, the environment is as follows: - Client - OS: Windows 7 - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8) - Server - Platform: hortonworks sandbox 2.1 I had to modify the spark source code to apply ht