How does Spark partition data when creating a table like "create table xxx as select * from xxx"?

2014-05-29 Thread qingyang li
Hi Spark developers, I am using Shark/Spark and am puzzled by a question I cannot find any information about on the web, so I am asking you. 1. How does Spark partition data in memory when creating a table using "create table a tblproperties("shark.cache"="memory") as select * from table b ", in anot
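A sketch of the mechanism in question: for keyed data, Spark assigns records to partitions with a hash partitioner (hash of the key modulo the partition count); a cached table created from a scan generally inherits the partitioning of the source RDD. The pure-Python stand-in below only illustrates the hash-modulo idea — the names and row data are invented, and this is not Shark's actual code path.

```python
# Illustrative only: mimic how a hash partitioner assigns each record to
# one of N partitions by key hash modulo the partition count.
def hash_partition(key, num_partitions):
    # Python's % already yields a non-negative result for a positive modulus
    return hash(key) % num_partitions

rows = [("alice", 1), ("bob", 2), ("carol", 3), ("dave", 4)]
num_partitions = 3

# Bucket every row into its partition, as a cached table build would.
partitions = {i: [] for i in range(num_partitions)}
for key, value in rows:
    partitions[hash_partition(key, num_partitions)].append((key, value))
```

Each row lands in exactly one bucket, and every partition index stays in range regardless of the key distribution.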

Suggestion or question: Adding rdd.cancelCache() method

2014-05-29 Thread innowireless TaeYun Kim
What I understand is that rdd.cache() is really rdd.cache_this_rdd_when_it_actually_materializes(). So a somewhat esoteric problem may occur. The example is as follows: void method1() { JavaRDD<...> rdd = sc.textFile(...) .map(...); rdd.cache(); // since the follo
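The semantics described above — cache() only marks the RDD, and storage happens on the first action — can be sketched with a toy class. The proposed cancelCache() would simply clear that mark before anything materializes (the later-added rdd.unpersist() covers the post-materialization case). Class and method names here are invented for illustration; this is not Spark API.

```python
# Toy model of lazy caching: cache() sets a flag, collect() materializes,
# and cancel_cache() (the suggestion in this thread) undoes the flag.
class LazyRDD:
    def __init__(self, compute):
        self._compute = compute        # deferred computation
        self._cache_requested = False
        self._cached = None            # filled on first action if requested

    def cache(self):
        self._cache_requested = True   # nothing materializes yet
        return self

    def cancel_cache(self):
        self._cache_requested = False  # undo the request before it takes effect
        self._cached = None
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached        # serve from "memory"
        data = self._compute()
        if self._cache_requested:
            self._cached = data        # materialize-time caching
        return data

rdd = LazyRDD(lambda: [x * 2 for x in range(3)]).cache()
rdd.cancel_cache()          # cancelled before any action ran
result = rdd.collect()      # computes, but does not store
```

Because the cache request was cancelled before the first action, the data is computed normally but never pinned.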

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Andy Konwinski
Yes great work all. Special thanks to Patrick (and TD) for excellent leadership! On May 29, 2014 5:39 PM, "Usman Ghani" wrote: > Congrats everyone. Really pumped about this. > > > On Thu, May 29, 2014 at 2:57 PM, Henry Saputra > wrote: > > > Congrats guys! Another milestone for Apache Spark inde

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Usman Ghani
Congrats everyone. Really pumped about this. On Thu, May 29, 2014 at 2:57 PM, Henry Saputra wrote: > Congrats guys! Another milestone for Apache Spark indeed =) > > - Henry > > On Thu, May 29, 2014 at 2:08 PM, Matei Zaharia > wrote: > > Yup, congrats all. The most impressive thing is the numbe

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Henry Saputra
Congrats guys! Another milestone for Apache Spark indeed =) - Henry On Thu, May 29, 2014 at 2:08 PM, Matei Zaharia wrote: > Yup, congrats all. The most impressive thing is the number of contributors to > this release — with over 100 contributors, it’s becoming hard to even write > the credits.

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Matei Zaharia
Yup, congrats all. The most impressive thing is the number of contributors to this release — with over 100 contributors, it’s becoming hard to even write the credits. Look forward to the Apache press release tomorrow. Matei On May 29, 2014, at 1:33 PM, Patrick Wendell wrote: > Congrats everyo

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
Congrats everyone! This is a huge accomplishment! On Thu, May 29, 2014 at 1:24 PM, Tathagata Das wrote: > Hello everyone, > > The vote on Spark 1.0.0 RC11 passes with 13 "+1" votes, one "0" vote and no > "-1" vote. > > Thanks to everyone who tested the RC and voted. Here are the totals: > > +1: (1

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Let me put in my +1 as well! This voting is now closed, and it successfully passes with 13 "+1" votes and one "0" vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNa

[RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Hello everyone, The vote on Spark 1.0.0 RC11 passes with 13 "+1" votes, one "0" vote and no "-1" vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Me

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-29 Thread Patrick Wendell
[tl;dr: stable APIs are important - sorry, this is slightly meandering] Hey - just wanted to chime in on this as I was travelling. Sean, you bring up great points here about the velocity and stability of Spark. Many projects have fairly customized semantics around what versions actually mean (HBas

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
You should be able to get away with only doing it locally. This bug is happening during analysis which only occurs on the driver. On Thu, May 29, 2014 at 10:17 AM, dataginjaninja < rickett.stepha...@gmail.com> wrote: > Darn, I was hoping just to sneak it in that file. I am not the only person >

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Yes, you'll need to download the code from that PR and reassemble Spark (sbt/sbt assembly). On Thu, May 29, 2014 at 10:02 AM, dataginjaninja < rickett.stepha...@gmail.com> wrote: > Michael, > > Will I have to rebuild after adding the change? Thanks > > > > -- > View this message in context: > ht

Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Darn, I was hoping just to sneak it in that file. I am not the only person working on the cluster; if I rebuild it, that means I have to redeploy everything to all the nodes as well. So I cannot do that ... today. If someone else doesn't beat me to it, I can rebuild at another time. - Cheer

Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Michael, Will I have to rebuild after adding the change? Thanks -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Yes, I get the same error: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc) 14/05/29 16:53:40 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 14/05/29 16:53:40 INFO deprecation: mapred.max.split.size is depre

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Thanks for reporting this! https://issues.apache.org/jira/browse/SPARK-1964 https://github.com/apache/spark/pull/913 If you could test out that PR and see if it fixes your problems I'd really appreciate it! Michael On Thu, May 29, 2014 at 9:09 AM, Andrew Ash wrote: > I can confirm that the c

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
+1 I spun up a few EC2 clusters and ran my normal audit checks. Tests passing, sigs, CHANGES and NOTICE look good Thanks TD for helping cut this RC! On Wed, May 28, 2014 at 9:38 PM, Kevin Markey wrote: > +1 > > Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 > Ran current version of one of my

Re: Timestamp support in v1.0

2014-05-29 Thread Andrew Ash
I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep t

Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Can anyone verify which RC includes [SPARK-1360] Add Timestamp Support for SQL (#275)? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in PySpark. *The error I get: * 14/05/29 15:44:47 I

Re: Standard preprocessing/scaling

2014-05-29 Thread dataginjaninja
I do see the issue with centering sparse data. Actually, the centering is less important than the scaling by the standard deviation: not having unit variance causes the convergence issues and long runtimes. Will RowMatrix compute the variance of a column?
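The preprocessing discussed above — scale each feature column to unit variance, optionally also centering to zero mean — can be sketched in a few lines. This is a pure-Python stand-in, not Spark/MLlib API; the function name and sample data are invented. Note the point from the thread: centering a sparse column replaces its zeros, destroying sparsity, while scaling alone preserves it.

```python
# Sketch: per-column standardization (unit variance, optional zero mean).
from statistics import mean, pstdev

def scale_columns(rows, center=False):
    cols = list(zip(*rows))                  # transpose rows -> columns
    scaled = []
    for col in cols:
        mu = mean(col) if center else 0.0    # centering densifies sparse data
        sd = pstdev(col) or 1.0              # guard constant columns
        scaled.append([(x - mu) / sd for x in col])
    return [list(r) for r in zip(*scaled)]   # transpose back to rows

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = scale_columns(data, center=True)
```

After scaling, every column has (population) standard deviation 1, and with center=True its mean is 0 — exactly the property that helps gradient-based solvers converge.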

Please change instruction about "Launching Applications Inside the Cluster"

2014-05-29 Thread Lizhengbing (bing, BIPA)
The instructions are at http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster or http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster The original instruction is: "./bin/spark-class org.apache.spark.deplo

Re: LogisticRegression: Predicting continuous outcomes

2014-05-29 Thread Bharath Ravi Kumar
Xiangrui, Christopher, Thanks for responding. I'll go through the code in detail to evaluate if the loss function used is suitable to our dataset. I'll also go through the referred paper since I was unaware of the underlying theory. Thanks again. -Bharath On Thu, May 29, 2014 at 8:16 AM, Chri
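For context on the loss-function question in this thread: logistic regression minimizes the log loss, which assumes binary labels in {0, 1}; for a continuous outcome, a squared loss (linear regression) is usually the better fit. The sketch below just states the two losses for comparison — it is not Spark/MLlib code, and the function names are ours.

```python
# Compare the per-example losses: log loss expects y in {0, 1},
# squared loss handles continuous y.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p):
    """Logistic loss for label y in {0, 1} and predicted probability p."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_loss(y, yhat):
    """Squared-error loss, suitable for continuous outcomes."""
    return 0.5 * (y - yhat) ** 2
```

A confident correct prediction (y=1, p=0.9) incurs far less log loss than a confident wrong one (y=1, p=0.1), which is the behavior a continuous-outcome dataset would not match.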

RE: Suggestion: RDD cache depth

2014-05-29 Thread innowireless TaeYun Kim
Opened a JIRA issue. (https://issues.apache.org/jira/browse/SPARK-1962) Thanks. -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Thursday, May 29, 2014 3:54 PM To: dev@spark.apache.org Subject: Re: Suggestion: RDD cache depth This is a pretty cool idea - in
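The "cache depth" idea filed above can be sketched as reference-counted caching: nested cache()/uncache() pairs only release storage when the count returns to zero, so an inner method cannot accidentally evict data an outer caller still needs. All names here are invented for illustration; this is not actual Spark API.

```python
# Toy model of depth-counted (reference-counted) caching.
class RefCountedCache:
    def __init__(self):
        self.depth = 0
        self.stored = False

    def cache(self):
        self.depth += 1
        self.stored = True

    def uncache(self):
        self.depth = max(self.depth - 1, 0)
        if self.depth == 0:
            self.stored = False   # actually release only at depth zero

c = RefCountedCache()
c.cache()    # outer caller caches
c.cache()    # inner method caches too
c.uncache()  # inner releases; data must survive for the outer caller
outer_still_cached = c.stored
c.uncache()  # outer releases; now the storage is actually freed
```

With plain cache()/unpersist() the inner uncache would have evicted the data immediately; the counter defers the eviction to the outermost release.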