Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-22 Thread Mark Hamstra
So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the "Building Spark" page isn't looking too good to me. Working from f97b0d4, the example build command works: mvn -Pyarn -

Re: Spark SQL - Long running job

2015-02-22 Thread nitin
I believe calling processedSchemaRdd.persist(DISK) and processedSchemaRdd.checkpoint() only persists the data, so I will lose all the RDD metadata, and when I restart my driver that data is kind of useless to me (correct me if I am wrong). I thought of doing processedSchemaRdd.saveAsParquetFile (hdf

Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table before caching it? That way, after restarting your service you only need to cache the result table, without recomputing it. Somewhat like checkpointing. Cheng On 2/22/15 12:55 AM, nitin wrote: Hi All, I intend to build a long running spark ap

Re: Have Friedman's glmnet algo running in Spark

2015-02-22 Thread Joseph Bradley
Hi Mike, glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs: GLMNET implementation in Spark LinearRegression with L1/L2 (elas

Re: textFile() ordering and header rows

2015-02-22 Thread Nicholas Chammas
I guess on a technicality the docs just say "first item in this RDD", not "first line in the source text file". AFAIK there is no way apart from filtering to remove header lines. As long as first() always returns the same value for a given RDD, I think

textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading? ---
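The filtering approach mentioned in the reply can be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes the input exposes first() and filter(), as a PySpark RDD does, and both drop_header and ListBackedLines are hypothetical names introduced here so the sketch runs without a cluster.

```python
def drop_header(lines):
    """Return `lines` without any row equal to its first element.

    `lines` is assumed to expose first() and filter(predicate),
    as a PySpark RDD does.
    """
    header = lines.first()
    return lines.filter(lambda row: row != header)


class ListBackedLines:
    """Hypothetical in-memory stand-in for an RDD (illustration only)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def first(self):
        # Mirrors RDD.first(): return the first element.
        return self.rows[0]

    def filter(self, predicate):
        # Mirrors RDD.filter(): keep elements for which predicate is true.
        return ListBackedLines(r for r in self.rows if predicate(r))

    def collect(self):
        # Mirrors RDD.collect(): return all elements as a list.
        return list(self.rows)


sample = ListBackedLines(["id,name", "1,ann", "2,bob"])
data_rows = drop_header(sample).collect()
```

Against a real RDD the same two calls would apply, for example drop_header(sc.textFile(path)) with an existing SparkContext sc. Note that the filter removes every row identical to the header, not only the first one, and it relies on first() returning the header row, which is the very point the question raises.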

Re: Improving metadata in Spark JIRA

2015-02-22 Thread Nicholas Chammas
Open pull request count is down to 254 right now from ~325 several weeks ago. This is great. Ideally, we need to get this down to < 50 and keep it there. Having so many open pull requests is just a bad signal to contributors. But it will take some time to get there. - 1+ Component Sean, do you

Re: Improving metadata in Spark JIRA

2015-02-22 Thread Sean Owen
Open pull request count is down to 254 right now from ~325 several weeks ago. Open JIRA count is down slightly to 1262 from a peak of over 1320, and that is in the face of an ever faster and larger stream of contributions. There's a real positive impact of JIRA being a little more meaningful, a littl

Git Achievements

2015-02-22 Thread Nicholas Chammas
For fun: http://acha-acha.co/#/repo/https://github.com/apache/spark I just added Spark to this site. Some of these “achievements” are hilarious. Leo Tolstoy: More than 10 lines in a commit message. Dangerous Game: Commit after 6PM Friday. Nick