Re: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Michael Armbrust
I'd suggest looking at the reference in the programming guide: http://spark.apache.org/docs/latest/sql-programming-guide.html#spark-sql-datatype-reference On Thu, Dec 11, 2014 at 6:45 PM, Alessandro Baretta wrote: > Thanks. This is useful. > > Alex > > On Thu, Dec 11, 2014 at 4:35 PM, Cheng, Ha

Re: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Alessandro Baretta
Thanks. This is useful. Alex On Thu, Dec 11, 2014 at 4:35 PM, Cheng, Hao wrote: > > Part of it can be found at: > > https://github.com/apache/spark/pull/3429/files#diff-f88c3e731fcb17b1323b778807c35b38R34 > > Sorry it's a TO BE reviewed PR, but still should be informative. > > Cheng Hao > >

Re: Tachyon in Spark

2014-12-11 Thread Reynold Xin
Actually HY emailed me offline about this and this is supported in the latest version of Tachyon. It is a hard problem to push this into storage; need to think about how to handle isolation, resource allocation, etc. https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/D

Is there any document to explain how to build the hive jars for spark?

2014-12-11 Thread Yi Tian
Hi all, We found some bugs in hive-0.12, but we cannot wait for the Hive community to fix them. We want to fix these bugs in our lab and build a new release that Spark can recognize. As we know, Spark depends on a special release of Hive, like: | org.spark-project.hive hive-me

RE: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Cheng, Hao
Part of it can be found at: https://github.com/apache/spark/pull/3429/files#diff-f88c3e731fcb17b1323b778807c35b38R34 Sorry it's a TO BE reviewed PR, but still should be informative. Cheng Hao -Original Message- From: Alessandro Baretta [mailto:alexbare...@gmail.com] Sent: Friday, Decem

running the Terasort example

2014-12-11 Thread Tim Harsch
Hi all, I just joined the list, so I don't have a message history that would allow me to reply to this post: http://apache-spark-developers-list.1001551.n3.nabble.com/Terasort-example-td9284.html I am interested in running the terasort example. I cloned the repo https://github.com/ehiggs/spark a

Re: Tachyon in Spark

2014-12-11 Thread Reynold Xin
I don't think the lineage thing is even turned on in Tachyon - it was mostly a research prototype, so I don't think it'd make sense for us to use that. On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash wrote: > I'm interested in understanding this as well. One of the main ways Tachyon > is supposed

Re: Tachyon in Spark

2014-12-11 Thread Andrew Ash
I'm interested in understanding this as well. One of the main ways Tachyon is supposed to realize performance gains without sacrificing durability is by storing the lineage of data rather than full copies of it (similar to Spark). But if Spark isn't sending lineage information into Tachyon, then

Re: Evaluation Metrics for Spark's MLlib

2014-12-11 Thread Joseph Bradley
Hi, I'd recommend starting by checking out the existing helper functionality for these tasks. There are helper methods to do K-fold cross-validation in MLUtils: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala The experimental spark.ml API
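For readers unfamiliar with the idea, K-fold cross-validation can be sketched in a few lines of plain Python. This is only an illustration of the splitting step under simple assumptions (a fixed seed, round-robin fold assignment after a shuffle); the function name `k_fold_indices` is hypothetical and this is not how the MLUtils helper is implemented.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train, validation) pairs of index lists over range(n_samples).

    Hypothetical helper for illustration only; not the MLUtils implementation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k disjoint folds.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        # Training set is every fold except the i-th one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits
```

Each index lands in exactly one validation fold, so a model trained and scored once per pair sees every sample as held-out data exactly once.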

Re: jenkins downtime: 730-930am, 12/12/14

2014-12-11 Thread shane knapp
here's the plan... reboots, of course, come last. :) pause build queue at 7am, kill off (and eventually retrigger) any stragglers at 8am. then begin maintenance: all systems: * yum update all servers (amp-jekins-master, amp-jenkins-slave-{01..05}, amp-jenkins-worker-{01..08}) * reboots jenkin

Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Alessandro Baretta
Michael & other Spark SQL junkies, As I read through the Spark API docs, in particular those for the org.apache.spark.sql package, I can't seem to find details about the Scala classes representing the various SparkSQL DataTypes, for instance DecimalType. I find DataType classes in org.apache.spark

Evaluation Metrics for Spark's MLlib

2014-12-11 Thread kidynamit
Hi, I would like to contribute to Spark's Machine Learning library by adding evaluation metrics that would be used to gauge the accuracy of a model given a certain feature set. In particular, I seek to contribute the k-fold validation metrics, f-beta metric among others on top of the current ML
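As a point of reference for the F-beta metric mentioned here, a minimal plain-Python sketch follows. It uses the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R) for binary classification counts; the function name `f_beta` and the choice to return 0.0 when precision or recall is undefined are assumptions made for this illustration, not anything taken from MLlib.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=1.0):
    """Compute the F-beta score from binary classification counts.

    Hypothetical helper for illustration; returns 0.0 when the score is
    undefined (an assumed convention, not MLlib behavior).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    beta_sq = beta * beta
    # beta > 1 weights recall more heavily; beta < 1 favors precision.
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

With beta = 1 this reduces to the familiar F1 score, the harmonic mean of precision and recall.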

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sandy Ryza
+1 (non-binding). Tested on Ubuntu against YARN. On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin wrote: > +1 > > Tested on OS X. > > On Wednesday, December 10, 2014, Patrick Wendell > wrote: > > > Please vote on releasing the following candidate as Apache Spark version > > 1.2.0! > > > > The tag

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Reynold Xin
+1 Tested on OS X. On Wednesday, December 10, 2014, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.2.0! > > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c44

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sean Owen
Signatures and checksums are OK. License and notice still look fine. The plain-vanilla source release compiles with Maven 3.2.1 and passes tests, on OS X 10.10 + Java 8. On Wed, Dec 10, 2014 at 9:08 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark ver

Re: HA support for Spark

2014-12-11 Thread Jun Feng Liu
Interesting. Are you saying that the StreamingContext checkpoint can regenerate the DAG? Best Regards Jun Feng Liu IBM China Systems & Technology Laboratory in Beijing Phone: 86-10-82452683 E-mail: liuj...@cn.ibm.com BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 C

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Madhu
+1 (non-binding) Built and tested on Windows 7: cd apache-spark git fetch git checkout v1.2.0-rc2 sbt assembly [warn] ... [warn] [success] Total time: 720 s, completed Dec 11, 2014 8:57:36 AM dir assembly\target\scala-2.10\spark-assembly-1.2.0-hadoop1.0.4.jar 110,361,054 spark-assembly-1.2.0-had

Re: HA support for Spark

2014-12-11 Thread Tathagata Das
Spark Streaming essentially does this by saving the DAG of DStreams, which can deterministically regenerate the DAG of RDDs upon recovery from failure. Along with that the progress information (which batches have finished, which batches are queued, etc.) is also saved, so that upon recovery the sys