Getting the execution times of spark job

2014-09-02 Thread Niranda Perera
Hi, I have been playing around with Spark for a couple of days. I am using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the implementation is to run Hive queries on Spark. I used JavaHiveContext to achieve this (as per the examples). I have 2 questions. 1. I am wondering how I could

Re: Getting the execution times of spark job

2014-09-02 Thread Zongheng Yang
For your second question: hql() (as well as sql()) does not launch a Spark job immediately; instead, it fires off the Spark SQL parser/optimizer/planner pipeline first, and a Spark job will be started after a physical execution plan is selected. Therefore, your hand-rolled end-to-end measuremen
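
The point about planning versus execution can be illustrated with a minimal sketch in plain Java. It does not call Spark at all: plan() and its sleep durations are hypothetical stand-ins for the parser/optimizer/planner pipeline and for the job itself, chosen only to show why a single end-to-end timer covers more than the job and how the two phases could be timed separately.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class QueryTimingSketch {

    // Hypothetical stand-in for the parser/optimizer/planner pipeline:
    // it does some up-front work and returns a plan that has not run yet.
    static Supplier<List<String>> plan(String query) {
        sleepMillis(20); // simulated planning cost (assumption, not a Spark figure)
        return () -> {
            sleepMillis(50); // simulated job execution cost (assumption)
            return Collections.singletonList("result of " + query);
        };
    }

    static void sleepMillis(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Times both phases and returns {planningMillis, executionMillis}.
    static long[] timePhases(String query) {
        long t0 = System.nanoTime();
        Supplier<List<String>> planned = plan(query); // planning only, nothing executed yet
        long t1 = System.nanoTime();
        planned.get();                                // forces the simulated job to run
        long t2 = System.nanoTime();
        return new long[] {(t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000};
    }

    public static void main(String[] args) {
        long[] phases = timePhases("SELECT 1");
        System.out.println("planning:   " + phases[0] + " ms");
        System.out.println("execution:  " + phases[1] + " ms");
        System.out.println("end-to-end: " + (phases[0] + phases[1]) + " ms");
    }
}
```

A timer wrapped around both calls reports the sum of the two phases, so it overstates the job's own run time by whatever the planning step costs.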

Re: Spark SQL Query and join different data sources.

2014-09-02 Thread Yin Huai
Actually, with HiveContext, you can join hive tables with registered temporary tables. On Fri, Aug 22, 2014 at 9:07 PM, chutium wrote: > oops, thanks Yan, you are right, i got > > scala> sqlContext.sql("select * from a join b").take(10) > java.lang.RuntimeException: Table Not Found: b >

about spark assembly jar

2014-09-02 Thread scwf
Hi all, I suggest that Spark not use the assembly jar as its default run-time dependency (spark-submit/spark-class depend on the assembly jar); using a library of all the third-party dependency jars such as hadoop/hive/hbase seems more reasonable. 1. The assembly jar packages all third-party jars into one big jar, so we need to rebuild this jar if w

Re: about spark assembly jar

2014-09-02 Thread Sean Owen
Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not remove version conflicts, just pushes them to run-time, which isn't good. The assembly is also necessary because that's where shading happens. In development, you want to run again

Re: about spark assembly jar

2014-09-02 Thread scwf
Yes, I am not sure what happens when building the assembly jar; in my understanding it just packages all the dependency jars into one big jar. On 2014/9/2 16:45, Sean Owen wrote: Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not rem

Re: about spark assembly jar

2014-09-02 Thread Ye Xianjin
Sorry, the quick reply didn't cc the dev list. Sean, sometimes I have to use the spark-shell to confirm some behavior change. In that case, I have to reassemble the whole project. Is there another way around this that does not use the big jar in development? For the original question, I have no comments

Re: about spark assembly jar

2014-09-02 Thread scwf
Hi Sean Owen, here are some problems I hit when using the assembly jar. 1. I put spark-assembly-*.jar in the lib directory of my application, and it throws a compile error: Error:scalac: Error: class scala.reflect.BeanInfo not found. scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Will Benton
Zongheng pointed out in my SPARK-3329 PR (https://github.com/apache/spark/pull/2220) that Aaron had already fixed this issue but that it had gotten inadvertently clobbered by another patch. I don't know how the project handles this kind of problem, but I've rewritten my SPARK-3329 branch to ch

Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging. -Sandy On Tue, Sep 2, 2014 at 2:12 AM, scwf wrote: > Hi sean owen, > here are some problems when i used assembly jar > 1 i

hive client.getAllPartitions in lookupRelation can take a very long time

2014-09-02 Thread chutium
in our hive warehouse there are many tables with a lot of partitions, such as scala> hiveContext.sql("use db_external") scala> val result = hiveContext.sql("show partitions et_fullorders").count result: Long = 5879 i noticed that this part of code: https://github.com/apache/spark/blob/9d006c97371d

hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread shane knapp
so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production qualit

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Reynold Xin
Welcome, Shane! On Tuesday, September 2, 2014, shane knapp wrote: > so, i had a meeting w/the databricks guys on friday and they recommended i > send an email out to the list to say 'hi' and give you guys a quick intro. > :) > > hi! i'm shane knapp, the new AMPLab devops engineer, and will be

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Nicholas Chammas
Hi Shane! Thank you for doing the Jenkins upgrade last week. It's nice to know that infrastructure is gonna get some dedicated TLC going forward. Welcome aboard! Nick On Tue, Sep 2, 2014 at 1:35 PM, shane knapp wrote: > so, i had a meeting w/the databricks guys on friday and they recommended

Re: about spark assembly jar

2014-09-02 Thread Reynold Xin
Having an SSD helps tremendously with assembly time. Without that, you can do the following in order for Spark to pick up the compiled classes before assembly at runtime. export SPARK_PREPEND_CLASSES=true On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza wrote: > This doesn't help for every dependency

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Patrick Wendell
Hey Shane, Thanks for your work so far and I'm really happy to see investment in this infrastructure. This is a key productivity tool for us and something we'd love to expand over time to improve the development process of Spark. - Patrick On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas wrote

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Christopher Nguyen
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting this to be a first-class engineering college subject. I just didn't expect it to come through this route :-) So congrats, and I hope you represent the beginning of a great new trend at universities. Sent while mobile. Ple

Resource allocation

2014-09-02 Thread rapelly kartheek
Hi, I want to incorporate some intelligence when choosing the resources for RDD replication. I thought that if we replicate an RDD on specially chosen nodes based on their capabilities, the next application that requires this RDD can be executed more efficiently. But I found that an RDD created by an appp

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a "developer notes" page to document all this useful black magic. On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin wrote: > Having a SSD help tremendously with assembly time. > > Without that, you can do the following

Re: about spark assembly jar

2014-09-02 Thread Josh Rosen
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote: Yea, SSD + SPARK_PREPEND_CLASSES totally chang

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Cool, didn't notice that, thanks Josh! On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen wrote: > SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could > probably be easier to find): > https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools > > > On September 2, 2014 at

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Henry Saputra
Welcome Shane =) - Henry On Tue, Sep 2, 2014 at 10:35 AM, shane knapp wrote: > so, i had a meeting w/the databricks guys on friday and they recommended i > send an email out to the list to say 'hi' and give you guys a quick intro. > :) > > hi! i'm shane knapp, the new AMPLab devops engineer,

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Cheng Lian
Welcome Shane! Glad to see a hero finally jumping out to tame Jenkins :) On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra wrote: > Welcome Shane =) > > > - Henry > > On Tue, Sep 2, 2014 at 10:35 AM, shane knapp wrote: > > so, i had a meeting w/the databricks guys on friday and they recommen

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Will Benton
+1 Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK 8). best, wb - Original Message - > From: "Patrick Wendell" > To: dev@spark.apache.org > Sent: Saturday, August 30, 2014 5:07:52 PM > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3) > > Please vote on rele

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Cheng Lian
+1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined. On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wrote: > +1 > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK

Checkpointing Pregel

2014-09-02 Thread Jeffrey Picard
Hey guys, I’m trying to run connected components on graphs that end up running for a fairly large number of iterations (25-30) and take 5-6 hours. I find more than half the time I end up getting fetch failures and losing an executor after a number of iterations. Then it has to go back and recom

Ask something about spark

2014-09-02 Thread Sanghoon Lee
Hi, I am phoenixlee, a Spark programmer in Korea. I have a good opportunity this time to teach Spark to college students and office workers. This course will be run with the support of the government. May I use the material (pictures, samples, etc.) on the Spark homepage for this course? Of co

Re: Ask something about spark

2014-09-02 Thread Reynold Xin
I think in general that is fine. It would be great if your slides come with proper attribution. On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee wrote: > Hi, I am phoenixlee and a Spark programmer in Korea. > > And be a good chance this time, it tries to teach college students and > office workers

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Reynold Xin
+1 On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian wrote: > +1 > >- Tested Thrift server and SQL CLI locally on OSX 10.9. >- Checked datanucleus dependencies in distribution tarball built by >make-distribution.sh without SPARK_HIVE defined. > > > On Tue, Sep 2, 2014 at 2:30 PM, Wil

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Kan Zhang
+1 Verified PySpark InputFormat/OutputFormat examples. On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin wrote: > +1 > > > On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian wrote: > > > +1 > > > >- Tested Thrift server and SQL CLI locally on OSX 10.9. > >- Checked datanucleus dependencies in distr

quick jenkins restart

2014-09-02 Thread shane knapp
since our queue is really short, i'm waiting for a couple of builds to finish and will be restarting jenkins to install/update some plugins. the github pull request builder looks like it has some fixes to reduce spammy github calls, and reduce any potential rate limiting. i'll let everyone know w

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote: +1 Verified PySpark InputFormat/OutputFormat examples. On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin wrote: > +1 > > > On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian wrote: > >

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Michael Armbrust
+1 On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia wrote: > +1 > > Tested on Mac OS X. > > Matei > > On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote: > > +1 > > Verified PySpark InputFormat/OutputFormat examples. > > > On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin wrote: >

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Denny Lee
+1 Tested on Mac OSX, Thrift Server, SparkSQL On September 2, 2014 at 17:29:29, Michael Armbrust (mich...@databricks.com) wrote: +1 On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia wrote: > +1 > > Tested on Mac OS X. > > Matei > > On September 2, 2014 at 5:03:19 PM, Kan Zhan

RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Sean McNamara
+1 From: Patrick Wendell [pwend...@gmail.com] Sent: Saturday, August 30, 2014 4:08 PM To: dev@spark.apache.org Subject: [VOTE] Release Apache Spark 1.1.0 (RC3) Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be vo

RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Jeremy Freeman
+1 -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: quick jenkins restart

2014-09-02 Thread shane knapp
and we're back and building! On Tue, Sep 2, 2014 at 5:07 PM, shane knapp wrote: > since our queue is really short, i'm waiting for a couple of builds to > finish and will be restarting jenkins to install/update some plugins. the > github pull request builder looks like it has some fixes to red

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Paolo Platter
+1 Tested on HDP 2.1 Sandbox, Thrift Server with Simba Shark ODBC Paolo From: Jeremy Freeman Sent: Wednesday, September 3, 2014 02:34 To: d...@spark.incubator.apache.org +1 -- View this message in context: ht

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Nicholas Chammas
In light of the discussion on SPARK-, I'll revoke my "-1" vote. The issue does not appear to be serious. On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > -1: I believe I've found a regression from 1.0.2. The report is captured > in SPARK-

Re: about spark assembly jar

2014-09-02 Thread scwf
Yea, SSD + SPARK_PREPEND_CLASSES is great for iterative development! Then why is it OK with a bag of third-party jars but throws an error with the assembly jar? Does anyone have an idea? On 2014/9/3 2:57, Cheng Lian wrote: Cool, didn't notice that, thanks Josh! On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen mailto:ro

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Patrick Wendell
Thanks everyone for voting on this. Two minor issues (one a blocker) were found that warrant cutting a new RC. For those who voted +1 on this release, I'd encourage you to +1 rc4 when it comes out unless you have been testing issues specific to the EC2 scripts. This will move the release

Re: [Spark SQL] off-heap columnar store

2014-09-02 Thread Evan Chan
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell wrote: > I'm not sure what you mean here? Parquet is at its core just a format, you > could store that data anywhere. > > Though it sounds like you saying, correct me if i'm wrong: you basically > want a columnar abstraction layer where you can provid