Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-20 Thread Shivaram Venkataraman
FYI: the staging repository published as version 1.5.0 is at https://repository.apache.org/content/repositories/orgapachespark-1136, while the one published as version 1.5.0-rc1 is at https://repository.apache.org/content/repositories/orgapachespark-1137. Thanks, Shivaram. On Thu, Aug

[VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-20 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ...

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
Not sure what's going on or how you measure the time, but the difference here is pretty big when I test on my laptop. Maybe you set the wrong config variables? (spark.sql.* are SQL variables that you set via sqlContext.setConf -- and in 1.5, they are consolidated into a single flag: spark.sql.tungst
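
For reference, a minimal Scala sketch of setting these flags on a SQLContext (flag names as discussed in this thread; the consolidated 1.5 flag is assumed to be spark.sql.tungsten.enabled):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("tungsten-bench").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)

    // Spark 1.4-style flags (as discussed in this thread):
    sqlContext.setConf("spark.sql.codegen", "true")
    sqlContext.setConf("spark.sql.unsafe.enabled", "true")

    // Spark 1.5: the individual flags are consolidated into a single one:
    sqlContext.setConf("spark.sql.tungsten.enabled", "true")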

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
Did git pull :) Now I do get a difference in time between Tungsten unsafe on and off: it is 24-25 seconds (unsafe on) vs 32-26 seconds (unsafe off) for the example below. Why am I not getting the improvement advertised at Spark Summit (slide 23)? http://www.slideshare.net/SparkSummit/deep-dive-

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
Please git pull :) On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander wrote: > I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe > feature was added to Spark on April 29.) > > > > *From:* Reynold Xin [mailto:r...@databricks.com] > *Sent:* Thursday, August 20, 2015 5:26 P

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe feature was added to Spark on April 29.) From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 5:26 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation with Tungsten

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
Yes - DataFrame and SQL are the same thing. Which version are you running? Spark 1.4 doesn't use Janino, but you have a Janino exception? On Thu, Aug 20, 2015 at 5:01 PM, Ulanov, Alexander wrote: > When I add the following option: > > spark.sql.codegen true > > > > Spark crashed on the

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
When I add the following option: spark.sql.codegen true, Spark crashed on the “df.count” with an ExecutionException (below). Are you sure that I need to set this flag to get unsafe? It looks like a SQL flag, and I don’t use SQL. java.util.concurrent.ExecutionException: org.codehaus.commons.co

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
I think you might need to turn codegen on also in order for the unsafe stuff to work. On Thu, Aug 20, 2015 at 4:09 PM, Ulanov, Alexander wrote: > Hi Reynold, > > Thank you for suggestion. This code takes around 30 sec on my setup (5 > workers with 32GB). My issue is that I don't see the change

Re: PySpark on PyPi

2015-08-20 Thread westurner
On Aug 20, 2015 4:57 PM, "Justin Uang [via Apache Spark Developers List]" < ml-node+s1001551n13766...@n3.nabble.com> wrote: > > One other question: Do we have consensus on publishing the pip-installable source distribution to PyPI? If so, is that something that the maintainers need to add to the pr

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
Hi Reynold, Thank you for the suggestion. This code takes around 30 sec on my setup (5 workers with 32GB). My issue is that I don't see a change in time if I unset the unsafe flags. Could you explain why that might happen? On Aug 20, 2015, at 15:32, Reynold Xin <r...@databricks.com> wrote:

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
BTW one other thing -- don't use count() to benchmark, since the optimizer is smart enough to figure out that you don't actually need to run the sum. For benchmarking purposes, you can use df.foreach(i => do nothing) On Thu, Aug 20, 2015 at 3:31 PM, Reynold Xin wrote: > I did
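
A hedged sketch of that benchmarking pattern (assuming an existing DataFrame df with "key" and "value" columns as in the thread's example; the no-op foreach forces evaluation without letting the optimizer skip the aggregation):

    // Time an action by forcing a full evaluation with a no-op foreach,
    // rather than count(), which the optimizer can satisfy without the sum.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    val aggregated = df.groupBy("key").sum("value")
    time("aggregation") {
      aggregated.foreach(_ => ())  // do nothing per row, just materialize the result
    }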

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
I didn't wait long enough earlier. Actually it did finish when I raised memory to 8g. In 1.5 with Tungsten (which should be the same as 1.4 with your unsafe flags), the query took 40s with 4G of mem. In 1.4, it took 195s with 8G of mem. This is not a scientific benchmark and I only ran it once.

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
How did you run this? I couldn't run your query with 4G of RAM in 1.4, but in 1.5 it ran. Also I recommend just dumping the data to parquet on disk to evaluate, rather than using the in-memory cache, which is super slow and we are thinking of removing/replacing with something else. val size = 10
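
A minimal sketch of the parquet-on-disk approach suggested here (path and column names are placeholders; assumes an existing sqlContext and DataFrame df):

    // Write the generated data to parquet once, then benchmark against the
    // on-disk copy instead of the in-memory cache.
    df.write.parquet("/tmp/agg-bench.parquet")                 // Spark 1.4+ DataFrameWriter
    val benchDf = sqlContext.read.parquet("/tmp/agg-bench.parquet")
    benchDf.groupBy("key").sum("value").foreach(_ => ())       // force evaluation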

Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
Dear Spark developers, I am trying to benchmark the new DataFrame aggregation implemented under project Tungsten and released with Spark 1.4 (I am using the latest Spark from the repo, i.e. 1.5): https://github.com/apache/spark/pull/5725 It says that the aggregation should be faster due to
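
The benchmark itself is truncated above; a hedged, self-contained reconstruction of its general shape (data size, key cardinality, and column names are illustrative, not the exact code from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("df-agg-bench").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Generate keyed random data, then sum values per key with the DataFrame API.
    val size = 10000000                                   // placeholder size
    val df = sc.parallelize(1 to size)
      .map(i => (i % 1000, scala.util.Random.nextDouble()))
      .toDF("key", "value")

    df.groupBy("key").sum("value").foreach(_ => ())       // force evaluation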

Re: PySpark on PyPi

2015-08-20 Thread Justin Uang
One other question: Do we have consensus on publishing the pip-installable source distribution to PyPI? If so, is that something that the maintainers need to add to the process that they use to publish releases? On Thu, Aug 20, 2015 at 5:44 PM Justin Uang wrote: > I would prefer to just do it wi

Re: PySpark on PyPi

2015-08-20 Thread Justin Uang
I would prefer to just do it without the jar first as well. My hunch is that to run Spark the way it is intended, we need the wrapper scripts, like spark-submit. Does anyone know authoritatively if that is the case? On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot < o.girar...@lateral-thoughts.com

Re: PySpark on PyPi

2015-08-20 Thread Brian Granger
I would start with just the plain Python package without the JAR and then see if it makes sense to add the JAR over time. On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez wrote: > Hi all, > > I wanted to bubble up a conversation from the PR to this discussion to see > if there is support for the idea

Re: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-20 Thread Reynold Xin
Thanks for reporting back, Mark. I will soon post a release candidate. On Thursday, August 20, 2015, mkhaitman wrote: > Turns out it was a mix of user-error as well as a bug in the sbt/sbt build > that has since been fixed in the current 1.5 branch (I built from this > commit: b4f4e91c395cb69ce

Re: Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-20 Thread mkhaitman
Turns out it was a mix of user error as well as a bug in the sbt/sbt build that has since been fixed in the current 1.5 branch (I built from this commit: b4f4e91c395cb69ced61d9ff1492d1b814f96828). I've been testing out the dynamic allocation specifically and it's looking pretty solid! Haven't come

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-20 Thread Nicholas Chammas
I'm planning to close the survey to further responses early next week. If you haven't chimed in yet, the link to the survey is here: http://goo.gl/forms/erct2s6KRR We already have some great responses, which you can view. I'll share a summary after the survey is closed. Cheers! Nick On Mon,

Re: PySpark on PyPi

2015-08-20 Thread Brian Granger
Auberon, can you also post this to the Jupyter Google Group? On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez wrote: > Hi all, > > I've created an updated PR for this based off of the previous work of > @prabinb: > https://github.com/apache/spark/pull/8318 > > I am not very familiar with python pa

Re: [spark-csv] how to build with Hadoop 2.6.0?

2015-08-20 Thread Mohit Jaggi
2.2.0 is the default Hadoop version Spark uses if a specific version is not specified while building it. spark-csv uses spark-packages to "link" with Spark; ideally, it would not care about any specific Hadoop version. Also ideally, spark-csv should not have that Hadoop import at all. Your worka
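
As a hedged illustration only (not necessarily the workaround being referred to above), a downstream build.sbt could pin the Hadoop client version itself; coordinates and versions here are assumptions:

    // build.sbt sketch: depend on spark-csv but force the Hadoop client
    // version used by the rest of the build. Versions are illustrative.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-sql"     % "1.4.1" % "provided",
      "com.databricks"    %% "spark-csv"     % "1.2.0",
      "org.apache.hadoop" %  "hadoop-client" % "2.6.0"
    )

    dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "2.6.0"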