Hi Reynold and others,
I agree with your comments on mid-tenured objects and GC. In fact, dealing
with mid-tenured objects is the major challenge for all Java GC
implementations. I am wondering if anyone has played with the
-XX:+PrintTenuringDistribution flag and seen what the age distribution
actually looks like.
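In case anyone wants to try, here is a minimal sketch of one way to turn
that flag on for the executors (the app name is illustrative; an entry in
spark-defaults.conf works too):

import org.apache.spark.{SparkConf, SparkContext}

// Ask each executor JVM to print its tenuring histogram to stdout,
// so we can see how long objects survive in the young generation.
val conf = new SparkConf()
  .setAppName("gc-tenuring-probe")
  .set("spark.executor.extraJavaOptions", "-XX:+PrintTenuringDistribution")
val sc = new SparkContext(conf)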
Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
To
This isn't really answering the question, but for what it is worth, I
manage several different branches of Spark and publish custom-named
versions regularly to an internal repository, and this is *much* easier
with SBT than with Maven. You can actually link the Spark SBT build into
an external SBT
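For the curious, here is a minimal build.sbt sketch of the kind of setup I
mean (the version-suffix property and the repository URL are made-up
placeholders, not our actual values):

// build.sbt -- illustrative sketch of custom-named publishing
version := "1.4.1" + sys.props.get("spark.version.suffix").map("-" + _).getOrElse("")
publishTo := Some("internal-releases" at "https://repo.example.com/releases")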
There is a lot of GC activity due to the non-code-gen path being sloppy
about garbage creation. This is not actually what happens, but just as an
example:
rdd.map { i: Int => i + 1 }
Under the hood this becomes a closure that boxes every input and every
output, creating two extra objects per record.
Th
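To make the boxing concrete, here is a rough sketch; this is an assumption
about the erased, non-specialized call path, not a trace of what Spark
actually generates:

val f: Int => Int = i => i + 1
val g = f.asInstanceOf[AnyRef => AnyRef] // erased view of the same closure
val boxedIn: AnyRef = Int.box(1)         // extra object #1: the boxed input
val boxedOut: AnyRef = g(boxedIn)        // extra object #2: the boxed output
println(Int.unbox(boxedOut))             // prints 2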
Thank you for the explanation. The size of the 100M data is ~1.4GB in
memory and each worker has 32GB of memory. There seems to be a lot of free
memory available. I wonder how Spark can hit GC issues with such a setup?
Reynold Xin <r...@databricks.com>
On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander wrote:
>
>
> It seems that there is a nice improvement with Tungsten enabled when the
> data is persisted in memory: 2x and 3x. However, the improvement is not as
> large for Parquet: 1.5x. What's interesting, with Tungsten enabled
> per
It works for me in cluster mode.
I’m running on Hortonworks 2.2.4.12 in secure mode with Hive 0.14
I built with
./make-distribution.sh --tgz -Phive -Phive-thriftserver -Phbase-provided -Pyarn -Phadoop-2.6
Doug
> On Aug 25, 2015, at 4:56 PM, Tom Graves wrote:
>
> Anyone using HiveContext with
I chatted with Patrick briefly offline. It would be interesting to
know whether the scripts have some way of saying "run a smaller
version of certain tests" (e.g. by setting a system property that the
tests look at to decide what to run). That way, if there are no
changes under sql/, we could still
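Something like this hypothetical ScalaTest sketch is what I have in mind
(the "spark.test.quick" property name is made up, not an existing flag):

import org.scalatest.FunSuite

// Hypothetical: a suite that cancels its expensive cases when the harness
// sets -Dspark.test.quick=true for changes that don't touch sql/.
class CompatibilitySuiteSketch extends FunSuite {
  private val quick = sys.props.get("spark.test.quick").contains("true")
  test("full compatibility run") {
    assume(!quick, "quick mode requested; skipping the long-running checks")
    // ... the long-running checks would go here ...
  }
}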
Anyone using HiveContext with secure Hive with Spark 1.5 and have it working?
We have a non-standard version of Hive, and we were pulling in our Hive
jars, but it's failing to authenticate. It could be something in our Hive
version, but I'm wondering if Spark isn't forwarding credentials properly.
Tom
I'd be okay skipping the HiveCompatibilitySuite for core-only changes.
It does often catch bugs in changes to catalyst or sql, though. Same for
HashJoinCompatibilitySuite/VersionsSuite.
HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things
like addJar that have been broken by
There is already code in place that restricts which tests run
depending on which code is modified. However, changes inside of
Spark's core currently require running all dependent tests. If you
have some ideas about how to improve that heuristic, it would be
great.
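To sketch the shape of that heuristic (the module names and dependency
edges below are illustrative, not the real map):

// Illustrative only: map each changed module to the modules whose tests
// must also run because they depend on it.
val dependents = Map(
  "core" -> Seq("core", "sql", "streaming", "mllib"), // core touches everything
  "sql"  -> Seq("sql", "mllib"),
  "docs" -> Seq.empty[String]
)
def testsToRun(changed: Seq[String]): Seq[String] =
  changed.flatMap(m => dependents.getOrElse(m, Seq(m))).distinct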
- Patrick
On Tue, Aug 25, 2015 a
Hello y'all,
So I've been getting kinda annoyed with how many PR tests have been
timing out. I took the log from one of my PRs and did some crunching on
the data from the output, and here's a list of the 5 slowest suites:
307.14s HiveSparkSubmitSuite
382.641s VersionsSuite
398s
Is there a JIRA to update the SQL/Hive docs? (Link: Spark SQL and
DataFrames - Spark 1.5.0 Documentation)
Final chance to fill out the survey!
http://goo.gl/forms/erct2s6KRR
I'm gonna close it to new responses tonight and send out a summary of the
results.
Nick
On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas
wrote:
> I'm planning to close the survey to further responses early next week.
>
> If y
On Tue, Aug 25, 2015 at 2:17 AM, wrote:
> Then, if I wanted to do a build against a specific profile, I could also
> pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts
> correctly named. The default behaviour should be the same. Child pom files
> would need to reference $
This probably means your app is failing and the second attempt is
hitting that issue. You may fix the "directory already exists" error
by setting
spark.eventLog.overwrite=true in your conf, but most probably that
will just expose the actual error in your app.
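For reference, one way to set that programmatically (a spark-defaults.conf
entry works just as well):

import org.apache.spark.SparkConf

// Let the event log writer overwrite an existing application log directory.
val conf = new SparkConf().set("spark.eventLog.overwrite", "true")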
On Tue, Aug 25, 2015 at 9:37 AM, Varad
Here is the error:
yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User
class threw exception: Log directory
hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302
already exists!)
I am using Cloudera 5.3.2 with Spark 1.2.0.
Any help is appreciated.
Th
Thank you for the suggestions; actually, this project has already been on
spark-packages for 1-2 months.
So I think what I need is some promotion :P
2015-08-25 23:51 GMT+08:00 saurfang [via Apache Spark Developers List] <
ml-node+s1001551n1380...@n3.nabble.com>:
> This is very cool. I also have a sbt
You can add it to Spark Packages, I guess: http://spark-packages.org/
Thanks
Best Regards
On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai wrote:
> Sorry for the previous line-breaking format; trying to resend the mail again.
>
> I have written an sbt plugin called spark-deployer, which is able to deploy
I've got an interesting challenge in building Spark. For various reasons we
do a few different builds of Spark, typically with a few different profile
options (e.g. against different versions of Hadoop, some with/without Hive,
etc.). We mirror the Spark repo internally and have a build server that
bu