jdbc/save DataFrameWriter implementation change

2016-04-12 Thread Justin.Pihony
Hi, I have a ticket open on how save should delegate to the jdbc method; however, when I went to implement this it just didn't seem clean. Please take a look at my comment on https://issues.apache.org/jira/browse/SPARK-14525 and let me know if you agree with the second approach or not. Thanks, Just

Accessing Secure Hadoop from Mesos cluster

2016-04-12 Thread Tony Kinsley
I have been working towards getting some Spark streaming jobs to run in Mesos cluster mode (using docker containers) and write data periodically to a secure HDFS cluster. Unfortunately this does not seem to be well supported in Spark currently ( https://issues.apache.org/jira/browse/SPARK-12909). T

Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-12 Thread Joseph Bradley
That sounds useful. Would you mind creating a JIRA for it? Thanks! Joseph On Mon, Apr 11, 2016 at 2:06 AM, Rahul Tanwani wrote: > Hi, > > Currently the RandomForest algo takes a single maxBins value to decide the > number of splits to take. This sometimes causes training time to go very > high
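The request above hinges on maxBins playing two different roles. The following is a minimal pure-Python sketch (not Spark's actual implementation) of why a single maxBins value matters much more for continuous features than for categorical ones: for a continuous feature it caps the number of candidate split thresholds searched, while a categorical feature only ever needs one bin per category.

```python
# Sketch (not Spark's code) of how maxBins affects split-candidate counts.

def candidate_bins(feature_values, is_categorical, max_bins):
    """Return the split candidates a tree learner would consider."""
    if is_categorical:
        # A categorical feature needs exactly one bin per distinct
        # category, regardless of how large max_bins is.
        return sorted(set(feature_values))
    # For a continuous feature, take up to max_bins - 1 quantile-style
    # thresholds, so a large max_bins directly inflates training work.
    values = sorted(feature_values)
    n = len(values)
    k = min(max_bins - 1, n - 1)
    return [values[(i + 1) * n // (k + 1)] for i in range(k)]

continuous = [float(x) for x in range(1000)]
categorical = [x % 3 for x in range(1000)]

# With max_bins=100, the continuous feature gets 99 candidate thresholds...
print(len(candidate_bins(continuous, False, 100)))  # 99
# ...while the categorical one still only needs its 3 categories.
print(len(candidate_bins(categorical, True, 100)))  # 3
```

A separate maxBins for each feature type, as proposed, would let users shrink the continuous-feature search without touching categorical handling.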

Re: Spark on Mesos 0.28 issue

2016-04-12 Thread Timothy Chen
Hi Yang, Can you share the master log/slave log? Tim > On Apr 12, 2016, at 2:05 PM, Yang Lei wrote: > > I have been able to run spark submission in docker container (HOST network) > through Marathon on mesos and target to Mesos cluster (zk address) for at > least Spark 1.6, 1.5.2 over Mesos

Re: Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Nicholas Chammas
Yes, this is a known issue. The core devs are already aware of it. [CC dev] FWIW, I believe the Spark 1.6.1 / Hadoop 2.6 package on S3 is not corrupt. It may be the only 1.6.1 package that is not corrupt, though. :/ Nick On Tue, Apr 12, 2016 at 9:00 PM Augustus Hong wrote: > Hi all, > > I'm t

Spark on Mesos 0.28 issue

2016-04-12 Thread Yang Lei
I have been able to run Spark submissions in a docker container (HOST network) through Marathon on Mesos, targeting a Mesos cluster (zk address), for at least Spark 1.6 and 1.5.2 over Mesos 0.26 and 0.27. I do need to define SPARK_PUBLIC_DNS and SPARK_LOCAL_IP so that the Spark driver can announce the
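A minimal sketch of the env-var setup described above, assuming a Marathon-launched driver container with HOST networking; the `$HOST` variable (which Marathon injects with the agent's hostname) and the fallback address are assumptions, not details from the thread.

```shell
#!/bin/sh
# Hypothetical entrypoint sketch for a Spark driver container on
# Marathon/Mesos. With HOST networking the container shares the agent's
# network stack, so the driver must bind to and advertise an address
# that executors elsewhere in the cluster can actually reach.

: "${HOST:=10.0.0.42}"          # Marathon injects $HOST; fallback is made up

export SPARK_LOCAL_IP="$HOST"   # address the driver binds to
export SPARK_PUBLIC_DNS="$HOST" # address the driver advertises (UI, URLs)

echo "driver will bind and advertise: $SPARK_LOCAL_IP"
```

Without these, the driver tends to announce a container-internal or loopback address that executors cannot connect back to.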

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-12 Thread Herman van Hövell tot Westerflier
I am not sure if you can push a limit through a join. This becomes problematic if not all keys are present on both sides; in such a case a limit can produce fewer rows than the set limit. This might be a rare case in which whole stage codegen is slower, due to the fact that we need to buffer the r
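The correctness concern above can be shown with a toy example. This is a plain-Python illustration (not Spark code, and the data is made up) of why a LIMIT cannot simply be pushed below a join: if not every key on one side has a match on the other, limiting an input first can yield fewer rows than the limit asks for.

```python
# Toy inner join over (key, value) pairs, to illustrate limit pushdown.

def inner_join(left, right):
    """Return left rows whose key also appears on the right."""
    right_keys = {k for k, _ in right}
    return [(k, lv) for k, lv in left if k in right_keys]

left = [(k, "L%d" % k) for k in range(10)]      # keys 0..9
right = [(k, "R%d" % k) for k in range(5, 10)]  # only keys 5..9 match

# Correct plan: join first, then limit.
print(len(inner_join(left, right)[:3]))   # 3 rows, as requested

# Naive "pushdown": limit the left side to 3 rows before joining.
pushed = inner_join(left[:3], right)
print(len(pushed))                        # 0 rows -- keys 0..2 never match
```

This is why a limit above a join can at best be turned into a safe over-fetch on the inputs, never a hard cap on either side.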

SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-12 Thread Rajesh Balamohan
Hi, I ran the following query in Spark (latest master codebase) and it took a long time to complete even though it was a broadcast hash join. It appears that the limit computation is done only after computing the complete join condition. Shouldn't the limit condition be pushed to BroadcastHashJoin (wh

Possible deadlock in registering applications in the recovery mode

2016-04-12 Thread Niranda Perera
Hi all, I have encountered a small issue in the standalone recovery mode. Let's say there was an application A running in the cluster. Due to some issue, the entire cluster, together with application A, goes down. Then later the cluster comes back online, and the master then goes into the 're