Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-22 Thread StanZhai
Could this be related to https://issues.apache.org/jira/browse/SPARK-17733 ? -- Original -- From: "Cheng Lian-3 [via Apache Spark Developers List]"; Send time: Thursday, Feb 23, 2017 9:43 AM To: "Stan Zhai"; Subject: Re: The driver hangs at DataFrame.rdd in

[Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-22 Thread Shouheng Yi
Hi Spark developers, Currently my team at Microsoft is extending Spark's machine learning functionality to include new learners and transformers. We would like users to use these within Spark pipelines so that they can mix and match with existing Spark learners/transformers, and overall have

The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-22 Thread StanZhai
Hi all, The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL) is complex. The following is the thread dump of my driver: org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:230) org.apache.spark.sql.catalyst.expressions.IsNotNull.equals(nullExpr

Re: Output Committers for S3

2017-02-22 Thread Matthew Schauer
Well, the issue I'm trying to solve is slow writing due to S3's implementation of move as copy/delete. It seems like your S3 committers and S3Guard both ameliorate that somewhat by parallelizing the copy. I assume there's no better way to solve this issue without sacrificing safety. Even if ther

Re: A DataFrame cache bug

2017-02-22 Thread gen tang
Hi, The example that I provided was not very clear, so I have added a clearer example in JIRA. Thanks Cheers Gen On Wed, Feb 22, 2017 at 3:47 PM, gen tang wrote: > Hi Kazuaki Ishizaki > > Thanks a lot for your help. It works. However, a more strange bug appears > as follows: > > import org.apache.