Re: Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

2017-04-05 Thread gjohnson35
Thanks Andrew. I completely missed that. It worked after removing the null-safe join condition. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-left-join-with-multiple-columns-Join-condition-is-missing-or-trivial-Use-the-CROSS-JOIN-syntax-tp21297p2

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-05 Thread Holden Karau
Following up, the issue with missing pypandoc/pandoc on the packaging machine has been resolved. On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau wrote: > See SPARK-20216, if Michael can let me know which machine is being used > for packaging I can see if I can install pandoc on it (should be simpl

[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi, I'm trying to run queries with many values in the IN operator. The result is that with more than 10K values the IN operator gets slower. For example, this code runs for about 20 seconds. df = spark.range(0,10,1,1) df.where('id in ({})'.format(','.join(map(str,range(10))))).count() Any

Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Garren Staubli
Query building time is significant because, although the query is simple, it is long: almost 4,000 characters on its own. Task deserialization takes an inordinate amount of time (0.9s) when I run your test, and building the query itself takes several seconds. I would recommend using a JOIN (a broadcast

Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it? > On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote: > > Hi, > I'm trying to run queries with many values in IN operator. > > The result is that for more than 10K values IN op