Thanks Andrew, I completely missed that. It worked after I removed the
null-safe join condition.
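For anyone who finds this thread later, here is a minimal PySpark sketch of the kind of change involved (the DataFrame and column names are made up for illustration; the original question was about the Scala API, where null-safe equality is written <=>):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, 'a'), (None, 'b')], ['k', 'v'])
right = spark.createDataFrame([(1, 'x'), (None, 'y')], ['k', 'w'])

# Null-safe equality (eqNullSafe in Python, <=> in Scala). On some Spark
# versions the planner did not recognize this as an equi-join condition,
# which surfaced as "Join condition is missing or trivial. Use the CROSS
# JOIN syntax".
# joined = left.join(right, left['k'].eqNullSafe(right['k']), 'left')

# Plain equality condition, which plans as a normal equi-join:
joined = left.join(right, left['k'] == right['k'], 'left')
joined.show()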
Following up, the issue with missing pypandoc/pandoc on the packaging
machine has been resolved.
On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau wrote:
> See SPARK-20216; if Michael can let me know which machine is being used
> for packaging, I can see if I can install pandoc on it (should be simple).
Hi,
I'm trying to run queries with many values in the IN operator.
The result is that for more than 10K values the IN operator gets slower.
For example, this code runs for about 20 seconds:

df = spark.range(0, 10, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(10000))))).count()

Any ideas?
Query building time is significant here: it's a simple query, but a long
one at almost 4,000 characters.
Task deserialization takes up an inordinate amount of time (0.9s) when I
run your test, and building the query string itself takes several seconds.
I would recommend using a JOIN (a broadcast join) instead.
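As a rough sketch of that suggestion, reusing the df from the repro above (the values DataFrame and its column name are illustrative, not from the thread):

from pyspark.sql.functions import broadcast

# Put the IN-list values into a small DataFrame instead of a giant SQL string.
values = spark.createDataFrame([(i,) for i in range(10000)], ['id'])

# A left-semi join has the same semantics as IN; broadcasting the small side
# keeps it a cheap map-side hash join with no shuffle of df.
df.join(broadcast(values), on='id', how='leftsemi').count()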
Just out of curiosity, what would happen if you put your 10K values into a
temp table and then did a join against it?
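In case it helps, a minimal sketch of that approach (the view names are made up):

# Register both sides as temporary views and express the IN as a semi join.
spark.createDataFrame([(i,) for i in range(10000)], ['id']) \
    .createOrReplaceTempView('in_values')
df.createOrReplaceTempView('data')

spark.sql(
    'SELECT * FROM data LEFT SEMI JOIN in_values ON data.id = in_values.id'
).count()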
> On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote:
>
> Hi,
> I'm trying to run queries with many values in the IN operator.
>
> The result is that for more than 10K values the IN operator gets slower.