Hi, I'm trying to run queries with many values in the IN operator. What I'm seeing is that once the list grows past roughly 10K values, the IN operator gets noticeably slower.
For example, the following takes about 20 seconds to run:

df = spark.range(0, 100000, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()

Any ideas how to improve this? Is it a bug?

-- 
Maciek Bryński