Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Garren Staubli
Query building time is significant here: the query is simple but long, at almost 4,000 characters. When I run your test, task deserialization alone takes an inordinate amount of time (0.9s), and building the query itself takes several seconds. I would recommend using a JOIN (a broadcast
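
For illustration, here is a minimal sketch of the two approaches, assuming the suggestion above means joining against a small DataFrame of the wanted values with a broadcast hint. The function names, the table name and the integer key column are hypothetical and not taken from the original test.

```python
def build_in_query(table, column, values):
    """Build the long `WHERE column IN (...)` query string (the slow variant)."""
    # Every value is inlined into the SQL text, so the string grows with the list.
    in_list = ", ".join(str(int(v)) for v in values)
    return f"SELECT * FROM {table} WHERE {column} IN ({in_list})"


def filter_with_broadcast_join(spark, df, column, values):
    """Keep rows of `df` whose `column` is in `values`, using a join instead of IN."""
    # Imported here so build_in_query stays usable without Spark installed.
    from pyspark.sql.functions import broadcast

    # One-column DataFrame holding the wanted values (assumed to be integers).
    wanted = spark.createDataFrame([(int(v),) for v in values], [column])
    # A left semi join returns only df's columns and never duplicates df's rows;
    # broadcast() hints that the small side should be shipped to every executor.
    return df.join(broadcast(wanted), on=column, how="left_semi")
```

With a live SparkSession `spark` and a DataFrame `df` that has a `key` column, `filter_with_broadcast_join(spark, df, "key", ids)` should select the same rows as running the string from `build_in_query` against a temp view of `df`, while keeping the query text short no matter how many values are in the list.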
