Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Fred Reiss
If you just want to emulate pushing down a join, you can wrap the IN list query in a JDBCRelation directly:

    scala> val r_df = spark.read.format("jdbc").option("url",
         |   "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
    r_df: org.apache.spark.sql.DataFrame = [A: int]

    scala> r_df.show
    +
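Fred's JDBCRelation trick can also be expressed from PySpark by passing a subquery string as the JDBC `dbtable` option, so the database evaluates the IN list itself. A minimal sketch, assuming the H2 URL and the table/column names (`R`, `A`) from the example above; the alias `pushed` is made up for illustration:

```python
def in_pushdown_subquery(table, column, values):
    """Build a subquery string usable as the JDBC 'dbtable' option so the
    IN filter is evaluated inside the database rather than in Spark.
    Table/column names and the alias are illustrative."""
    in_list = ", ".join(str(v) for v in values)
    return f"(SELECT * FROM {table} WHERE {column} IN ({in_list})) AS pushed"

# Usage sketch (mirrors the H2 example above; needs a running SparkSession):
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:h2:/tmp/testdb")
#       .option("dbtable", in_pushdown_subquery("R", "A", [1, 2, 3]))
#       .load())
```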

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel:
> Just out of curiosity, what would happen if you put your 10K values into a
> temp table and then did a join against it?

The answer is predicate pushdown. In my case I'm using this kind of query on a JDBC table, and the IN predicate is executed on the DB in less
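Whether the IN predicate actually reached the database can be checked in the physical plan, which lists pushed predicates for JDBC scans. A small helper over the plan text; the exact `PushedFilters: [In(A, ...)]` format is an assumption based on how Spark prints JDBC scans, not something stated in the thread:

```python
def has_pushed_in_filter(plan_text, column):
    """Return True if the physical-plan text shows an IN filter pushed to
    the source, e.g. 'PushedFilters: [In(A, [1,2,3])]'. The plan format
    assumed here is how Spark typically renders JDBC scans."""
    return "PushedFilters" in plan_text and f"In({column}" in plan_text

# Usage sketch: capture the text that df.explain() prints and inspect it.
# print(has_pushed_in_filter(plan_text, "A"))
```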

Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote:
>
> Hi,
> I'm trying to run queries with many values in the IN operator.
>
> The result is that for more than 10K values the IN op
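Michael's temp-table idea can be sketched in PySpark by turning the value list into a one-column DataFrame and joining it (broadcast, since 10K values are tiny). The column name "A" is carried over from the thread's example, and `big_df`/`values` are hypothetical; only the row-shaping helper is plain Python:

```python
def values_as_rows(values):
    """Shape a flat value list into single-column rows of the form
    [(v1,), (v2,), ...], as spark.createDataFrame expects for a
    one-column DataFrame (pure Python, no Spark required)."""
    return [(v,) for v in values]

# Usage sketch (requires a SparkSession; names are illustrative):
# from pyspark.sql.functions import broadcast
# values_df = spark.createDataFrame(values_as_rows(values), ["A"])
# result = big_df.join(broadcast(values_df), on="A", how="inner")
```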

Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Garren Staubli