Take this simple join:

SELECT m.title AS title, solr.aggCount AS aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) AS aggCount
  FROM ratings
  WHERE rating >= 4
  GROUP BY movie_id
  ORDER BY aggCount DESC
  LIMIT 10
) AS solr ON solr.movie_id = m.movie_id
ORDER BY aggCount DESC

I would like the ability to push the inner sub-query aliased as "solr"
down into the data source engine (in this case Solr), since that would
greatly reduce the amount of data that has to be transferred from
Solr into Spark. I would imagine this issue also comes up frequently
when the underlying engine is a JDBC data source ...
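For the JDBC case specifically, Spark's built-in JDBC source already allows a limited form of this: the `dbtable` option accepts an arbitrary parenthesized sub-query, so the aggregation, filter, sort, and limit all run inside the remote database and only the ten pre-aggregated rows come back to Spark. A minimal sketch (the URL and table names here are placeholders, not from my actual setup):

```scala
// Sketch only: the JDBC URL and schema are hypothetical.
// The sub-query below is evaluated entirely by the remote database;
// Spark only ever sees the ten rows it returns.
val topRated = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/movielens")
  .option("dbtable",
    """(SELECT movie_id, COUNT(*) AS aggCount
      |FROM ratings
      |WHERE rating >= 4
      |GROUP BY movie_id
      |ORDER BY aggCount DESC
      |LIMIT 10) AS solr""".stripMargin)
  .load()

// Register it so the outer join can run in Spark over the small result:
topRated.createOrReplaceTempView("solr")
```

Something equivalent for Solr is what I'm after.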

Is this possible? Of course, my example is a bit cherry-picked, and
determining whether a sub-query can be pushed down into the data source
engine is probably not a trivial task, but I'm wondering if Spark has
the hooks to let me try ;-)
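One hook I've been eyeing (purely a guess on my part that it fits this use case) is spark.experimental.extraStrategies, which lets you register a custom planning Strategy that pattern-matches on the logical plan and swaps in your own physical scan. A rough outline, where SolrScanExec and isSolrRelation are hypothetical names I'd have to write myself:

```scala
// Sketch only: SolrScanExec and isSolrRelation are hypothetical
// placeholders, not existing Spark or spark-solr classes.
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.execution.SparkPlan

object SolrPushdownStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Match a limit over a sorted aggregate on a Solr relation and
    // translate the whole sub-tree into a single Solr facet query.
    case Limit(limit, Sort(order, true, Aggregate(grouping, aggs, child)))
        if isSolrRelation(child) =>
      SolrScanExec(grouping, aggs, order, limit, child) :: Nil
    case _ => Nil  // anything else: fall back to Spark's normal planning
  }
}

// Registered before running the query:
// spark.experimental.extraStrategies = Seq(SolrPushdownStrategy)
```

If there's a cleaner extension point than this (I gather newer data source APIs expose pushdown interfaces more directly), I'd love a pointer.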

Cheers,
Tim

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
