The problem is with the way the Spark query plan is being created, IMO. What was happening before is that the order of the tables mattered: when the larger table was given first, the query took a very long time (~53 minutes to complete). I changed the order of the tables to put the smaller one first (and also replaced the one-row table with the full one) and modified the query to look like this:
SELECT c.NAME, h.name FROM counties c, hospitals h
WHERE c.NAME = 'Dutchess' AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1 minute to finish the entire execution and join 3,141 polygons with 6.5k points). Let me know if you need more info to pinpoint the issue.

Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> What's the type of the shape column?
>
> Can you disclose what SomeUDF does (by showing the code)?
>
> Cheers
>
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran <raghuki...@gmail.com> wrote:
>
>> Hi,
>>
>> I create two tables: counties, with just one row (it actually has 2k
>> rows, but I used only one), and hospitals, which has 6k rows. The join
>> command I use is as follows; it takes far too long to run and has never
>> finished successfully (even after nearly 10 minutes). The following is
>> what I have:
>>
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); // has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h
>>     JOIN counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); // this takes too long!
>>
>> // whereas the following, which is the exact equivalent of the above,
>> // finishes very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
>>     SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); // gives me the correct answer of 8
>>
>> Any suggestions on what I can do to optimize and debug this piece of code?
>>
>> Regards,
>> Raghu
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
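[Editor's note: the behavior described in this thread is consistent with a join whose only condition is an opaque UDF, which Spark can evaluate only as a nested-loop (effectively cartesian) join. The following is a toy, Spark-free sketch of the cost difference; the class name, the counting field, and the trivial `intersects` stand-in for the spatial UDF are all hypothetical, not code from the thread.]

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy illustration: when the join predicate is an opaque UDF, every row pair
// must be tested (|counties| * |hospitals| UDF calls). A cheap equality
// filter applied first (like c.NAME = 'Dutchess' in the rewritten query)
// shrinks one side before the expensive predicate ever runs.
public class JoinCost {
    // Counts invocations of the stand-in for the expensive spatial UDF.
    static int udfCalls = 0;

    // Trivial stand-in predicate; the real ST_Intersects would be far costlier.
    static boolean intersects(int shape, int location) {
        udfCalls++;
        return shape == location;
    }

    // Join with only the UDF as its condition: a full nested loop.
    static long nestedLoopJoin(List<Integer> counties, List<Integer> hospitals) {
        udfCalls = 0;
        long matches = 0;
        for (int c : counties)          // every county...
            for (int h : hospitals)     // ...tested against every hospital
                if (intersects(c, h)) matches++;
        return matches;
    }

    // Same join, but a cheap filter prunes counties before the UDF runs.
    static long filteredJoin(List<Integer> counties, List<Integer> hospitals, int wanted) {
        udfCalls = 0;
        long matches = 0;
        for (int c : counties) {
            if (c != wanted) continue;  // cheap equality filter first
            for (int h : hospitals)
                if (intersects(c, h)) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Integer> counties = IntStream.range(0, 3141).boxed().collect(Collectors.toList());
        List<Integer> hospitals = IntStream.range(0, 6500).boxed().collect(Collectors.toList());

        nestedLoopJoin(counties, hospitals);
        int fullCost = udfCalls;        // 3141 * 6500 = 20,416,500 UDF calls

        filteredJoin(counties, hospitals, 42);
        int filteredCost = udfCalls;    // only 6,500 UDF calls

        System.out.println(fullCost + " UDF calls vs " + filteredCost);
    }
}
```

In Spark itself, calling `joinDf.explain()` on the DataFrame would show which physical join strategy the planner chose, which is a good first debugging step for a slow join.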