I've noticed that two queries, which return identical results, have very
different performance. I'd be interested in any hints about how to avoid
problems like this.

The DataFrame df contains a string field "series" and an integer "eday", the
number of days since (or before) the 1970-01-01 epoch.

I'm doing some analysis over a sliding date window and, for now, avoiding
UDAFs. I'm therefore using a self join. First, I create

val laggard = df
  .withColumnRenamed("series", "p_series")
  .withColumnRenamed("eday", "p_eday")

Then, the following query runs in 16s:

df.join(laggard,
  (df("series") === laggard("p_series")) &&
  (df("eday") === (laggard("p_eday") + 1))).count

while the following query runs in 4 - 6 minutes:

df.join(laggard,
  (df("series") === laggard("p_series")) &&
  ((df("eday") - laggard("p_eday")) === 1)).count

It's worth noting that the series term is necessary to keep the query from
degenerating into a complete Cartesian product over the data.

Ideally, I'd like to look at lags of more than one day, but the following is
equally slow:

df.join(laggard,
  (df("series") === laggard("p_series")) &&
  (df("eday") - laggard("p_eday")).between(1, 7)).count
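
One possible workaround for the multi-day case (a sketch, assuming df and
laggard as defined above; the column names k and join_eday are invented for
illustration) is to materialize each lag offset explicitly, so the range
condition becomes a plain equality the planner can use as a join key:

import org.apache.spark.sql.functions._

// Sketch: expand laggard with an explicit offset k in 1..7 and a
// precomputed key column, turning the between() range condition into
// an equi-join on (series, eday).
val expanded = laggard
  .withColumn("k", explode(array((1 to 7).map(lit): _*)))
  .withColumn("join_eday", col("p_eday") + col("k"))

df.join(expanded,
  df("series") === expanded("p_series") &&
  df("eday") === expanded("join_eday")).count

This multiplies the right side by seven before joining, but each comparison
stays an equality rather than a filter over all same-series pairs.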

Any advice about the general principle at work here would be welcome.

Thanks,
David



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Differing-performance-in-self-joins-tp13864.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
