This is a very interesting problem. Spark SQL supports non-equi joins, but they are very inefficient on large tables.
One possible solution is to partition both tables, using (cast(ds as bigint) / 240) as the partition key. Then, for each partition of dataset1, you can run SQL like:

select * from dataset1 inner join dataset2
on abs(cast(dataset1.ds as bigint) - cast(dataset2.ds as bigint)) < 240

where dataset2 is restricted to the 3 adjacent partitions, with keys key - 1, key, and key + 1 (i.e., the dataset2 rows whose maximum "ds" is 240 seconds greater than dataset1's and whose minimum is 240 seconds less). After iterating over every partition of dataset1, you have the full result.

BTW, not sure if you really want stream SQL: https://github.com/thunderain-project/StreamSQL

-----Original Message-----
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 8:41 AM
To: u...@spark.incubator.apache.org
Subject: Joining by timestamp.

Hi,

I have a peculiar problem: I have two (large) data sets.

Data set 1: ((timestamp), Iterable[Any]) => {
(2014-07-10T00:02:45.045+0000,ArrayBuffer((2014-07-10T00:02:45.045+0000,98.4859,22)))
(2014-07-10T00:07:32.618+0000,ArrayBuffer((2014-07-10T00:07:32.618+0000,75.4737,22)))
}

Data set 2: ((timestamp), Iterable[Any]) => {
(2014-07-10T00:03:16.952+0000,ArrayBuffer((2014-07-10T00:03:16.952+0000,99.6148,23)))
(2014-07-10T00:08:11.329+0000,ArrayBuffer((2014-07-10T00:08:11.329+0000,80.9017,23)))
}

I need to join them, but the catch is that the timestamps are not the same; they can differ by approximately 4 minutes +/-, and those records need to be joined. Any idea is very much appreciated.

What I am thinking right now: keep a file descriptor on a sorted dataset2 and read through the sorted records of dataset1; for each record, check for any record matching the criteria; on a match, emit (record1, record2), otherwise continue reading record2 until it matches. I know this only works for very small files; that's the reason I need help.

Thanks,
D.
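The bucketing idea from the reply above can be sketched outside Spark in plain Python (the function names and the in-memory dict are illustrative; a real Spark job would express the buckets as partition keys and the final filter as the join condition):

```python
from datetime import datetime

WINDOW = 240  # seconds: records match when |ts1 - ts2| < 240

def to_epoch(ts):
    # Parse an ISO-8601 timestamp like 2014-07-10T00:02:45.045+0000.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z").timestamp()

def bucket_join(ds1, ds2, window=WINDOW):
    """Join (timestamp, payload) pairs whose timestamps differ by < window.

    Each record lands in bucket floor(epoch / window); a dataset1 record
    can only match dataset2 records in buckets k - 1, k, and k + 1, so
    only those three buckets are scanned per record.
    """
    buckets = {}
    for ts, payload in ds2:
        epoch = to_epoch(ts)
        buckets.setdefault(int(epoch // window), []).append((epoch, ts, payload))

    out = []
    for ts1, p1 in ds1:
        e1 = to_epoch(ts1)
        k = int(e1 // window)
        for kk in (k - 1, k, k + 1):          # the 3 adjacent buckets
            for e2, ts2, p2 in buckets.get(kk, []):
                if abs(e1 - e2) < window:     # the non-equi join condition
                    out.append(((ts1, p1), (ts2, p2)))
    return out

# Sample records from the original question.
ds1 = [("2014-07-10T00:02:45.045+0000", (98.4859, 22)),
       ("2014-07-10T00:07:32.618+0000", (75.4737, 22))]
ds2 = [("2014-07-10T00:03:16.952+0000", (99.6148, 23)),
       ("2014-07-10T00:08:11.329+0000", (80.9017, 23))]

for left, right in bucket_join(ds1, ds2):
    print(left[0], "<->", right[0])
```

On the sample data each dataset1 record pairs with the dataset2 record roughly 30-40 seconds away, and the cross-pairs (more than 240 seconds apart) are filtered out. The adjacent-bucket scan is safe because any record within 240 seconds of epoch e must fall in the same bucket as e or one of its two neighbors.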
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-timestamp-tp10367.html Sent from the Apache Spark User List mailing list archive at Nabble.com.