This is a very interesting problem. Spark SQL does support non-equi joins, but 
they are very inefficient on large tables.

One possible solution is to partition both tables, using (cast(ds as bigint) / 
240) as the partition key (240 seconds being your 4-minute tolerance). Then, 
for each partition of dataset1, you can run SQL like "select * from dataset1 
inner join dataset2 on abs(cast(dataset1.ds as bigint) - cast(dataset2.ds as 
bigint)) < 240".
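
For example, a rough Scala sketch of deriving that partition key (parseTs and 
bucketOf are made-up helper names; this assumes the ISO-style timestamps shown 
below are converted to epoch seconds):

import java.text.SimpleDateFormat

// Parses timestamps like "2014-07-10T00:02:45.045+0000" to epoch seconds.
// SimpleDateFormat is not thread-safe, so build one per task/partition.
def parseTs(s: String): Long =
  new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ").parse(s).getTime / 1000

// 240-second bucket: two timestamps within 240 s of each other always
// land in the same or an adjacent bucket.
def bucketOf(epochSec: Long): Long = epochSec / 240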

For a given partition of dataset1 with key "key", the matching rows of 
dataset2 always fall in the 3 adjacent partitions with keys key - 1, key, and 
key + 1 (a matching dataset2 "ds" can be at most 240 seconds greater than, or 
at most 240 seconds less than, a dataset1 timestamp in that partition).

After iterating over every partition of dataset1 this way, you have the 
complete result.
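
If you'd rather not drive the per-partition SQL by hand, the same idea can be 
sketched with plain RDDs: copy every dataset2 row into buckets key - 1, key, 
and key + 1, do an ordinary equi-join on the bucket key, then filter on the 
real 240-second condition. This is only an illustration; the (Long, Any) 
record layout is an assumption:

import org.apache.spark.SparkContext._   // pair-RDD implicits on old Spark 1.x
import org.apache.spark.rdd.RDD

def approxJoin(ds1: RDD[(Long, Any)],    // (epoch seconds, payload)
               ds2: RDD[(Long, Any)]): RDD[((Long, Any), (Long, Any))] = {
  val left  = ds1.map { case (t, v) => (t / 240, (t, v)) }
  val right = ds2.flatMap { case (t, v) =>
    val k = t / 240
    Seq(k - 1, k, k + 1).map(b => (b, (t, v)))  // cover adjacent buckets
  }
  left.join(right)                             // cheap equi-join on bucket key
      .values
      .filter { case ((t1, _), (t2, _)) => math.abs(t1 - t2) < 240 }
}

Each qualifying pair is emitted exactly once: a dataset1 row sits in one 
bucket, and a dataset2 row is copied to three distinct buckets, so they can 
meet at most once.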

BTW, not sure if you are really after streaming SQL; if so, see: 
https://github.com/thunderain-project/StreamSQL


-----Original Message-----
From: durga [mailto:durgak...@gmail.com] 
Sent: Tuesday, July 22, 2014 8:41 AM
To: u...@spark.incubator.apache.org
Subject: Joining by timestamp.

Hi

I have a peculiar problem.

I have two data sets (large ones).
Data set 1:

((timestamp), Iterable[Any]) => {
(2014-07-10T00:02:45.045+0000,ArrayBuffer((2014-07-10T00:02:45.045+0000,98.4859,22)))
(2014-07-10T00:07:32.618+0000,ArrayBuffer((2014-07-10T00:07:32.618+0000,75.4737,22)))
}

Data set 2:

((timestamp), Iterable[Any]) => {
(2014-07-10T00:03:16.952+0000,ArrayBuffer((2014-07-10T00:03:16.952+0000,99.6148,23)))
(2014-07-10T00:08:11.329+0000,ArrayBuffer((2014-07-10T00:08:11.329+0000,80.9017,23)))
}

I need to join them. But the catch is that the two timestamps are not the 
same; they can differ by approximately +/- 4 minutes.

Records within that window need to be joined.

Any ideas are very much appreciated.

What I am thinking right now:

Open a file descriptor on the sorted Dataset2.
Read the sorted records of Dataset1;
     for each record, check for any Dataset2 record matching the criteria:
        if it matches, emit (record1, record2);
        if not, continue reading Dataset2 records until one matches.
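
In code, something like this local, in-memory sketch (names and types are 
only illustrative; timestamps as epoch seconds):

def mergeByTime(ds1: Array[(Long, Any)],   // both inputs sorted by timestamp
                ds2: Array[(Long, Any)]): Seq[((Long, Any), (Long, Any))] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[((Long, Any), (Long, Any))]
  var lo = 0                               // left edge of the ds2 window
  for ((t1, v1) <- ds1) {
    while (lo < ds2.length && ds2(lo)._1 <= t1 - 240) lo += 1  // drop too-old rows
    var i = lo
    while (i < ds2.length && ds2(i)._1 < t1 + 240) {           // rows within +/- 4 min
      out += (((t1, v1), ds2(i)))
      i += 1
    }
  }
  out.toSeq
}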

I know this works only for very small files; that's the reason I need help.

Thanks,
D.



