Hi,

I have a peculiar problem.

I have two large data sets.
Data set 1:

((timestamp),iterable[Any]) => {
(2014-07-10T00:02:45.045+0000,ArrayBuffer((2014-07-10T00:02:45.045+0000,98.4859,22)))
(2014-07-10T00:07:32.618+0000,ArrayBuffer((2014-07-10T00:07:32.618+0000,75.4737,22)))
}

Data set 2:
((timestamp),iterable[Any]) => {
(2014-07-10T00:03:16.952+0000,ArrayBuffer((2014-07-10T00:03:16.952+0000,99.6148,23)))
(2014-07-10T00:08:11.329+0000,ArrayBuffer((2014-07-10T00:08:11.329+0000,80.9017,23)))
}

I need to join them, but the catch is that the timestamps are not identical:
matching records can differ by approximately 4 minutes in either direction.

Records whose timestamps fall within that window need to be joined.

Any ideas are very much appreciated.

This is what I am thinking of right now:

Open a file descriptor for the sorted data set 2.
Read the sorted records of data set 1.
     For each record, check data set 2 for a record that matches the criterion.
        If one matches, emit (record1, record2).
        If not, keep reading records from data set 2 until one matches.
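
To make the scan above concrete, here is a rough sketch in plain Scala (no Spark). It assumes both data sets fit in memory as sequences already sorted by timestamp, that timestamps are converted to epoch milliseconds, and that "approximately 4 minutes" means a difference of at most 4 minutes, inclusive. The names MergeJoin, toMillis, mergeJoin and toleranceMs are made up for illustration.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import scala.collection.mutable.ArrayBuffer

object MergeJoin {
  // Matches timestamps such as 2014-07-10T00:02:45.045+0000.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ")

  // Hypothetical helper: timestamp string -> epoch milliseconds.
  def toMillis(ts: String): Long =
    OffsetDateTime.parse(ts, fmt).toInstant.toEpochMilli

  // Both inputs must be sorted by timestamp (epoch millis), ascending.
  def mergeJoin[A, B](left: IndexedSeq[(Long, A)],
                      right: IndexedSeq[(Long, B)],
                      toleranceMs: Long): Seq[((Long, A), (Long, B))] = {
    val out = ArrayBuffer.empty[((Long, A), (Long, B))]
    var lo = 0 // index of the first right-hand record that can still match
    for (l <- left) {
      // Skip right-hand records too old for this and every later left record.
      while (lo < right.length && right(lo)._1 < l._1 - toleranceMs) lo += 1
      // Emit every right-hand record inside [l - tolerance, l + tolerance].
      var j = lo
      while (j < right.length && right(j)._1 <= l._1 + toleranceMs) {
        out += ((l, right(j)))
        j += 1
      }
    }
    out.toSeq
  }
}
```

With the four sample records above and a tolerance of 4 * 60 * 1000 ms, this pairs 00:02:45.045 with 00:03:16.952 and 00:07:32.618 with 00:08:11.329. Unlike my description, it emits all matches for a record, not just the first one.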

I know this only works for very small files, which is why I need help.
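
One direction that might scale better is to bucket the timestamps so that an ordinary equality join becomes possible. The sketch below uses plain Scala collections only and is an assumption on my part, not something I have tried on the real data: buckets are exactly one tolerance wide, so a match can only sit in the same bucket or a directly neighbouring one. BucketJoin, bucket and bucketJoin are hypothetical names. In Spark the same shape would presumably be a flatMap that keys one data set by the three candidate bucket ids, a join on the bucket id, and a filter on the real time difference.

```scala
object BucketJoin {
  // Bucket id for a timestamp (epoch millis); buckets are one tolerance wide.
  def bucket(ts: Long, toleranceMs: Long): Long = Math.floorDiv(ts, toleranceMs)

  def bucketJoin[A, B](left: Seq[(Long, A)],
                       right: Seq[(Long, B)],
                       toleranceMs: Long): Seq[((Long, A), (Long, B))] = {
    // Index the right side by its own bucket.
    val rightByBucket: Map[Long, Seq[(Long, B)]] =
      right.groupBy { case (ts, _) => bucket(ts, toleranceMs) }

    for {
      l <- left
      b = bucket(l._1, toleranceMs)
      // Candidates can only be in the same bucket or a direct neighbour.
      nb <- Seq(b - 1, b, b + 1)
      r <- rightByBucket.getOrElse(nb, Seq.empty)
      // Bucketing only narrows the candidates; check the real difference.
      if math.abs(l._1 - r._1) <= toleranceMs
    } yield (l, r)
  }
}
```

Each right-hand record lives in exactly one bucket, so no pair is emitted twice, and neither input needs to be sorted.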

Thanks,
D.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-timestamp-tp10367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
