Hi
I have peculiar problem,
I have two data sets (large ones) .
Data set1:
((timestamp),iterable[Any]) => {
(2014-07-10T00:02:45.045+0000,ArrayBuffer((2014-07-10T00:02:45.045+0000,98.4859,22)))
(2014-07-10T00:07:32.618+0000,ArrayBuffer((2014-07-10T00:07:32.618+0000,75.4737,22)))
}
DataSet2:
((timestamp),iterable[Any]) =>{
(2014-07-10T00:03:16.952+0000,ArrayBuffer((2014-07-10T00:03:16.952+0000,99.6148,23)))
(2014-07-10T00:08:11.329+0000,ArrayBuffer((2014-07-10T00:08:11.329+0000,80.9017,23)))
}
I need to join them , But the catch is , both time stamps are not same ,
they can be approximately 4mins +/-.
those records needs to be joined
Any idea is very much appreciated.
I am thinking right now.
file descriptor for sorted Dataset2.
Read the sorted records of dataset1 .
for each record , check for any record matching with the criteria ,
if match emit the record1,record2
if not matching continue reading record2 until it matches.
I know this works for a very small files , That's the reason I need help.
Thanks,
D.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-timestamp-tp10367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.