Please let me know if the following can be done in Spark:

In MapReduce terms, I need:
1) Map function:
1.1) Read a Hive record.
1.2) Build a key from some fields of the record, and register my own key
comparison function with the framework. This function decides key equality
by computing some metric over the fields of the two keys; the metric makes
it possible to establish 'approximate' equality of two keys.
2) Reduce:
2.1) Get an iterator over the records grouped under 'approximately' equal keys.
2.2) Write the records sharing an 'approximately' equal key to a separate
Hive table, called a 'block' table.
3) Another MR job
3.1) For each 'block' table, run a more detailed comparison of its records
to find 'duplicate' records.
3.2) Write information about the 'duplicate' records to yet another Hive table.

How can I do all these steps in Spark?
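
For concreteness, here is a rough sketch of what I imagine these steps
looking like in Spark, using Spark SQL with Hive support. All names are
hypothetical: the source table 'people', its 'id' and 'name' columns, the
prefix-based blocking metric, and the Levenshtein threshold are stand-ins
for the real tables, fields, and comparison logic. For simplicity the sketch
writes a single 'blocks' table carrying a block_key column rather than one
Hive table per block:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Hive support has to be enabled when the session is built.
    val spark = SparkSession.builder()
      .appName("approximate-dedup-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // 1.1) Read the source Hive table.
    val records = spark.table("people")

    // 1.2) Instead of registering a comparator, derive a canonical blocking
    // key so that 'approximately' equal records receive the exact same key.
    // Here the metric is a lowercased name prefix; any metric that maps
    // near-matches to one canonical value works the same way.
    val blocked = records.withColumn(
      "block_key",
      lower(substring(col("name"), 1, 4)))

    // 2.1 / 2.2) Records sharing a block_key form one block; persist them.
    blocked.write.mode("overwrite").saveAsTable("blocks")

    // 3.1) Detailed pairwise comparison inside each block via a self-join
    // on block_key; 'a.id < b.id' avoids self-pairs and mirrored pairs.
    val a = spark.table("blocks").alias("a")
    val b = spark.table("blocks").alias("b")
    val duplicates = a
      .join(b, col("a.block_key") === col("b.block_key") &&
               col("a.id") < col("b.id"))
      .filter(levenshtein(col("a.name"), col("b.name")) <= 2)
      .select(col("a.id").as("left_id"), col("b.id").as("right_id"))

    // 3.2) Write info about the 'duplicate' pairs to yet another Hive table.
    duplicates.write.mode("overwrite").saveAsTable("duplicates")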

As I understand it, to achieve all these steps I need Spark to be able to:

    Run MapReduce-style jobs with Hive tables as input and output
    Use my own Scala classes (types, functions, ...) as MapReduce keys,
with my own algorithms for comparing these keys
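
On the second point, my understanding is that Spark's shuffle groups keys
by hashCode and equals, so the two must agree; a free-form 'approximate'
comparator cannot be plugged in directly, and the key itself has to carry a
canonical form. A hypothetical sketch of such a key class (the rawName field
and the prefix metric are stand-ins for the real fields and metric):

    // Key that reduces its fields to a canonical form and bases equality,
    // hashing, and ordering on that form.
    case class BlockKey(rawName: String) {
      val canonical: String = rawName.toLowerCase.take(4)

      // equals and hashCode must agree, otherwise the hash-based shuffle
      // will send 'equal' keys to different partitions.
      override def equals(other: Any): Boolean = other match {
        case that: BlockKey => canonical == that.canonical
        case _              => false
      }
      override def hashCode: Int = canonical.hashCode
    }

    object BlockKey {
      // An Ordering makes the key usable with sortByKey and
      // repartitionAndSortWithinPartitions as well.
      implicit val ordering: Ordering[BlockKey] = Ordering.by(_.canonical)
    }

With a key like this, an RDD[(BlockKey, record)] can be grouped with
groupByKey(), and the 'approximate' equality is effectively encoded in the
canonical form.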
