Please let me know if the following can be done in Spark. In terms of MapReduce I need:

1) Map function:
1.1) Get a Hive record.
1.2) Create a key from some fields of the record. Register my own key comparison function with the framework. This function decides key equality by computing a metric from the fields of the two keys; the metric allows 'approximate' equality of two keys.

2) Reduce:
2.1) Get an iterator over records keyed by 'approximately' equal keys.
2.2) Write records with 'approximately' equal keys to a separate Hive table, called a 'block' table.

3) Another MR job:
3.1) For each 'block' table, do a more detailed comparison of the records in it to find 'duplicate' records.
3.2) Write info about the 'duplicate' records into yet another Hive table.

A rough sketch of what I mean follows below.
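To make the question concrete, here is roughly what I imagine in Spark Scala. Everything here is a placeholder: the table names (`source_table`, `duplicates_table`), the field positions, and the `blockingKey` / `isDuplicate` logic are all made up for illustration. I have also assumed the 'approximate' equality can be reduced to a deterministic blocking key, since hash partitioning needs one; part of my question is whether that assumption is necessary:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical sketch -- table names, field positions and the
// blocking/duplicate logic are placeholders, not my real code.
val spark = SparkSession.builder()
  .appName("approximate-dedup")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// 1) Map: read the Hive records.
val records = spark.table("source_table").rdd

// Express 'approximate' equality as a deterministic blocking key,
// so hash partitioning sends candidate matches to the same group
// (here: a normalised prefix of the field the metric uses).
def blockingKey(row: Row): String =
  row.getString(0).toLowerCase.take(3)

// 2) Reduce: group records whose keys are 'approximately' equal.
val blocks = records.groupBy(blockingKey _)

// 3) Second job: detailed pairwise comparison within each block.
def isDuplicate(a: Row, b: Row): Boolean =
  a.getString(1).equalsIgnoreCase(b.getString(1)) // placeholder metric

val duplicates = blocks.flatMap { case (key, rows) =>
  val rs = rows.toIndexedSeq
  for {
    i <- rs.indices
    j <- (i + 1) until rs.size
    if isDuplicate(rs(i), rs(j))
  } yield (key, rs(i).toString, rs(j).toString)
}

// 3.2) Write the duplicate info to another Hive table.
duplicates.toDF("block_key", "record_a", "record_b")
  .write.mode("overwrite").saveAsTable("duplicates_table")
```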
How can I do all these steps in Spark? As I understand it, to achieve them I need Spark to be able to:

- run MapReduce-style jobs with Hive tables as input/output;
- let me specify my own Scala classes (types, functions, ...) as MapReduce keys, with my own algorithms for comparing those keys (see the sketch below).
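If the answer is that a custom key class is the way to go, I imagine something like the sketch below (the field names and the 0.5 threshold are made up). My concern is that an 'approximate' `equals` is non-transitive and cannot fully agree with `hashCode`, so I don't know whether Spark's shuffle will group such keys correctly:

```scala
// Hypothetical custom key class; field names and threshold are placeholders.
// Caveat: Spark groups by hashCode/equals, and an 'approximate' equals
// breaks the usual equals contract (it is not transitive), so I am not
// sure this can work at all -- hence the question.
class ApproxKey(val name: String, val amount: Double) extends Serializable {

  // The 'approximate' metric between two keys.
  private def metric(other: ApproxKey): Double =
    math.abs(this.amount - other.amount)

  override def equals(other: Any): Boolean = other match {
    case k: ApproxKey => this.name == k.name && metric(k) < 0.5
    case _            => false
  }

  // Hash only the exact part of the key, so that 'approximately'
  // equal keys can still land in the same partition.
  override def hashCode: Int = name.hashCode
}
```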