Dear List,

What are common approaches for querying over a union of tables / RDDs? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to compute the sum of some field over a date range in SQL. Using the log schema, I can read each file as a distinct SchemaRDD, but I want to union them all and query against one 'table'.
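Roughly what I have in mind, as a minimal sketch (assuming the Spark 1.1-era SQLContext / SchemaRDD API; the LogRecord case class, paths, and field names are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Made-up per-day log record; the real fields would come from the log schema.
case class LogRecord(dt: String, bytes: Long)

object UnionLogs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-logs"))
    val sqlContext = new SQLContext(sc)

    val days = Seq("2014-07-01", "2014-07-02", "2014-07-03")

    // One SchemaRDD per daily log file.
    val perDay: Seq[SchemaRDD] = days.map { d =>
      val rows = sc.textFile(s"hdfs:///logs/$d.log")
        .map(_.split("\t"))
        .map(f => LogRecord(dt = f(0), bytes = f(1).toLong))
      sqlContext.createSchemaRDD(rows)
    }

    // Union the per-day SchemaRDDs into one logical "table" and query it.
    val allDays = perDay.reduce(_ unionAll _)
    allDays.registerTempTable("logs")  // registerAsTable() on Spark 1.0.x

    sqlContext
      .sql("SELECT SUM(bytes) FROM logs WHERE dt BETWEEN '2014-07-01' AND '2014-07-02'")
      .collect()
      .foreach(println)
  }
}

This works for a handful of days, but it is exactly the kind of table I'd like to persist and update incrementally rather than rebuild on every run.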
If this data were in MySQL, I could have a table for each day of data and use a MyISAM merge table to union these tables together, then just query against the merge table. What's nice here is that MySQL persists the merge table, and the merge table is r/w, so one can simply update the merge table once per day. (What's not nice is that merge tables scale poorly, backup admin is a pain, and oh hey, I'd like to use Spark, not MySQL.)

One naive and untested idea (that achieves implicit persistence): scan an HDFS directory for log files, create one RDD per file, union() the RDDs, then create a SchemaRDD from that union() (see the rough sketch in the P.S. below).

A few specific questions:

* Any good approaches to a merge / union table? (Other than the naive idea above.) Preferably with some way to persist that table / RDD between Spark runs. (How does Impala approach this problem?)

* Has anybody tried joining against such a union of tables / RDDs on a very large amount of data? When I've tried (non-Spark-SQL) union()ing Sequence Files and then join()ing them against another RDD, Spark seems to compute the full union before doing any join() computation (and eventually OOMs the cluster because the union of Sequence Files is so big). I haven't tried something similar with Spark SQL.

* Are there any plans related to this on the Spark roadmap? (This feature would be a nice complement to, say, persistent RDD indices for interactive querying.)

* Related question: are there plans to use Parquet Index Pages to make Spark SQL faster? E.g. log indices over date ranges would be relevant here.

All the best,
-Paul
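P.S. For reference, a rough sketch of the naive idea above (again just a sketch: it assumes the Hadoop 2.x FileSystem API and Spark 1.1-era Spark SQL, and LogRecord / the tab-split parsing are placeholders for the real log schema):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Placeholder record type standing in for the real log schema.
case class LogRecord(dt: String, bytes: Long)

object NaiveUnionTable {
  def buildTable(sc: SparkContext, sqlContext: SQLContext, dir: String): SchemaRDD = {
    // Scan the HDFS directory for the daily log files.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path(dir)).filter(_.isFile).map(_.getPath.toString)

    // One plain RDD per file, then one big union().
    val perFile: Seq[RDD[LogRecord]] = files.toSeq.map { f =>
      sc.textFile(f).map(_.split("\t")).map(a => LogRecord(a(0), a(1).toLong))
    }
    val unioned = sc.union(perFile)

    // Only now apply the schema, so the whole union is queryable as one table.
    sqlContext.createSchemaRDD(unioned)
  }
}

// Usage would be something like:
//   NaiveUnionTable.buildTable(sc, sqlContext, "hdfs:///logs").registerTempTable("logs")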