Dear List,

What are common approaches for querying over a union of tables / RDDs?
E.g. suppose I have a collection of log files in HDFS, one log file per
day, and I want to compute the sum of some field over a date range in SQL.
Using the log schema, I can read each as a distinct SchemaRDD, but I want to
union them all and query against one 'table'.

If this data were in MySQL, I could have a table for each day of data and
use a MyISAM merge table to union these tables together and just query
against the merge table.  What's nice here is that MySQL persists the merge
table, and the merge table is r/w, so one can just update the merge table
once per day.  (What's not nice is that merge tables scale poorly, backup
admin is a pain, and, oh hey, I'd like to use Spark, not MySQL.)

One naive and untested idea (that achieves implicit persistence): scan an
HDFS directory for log files, create one RDD per file, union() the RDDs,
then create a SchemaRDD from that union().
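
Roughly, I'm imagining something like the sketch below (untested; it
assumes the Spark 1.1 SQLContext / SchemaRDD API, JSON-formatted logs, and
made-up paths and field names like 'bytes' and 'day'):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("union-logs"))
  val sqlContext = new SQLContext(sc)

  // List the per-day log files under a directory (path is made up).
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val logPaths = fs.listStatus(new Path("hdfs:///logs/"))
    .map(_.getPath.toString)
    .filter(_.endsWith(".json"))

  // One SchemaRDD per file, folded into a single SchemaRDD with unionAll().
  val perDay = logPaths.map(sqlContext.jsonFile(_))
  val allDays = perDay.reduce(_ unionAll _)

  // Register the union as one "table" and sum a field over a date range.
  allDays.registerTempTable("logs")
  val total = sqlContext.sql(
    "SELECT SUM(bytes) FROM logs " +
    "WHERE day BETWEEN '2014-07-01' AND '2014-07-31'")
  total.collect().foreach(println)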

A few specific questions:
 * Any good approaches to a merge / union table (other than the naive idea
above)?  Preferably with some way to persist that table / RDD between Spark
runs.  (How does Impala approach this problem?)

 * Has anybody tried joining against such a union of tables / RDDs on a
very large amount of data?  When I've tried (non-Spark-SQL) union()ing
Sequence Files and then join()ing them against another RDD, Spark seems to
compute the full union before doing any join() computation (and eventually
OOMs the cluster because the union of Sequence Files is so big).  I haven't
tried something similar with Spark SQL.  (A rough sketch of what I mean by
the non-SQL pattern follows.)
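
Untested, and with paths, key/value types, and the second RDD made up
purely for illustration:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  val sc = new SparkContext(new SparkConf().setAppName("union-then-join"))

  // One pair RDD per daily Sequence File (paths and types are made up).
  val days = Seq("hdfs:///logs/2014-07-01.seq",
                 "hdfs:///logs/2014-07-02.seq")
  val perDay = days.map(p => sc.sequenceFile[String, Long](p))

  // Union the daily RDDs into one big pair RDD...
  val allDays = perDay.reduce(_ union _)

  // ...then join against another keyed RDD.  The union seems to be fully
  // computed before the join makes progress, and eventually the cluster OOMs.
  val lookup = sc.textFile("hdfs:///keys.txt").map(line => (line, 1))
  val joined = allDays.join(lookup)
  println(joined.count())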

 * Are there any plans related to this in the Spark roadmap?  (This feature
would be a nice complement to, say, persistent RDD indices for interactive
querying.)

 * Related question: are there plans to use Parquet Index Pages to make
Spark SQL faster?  E.g. log indices over date ranges would be relevant here.

All the best,
-Paul
