I think you can use a Hive table partitioned by date:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
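Roughly, something like this (a minimal sketch against the 1.2-era
HiveContext; the table name, schema, and HDFS paths are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("partitioned-logs"))
    val hiveCtx = new HiveContext(sc)

    // One partition per day; the Hive metastore persists the table
    // definition between Spark runs.
    hiveCtx.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS logs (bytes BIGINT, msg STRING)
      PARTITIONED BY (day STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 'hdfs:///logs'""")

    // Adding a new day's log is just registering a new partition.
    hiveCtx.sql("ALTER TABLE logs ADD PARTITION (day='2015-01-11') " +
      "LOCATION 'hdfs:///logs/2015-01-11'")

    // Partition pruning means only the requested days are scanned.
    val total = hiveCtx.sql(
      "SELECT SUM(bytes) FROM logs WHERE day BETWEEN '2015-01-01' AND '2015-01-11'")

Hive only scans the partitions the WHERE clause selects, so a date
range like the one in your example maps directly onto partition pruning.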
On 1/11/15 9:51 PM, Paul Wais wrote:
Dear List,
What are common approaches for querying over a union of tables /
RDDs? E.g. suppose I have a collection of log files in HDFS, one log
file per day, and I want to compute the sum of some field over a date
range in SQL. Using the log schema, I can read each file as a distinct
SchemaRDD, but I want to union them all and query against one 'table'.
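For concreteness, the end state I'm after looks something like this
(the schema, paths, and use of jsonFile are made up just to show the
query shape):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("log-sum"))
    val sqlCtx = new SQLContext(sc)

    // Stand-in: imagine all the daily logs already visible as one table.
    sqlCtx.jsonFile("hdfs:///logs/2015-01-11.json").registerTempTable("logs")

    // The date-range aggregate I want is then just SQL:
    val total = sqlCtx.sql(
      "SELECT SUM(bytes) FROM logs WHERE day BETWEEN '2015-01-01' AND '2015-01-11'")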
If this data were in MySQL, I could have a table for each day of data
and use a MyISAM merge table to union these tables together and just
query against the merge table. What's nice here is that MySQL
persists the merge table, and the merge table is r/w, so one can just
update the merge table once per day. (What's not nice is that merge
tables scale poorly, backup admin is a pain, and oh hey I'd like to
use Spark not MySQL).
One naive and untested idea (that achieves implicit persistence): scan
an HDFS directory for log files, create one RDD per file, union() the
RDDs, then create a SchemaRDD from that union().
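In code, that naive idea would look roughly like this (the HDFS layout
and JSON format are just for illustration):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("union-logs"))
    val sqlCtx = new SQLContext(sc)

    // Scan the HDFS directory for per-day log files.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path("hdfs:///logs")).map(_.getPath.toString)

    // One SchemaRDD per file, unioned into one queryable 'table'.
    // unionAll is lazy and requires all files to share a schema.
    val all = files.map(f => sqlCtx.jsonFile(f)).reduce(_ unionAll _)
    all.registerTempTable("logs")

    val total = sqlCtx.sql("SELECT SUM(bytes) FROM logs")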
A few specific questions:
* Any good approaches to a merge / union table (other than the naive
idea above)? Preferably with some way to persist that table / RDD
between Spark runs. (How does Impala approach this problem?)
* Has anybody tried joining against such a union of tables / RDDs on
a very large amount of data? When I've tried union()ing (non-Spark-SQL)
Sequence Files and then join()ing them against another RDD, Spark seems
to try to compute the full union before doing any join() computation,
and eventually OOMs the cluster because the union of Sequence Files is
so big. I haven't tried anything similar with Spark SQL. (A sketch of
this pattern follows the list below.)
* Are there any plans related to this in the Spark roadmap? (This
feature would be a nice complement to, say, persistent RDD indices for
interactive querying.)
* Related question: are there plans to use Parquet Index Pages to
make Spark SQL faster? E.g. log indices over date ranges would be
relevant here.
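For reference, the (non-Spark-SQL) pattern from the second bullet
above is roughly the following (key/value types and paths are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits for join()

    val sc = new SparkContext(new SparkConf().setAppName("union-join"))

    // Union many per-day sequence files...
    val paths = Seq("hdfs:///logs/2015-01-10.seq", "hdfs:///logs/2015-01-11.seq")
    val unioned = paths.map(p => sc.sequenceFile[String, Long](p)).reduce(_ union _)

    // ...then join against another keyed RDD. The join shuffles the
    // whole union, which is where the cluster runs out of memory.
    val other = sc.parallelize(Seq(("some-key", 42)))
    val joined = unioned.join(other)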
All the best,
-Paul