I think you can use a Hive table partitioned by date:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
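Roughly, something like this (a minimal sketch against the 1.2-era
HiveContext; the table name, schema, and HDFS paths are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("partitioned-logs"))
    val hiveCtx = new HiveContext(sc)

    // One partition per day; the Hive metastore persists the table
    // definition between Spark runs.
    hiveCtx.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS logs (bytes BIGINT, msg STRING)
      PARTITIONED BY (day STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 'hdfs:///logs'""")

    // Adding a new day's log is just registering a new partition.
    hiveCtx.sql("ALTER TABLE logs ADD PARTITION (day='2015-01-11') " +
      "LOCATION 'hdfs:///logs/2015-01-11'")

    // Partition pruning means only the requested days are scanned.
    val total = hiveCtx.sql(
      "SELECT SUM(bytes) FROM logs WHERE day BETWEEN '2015-01-01' AND '2015-01-11'")

Hive only scans the partitions the WHERE clause selects, so a date
range like the one in your example maps directly onto partition pruning.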
On 1/11/15 9:51 PM, Paul Wais wrote:
Dear List,
What are common approaches for querying over a union of tables /
RDDs? E.g. suppose I have a collection of log files in HDFS, one log
file per day, and I want to compute the sum of some field over a date
range in SQL. Using the log schema, I can read each file as a distinct
SchemaRDD, but I want to union them all and query against one 'table'.
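For concreteness, the end state I'm after looks something like this
(the schema, paths, and use of jsonFile are made up just to show the
query shape):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("log-sum"))
    val sqlCtx = new SQLContext(sc)

    // Stand-in: imagine all the daily logs already visible as one table.
    sqlCtx.jsonFile("hdfs:///logs/2015-01-11.json").registerTempTable("logs")

    // The date-range aggregate I want is then just SQL:
    val total = sqlCtx.sql(
      "SELECT SUM(bytes) FROM logs WHERE day BETWEEN '2015-01-01' AND '2015-01-11'")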
If this data were in MySQL, I could have a table for each day of data
and use a MyISAM merge table to union these tables together and just
query against the merge table. What's nice here is that MySQL
persists the merge table, and the merge table is r/w, so one can just
update the merge table once per day. (What's not nice is that merge
tables scale poorly, backup admin is a pain, and oh hey I'd like to
use Spark not MySQL).
One naive and untested idea (that achieves implicit persistence): scan
an HDFS directory for log files, create one RDD per file, union() the
RDDs, then create a SchemaRDD from that union().
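In code, that naive idea would look roughly like this (the HDFS layout
and JSON format are just for illustration):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("union-logs"))
    val sqlCtx = new SQLContext(sc)

    // Scan the HDFS directory for per-day log files.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path("hdfs:///logs")).map(_.getPath.toString)

    // One SchemaRDD per file, unioned into one queryable 'table'.
    // unionAll is lazy and requires all files to share a schema.
    val all = files.map(f => sqlCtx.jsonFile(f)).reduce(_ unionAll _)
    all.registerTempTable("logs")

    val total = sqlCtx.sql("SELECT SUM(bytes) FROM logs")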
A few specific questions:
* Any good approaches to a merge / union table (other than the naive
idea above)? Preferably with some way to persist that table / RDD
between Spark runs. (How does Impala approach this problem?)
* Has anybody tried joining against such a union of tables / RDDs on
a very large amount of data? When I've tried union()ing (non-Spark-SQL)
Sequence Files and then join()ing them against another RDD, Spark seems
to try to compute the full union before doing any join() computation,
and eventually OOMs the cluster because the union of Sequence Files is
so big. I haven't tried anything similar with Spark SQL. (A sketch of
this pattern follows the list below.)
* Are there any plans related to this in the Spark roadmap? (This
feature would be a nice complement to, say, persistent RDD indices for
interactive querying.)
* Related question: are there plans to use Parquet Index Pages to
make Spark SQL faster? E.g. log indices over date ranges would be
relevant here.
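For reference, the (non-Spark-SQL) pattern from the second bullet
above is roughly the following (key/value types and paths are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits for join()

    val sc = new SparkContext(new SparkConf().setAppName("union-join"))

    // Union many per-day sequence files...
    val paths = Seq("hdfs:///logs/2015-01-10.seq", "hdfs:///logs/2015-01-11.seq")
    val unioned = paths.map(p => sc.sequenceFile[String, Long](p)).reduce(_ union _)

    // ...then join against another keyed RDD. The join shuffles the
    // whole union, which is where the cluster runs out of memory.
    val other = sc.parallelize(Seq(("some-key", 42)))
    val joined = unioned.join(other)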
All the best,
-Paul