Thanks Cheng! For the list, I talked with Michael Armbrust at a recent Spark
meetup and his comments were:

 * For a union of tables, use a view and the Hive metastore.
 * SQLContext might have the directory-traversing logic I need in it already.
 * The union() of sequence files I saw was slow because Spark was probably
   trying to shuffle the whole union. A similar Spark SQL join will also be
   slow (or break) unless one runs statistics so that the smaller table can be
   broadcast (e.g. see
   https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options ).
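To make the first and third points concrete, here is a rough spark-shell-style
sketch against the Spark 1.2-era HiveContext API. The per-day table names
(logs_2015_01_01, logs_2015_01_02), the dim_hosts table, the column names, and
the 50 MB threshold are all hypothetical and assume those tables are already
registered in the Hive metastore:

// Rough sketch, assuming a spark-shell session where `sc` (SparkContext)
// already exists and the hypothetical per-day tables are in the metastore.
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// (1) Union-of-tables as a metastore view, queried like a single table.
hiveCtx.sql("""
  CREATE VIEW IF NOT EXISTS all_logs AS
  SELECT * FROM logs_2015_01_01
  UNION ALL
  SELECT * FROM logs_2015_01_02
""")

// (2) Compute statistics for the small table and set the broadcast threshold,
//     so a join against it is broadcast instead of shuffled (per the linked
//     "other configuration options" section of the SQL programming guide).
hiveCtx.sql("ANALYZE TABLE dim_hosts COMPUTE STATISTICS noscan")
hiveCtx.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

val summed = hiveCtx.sql(
  """SELECT h.region, SUM(l.bytes)
     FROM all_logs l JOIN dim_hosts h ON l.host = h.host
     GROUP BY h.region""")
summed.collect().foreach(println)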
I have never used Hive, so I'll have to investigate further.

On Tue, Jan 20, 2015 at 1:15 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> I think you can resort to a Hive table partitioned by date:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
>
> On 1/11/15 9:51 PM, Paul Wais wrote:
>>
>> Dear List,
>>
>> What are common approaches for querying over a union of tables / RDDs?
>> E.g. suppose I have a collection of log files in HDFS, one log file per
>> day, and I want to compute the sum of some field over a date range in SQL.
>> Using the log schema, I can read each file as a distinct SchemaRDD, but I
>> want to union them all and query against one 'table'.
>>
>> If this data were in MySQL, I could have a table for each day of data and
>> use a MyISAM merge table to union these tables together and just query
>> against the merge table. What's nice here is that MySQL persists the merge
>> table, and the merge table is r/w, so one can just update the merge table
>> once per day. (What's not nice is that merge tables scale poorly, backup
>> admin is a pain, and oh hey I'd like to use Spark, not MySQL.)
>>
>> One naive and untested idea (that achieves implicit persistence): scan an
>> HDFS directory for log files, create one RDD per file, union() the RDDs,
>> then create a SchemaRDD from that union().
>>
>> A few specific questions:
>> * Any good approaches to a merge / union table? (Other than the naive
>> idea above.) Preferably with some way to persist that table / RDD between
>> Spark runs. (How does Impala approach this problem?)
>>
>> * Has anybody tried joining against such a union of tables / RDDs on a
>> very large amount of data? When I've tried (non-Spark-SQL) union()ing
>> Sequence Files and then join()ing them against another RDD, Spark seems to
>> try to compute the full union before doing any join() computation (and
>> eventually OOMs the cluster because the union of Sequence Files is so
>> big). I haven't tried something similar with Spark SQL.
>>
>> * Are there any plans related to this in the Spark roadmap? (This feature
>> would be a nice complement to, say, persistent RDD indices for interactive
>> querying.)
>>
>> * Related question: are there plans to use Parquet Index Pages to make
>> Spark SQL faster? E.g. log indices over date ranges would be relevant
>> here.
>>
>> All the best,
>> -Paul
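For Cheng's partitioned-table suggestion above, a minimal sketch of what the
per-day layout might look like from a HiveContext. The HDFS paths, the (host,
bytes) schema, the Parquet storage format, and the table name are all
hypothetical; the point is that each day's log directory becomes one
partition, and a date-range query only reads the matching partitions:

// Rough sketch, assuming `sc` (SparkContext) exists, e.g. in spark-shell.
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// External table partitioned by date; data stays in the existing HDFS dirs.
hiveCtx.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS logs (host STRING, bytes BIGINT)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
""")

// Register one partition per day; this is the once-per-day "update the merge
// table" step from the original question.
hiveCtx.sql("ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2015-01-10') " +
  "LOCATION 'hdfs:///logs/2015-01-10'")
hiveCtx.sql("ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2015-01-11') " +
  "LOCATION 'hdfs:///logs/2015-01-11'")

// Partition pruning: only the directories for the requested dates are scanned.
val total = hiveCtx.sql(
  "SELECT SUM(bytes) FROM logs WHERE dt BETWEEN '2015-01-10' AND '2015-01-11'")
total.collect().foreach(println)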