Thanks Cheng!

For the list, I talked with Michael Armbrust at a recent Spark meetup
and his comments were:
 * For a union of tables, use a view and the Hive metastore
 * SQLContext might have the directory-traversing logic I need in it already
 * The union() of sequence files I saw was slow because Spark was
probably trying to shuffle the whole union.  A similar Spark SQL join
will also be slow (or break) unless one runs statistics so that the
smaller table can be broadcast (e.g. see
https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
); a rough sketch is below

I have never used Hive, so I'll have to investigate further.
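
For reference, here's roughly what that statistics/broadcast point looks like in
code (an untested sketch against Spark 1.2's HiveContext; "small_logs",
"big_logs", and the threshold value are placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)

    // Record table-level stats in the metastore. Per the docs, broadcast
    // joins currently only kick in for Hive metastore tables where this
    // has been run ('noscan' just records the on-disk size).
    hiveCtx.sql("ANALYZE TABLE small_logs COMPUTE STATISTICS noscan")

    // Tables smaller than this (in bytes) are broadcast to every worker
    // for joins instead of being shuffled; -1 disables broadcast joins.
    hiveCtx.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    val joined = hiveCtx.sql(
      "SELECT b.*, s.label FROM big_logs b JOIN small_logs s ON b.id = s.id")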


On Tue, Jan 20, 2015 at 1:15 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> I think you can resort to a Hive table partitioned by date
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
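>
> For example, something like this from Spark (untested sketch using a
> HiveContext; the table, columns, and paths are made up):
>
>     val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
>
>     // One partition per day; loading a day's log file adds one partition.
>     hiveCtx.sql("CREATE TABLE IF NOT EXISTS logs (ts BIGINT, bytes BIGINT, msg STRING) " +
>       "PARTITIONED BY (dt STRING)")
>     hiveCtx.sql("LOAD DATA INPATH '/logs/2015-01-11' INTO TABLE logs " +
>       "PARTITION (dt = '2015-01-11')")
>
>     // A date-range query should only touch the matching partitions.
>     val total = hiveCtx.sql(
>       "SELECT SUM(bytes) FROM logs WHERE dt BETWEEN '2015-01-01' AND '2015-01-11'")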
>
>
> On 1/11/15 9:51 PM, Paul Wais wrote:
>>
>>
>> Dear List,
>>
>> What are common approaches for querying over a union of tables / RDDs?
>> E.g. suppose I have a collection of log files in HDFS, one log file per day,
>> and I want to compute the sum of some field over a date range in SQL.  Using
>> the log schema, I can read each as a distinct SchemaRDD, but I want to union
>> them all and query against one 'table'.
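>>
>> Sketch of what I mean (Spark 1.2; assuming JSON logs just for illustration,
>> with made-up paths and field names):
>>
>>     val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
>>
>>     // One SchemaRDD per daily log file...
>>     val days = Seq("2015-01-09", "2015-01-10", "2015-01-11")
>>     val perDay = days.map(d => sqlCtx.jsonFile(s"hdfs:///logs/$d.json"))
>>
>>     // ...unioned and registered as a single logical table.
>>     perDay.reduce(_ unionAll _).registerTempTable("logs")
>>     val total = sqlCtx.sql("SELECT SUM(bytes) FROM logs WHERE dt >= '2015-01-10'")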
>>
>> If this data were in MySQL, I could have a table for each day of data and
>> use a MyISAM merge table to union these tables together and just query
>> against the merge table.  What's nice here is that MySQL persists the merge
>> table, and the merge table is r/w, so one can just update the merge table
>> once per day.  (What's not nice is that merge tables scale poorly, backup
>> admin is a pain, and oh hey I'd like to use Spark not MySQL).
>>
>> One naive and untested idea (that achieves implicit persistence): scan an
>> HDFS directory for log files, create one RDD per file, union() the RDDs,
>> then create a SchemaRDD from that union().
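>>
>> Roughly (untested; the directory, the LogRecord case class, and the line
>> format are placeholders):
>>
>>     import org.apache.hadoop.fs.{FileSystem, Path}
>>
>>     case class LogRecord(dt: String, bytes: Long)
>>
>>     // Pretend the logs are tab-separated "date<TAB>bytes" lines.
>>     def parseLine(line: String): LogRecord = {
>>       val Array(dt, bytes) = line.split("\t", 2)
>>       LogRecord(dt, bytes.toLong)
>>     }
>>
>>     val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
>>     import sqlCtx.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD
>>
>>     // Scan the HDFS directory, build one RDD per log file, union them all.
>>     val fs = FileSystem.get(sc.hadoopConfiguration)
>>     val files = fs.listStatus(new Path("hdfs:///logs")).map(_.getPath.toString)
>>     val unioned = sc.union(files.map(f => sc.textFile(f).map(parseLine)).toSeq)
>>
>>     // The implicit conversion turns the unioned RDD into a SchemaRDD.
>>     unioned.registerTempTable("logs")
>>     val total = sqlCtx.sql("SELECT SUM(bytes) FROM logs WHERE dt >= '2015-01-10'")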
>>
>> A few specific questions:
>>  * Any good approaches to a merge / union table? (Other than the naive
>> idea above).  Preferably with some way to persist that table / RDD between
>> Spark runs.  (How does Impala approach this problem?)
>>
>>  * Has anybody tried joining against such a union of tables / RDDs on a
>> very large amount of data?  When I've tried (non-spark-sql) union()ing
>> Sequence Files, and then join()ing them against another RDD, Spark seems to
>> try to compute the full union before doing any join() computation (and
>> eventually OOMs the cluster because the union of Sequence Files is so big).
>> I haven't tried something similar with Spark SQL.
>>
>>  * Are there any plans related to this in the Spark roadmap?  (This
>> feature would be a nice complement to, say, persistent RDD indices for
>> interactive querying).
>>
>>  * Related question: are there plans to use Parquet Index Pages to make
>> Spark SQL faster?  E.g. log indices over date ranges would be relevant here.
>>
>> All the best,
>> -Paul
>>
>>
>

