This (updates) is something we are going to think about in the next release or two.
On Thu, Nov 12, 2015 at 8:57 AM, Cristian O <cristian.b.op...@googlemail.com> wrote:

> Sorry, apparently I only replied to Reynold when I meant to copy the list as
> well, so I'm replying to myself and taking the opportunity to illustrate with
> an example.
>
> Basically I want to conceptually do this:
>
> val bigDf = sqlContext.sparkContext.parallelize((1 to 1000000)).map(i => (i, 1)).toDF("k", "v")
> val deltaDf = sqlContext.sparkContext.parallelize(Array(1, 50000)).map(i => (i, 1)).toDF("k", "v")
>
> bigDf.cache()
>
> bigDf.registerTempTable("big")
> deltaDf.registerTempTable("delta")
>
> val newBigDf = sqlContext.sql("SELECT big.k, big.v + IF(delta.v is null, 0, delta.v) FROM big LEFT JOIN delta ON big.k = delta.k")
>
> newBigDf.cache()
> bigDf.unpersist()
>
> This is essentially an update of keys "1" and "50000" only, in a dataset of
> 1 million keys.
>
> This could be done efficiently if the join preserved the cached blocks that
> are unaffected, and only copied and mutated the two affected blocks
> corresponding to the matching join keys.
>
> Statistics can determine which blocks actually need mutating. Note also that
> no shuffle is required, assuming both DataFrames are pre-partitioned by the
> same key K.
>
> In SQL this could actually be expressed as an UPDATE statement, or for more
> general use as a MERGE UPDATE:
> https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
>
> While this may seem like a very special-case optimization, it would
> effectively implement UPDATE support for cached DataFrames, for both optimal
> and non-optimal usage.
>
> I appreciate there's quite a lot here, so thank you for taking the time to
> consider it.
>
> Cristian
>
>
> On 12 November 2015 at 15:49, Cristian O <cristian.b.op...@googlemail.com> wrote:
>
>> Hi Reynold,
>>
>> Thanks for your reply.
>>
>> Parquet may very well be used as the underlying implementation, but this is
>> about more than a particular storage representation.
>>
>> There are a few things here that are interrelated and open up different
>> possibilities, so it's hard to structure, but I'll give it a try:
>>
>> 1. Checkpointing DataFrames - while a DF can be saved locally as Parquet,
>> just using that as a checkpoint would currently require explicitly reading
>> it back. A proper checkpoint implementation would save (perhaps
>> asynchronously) and prune the logical plan, while allowing the same DF to
>> keep being used, now backed by the checkpoint.
>>
>> It's important to prune the logical plan to avoid all kinds of issues that
>> may arise from unbounded expansion with iterative use cases, like this one
>> I encountered recently:
>> https://issues.apache.org/jira/browse/SPARK-11596
>>
>> But really what I'm after here is:
>>
>> 2. Efficient updating of cached DataFrames - the main use case here is
>> keeping a relatively large dataset cached and updating it iteratively from
>> streaming. For example, one would like to perform ad-hoc queries on an
>> incrementally updated, cached DataFrame. I expect this is already becoming
>> an increasingly common use case. Note that the dataset may require merging
>> (like adding) or overriding values by key, so simply appending is not
>> sufficient.
>>
>> This is very similar in concept to updateStateByKey for regular RDDs, i.e.
>> an efficient copy-on-write mechanism, albeit perhaps at CachedBatch level
>> (the row blocks of the columnar representation).
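A minimal sketch of the updateStateByKey pattern referred to above, assuming the Spark 1.5-era Streaming API; the socket source, port and running-count state are illustrative only:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("update-state-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-ckpt")  // stateful operations require a checkpoint directory

    // Illustrative source: one (key, 1) pair per incoming line.
    val deltas = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // Merge each batch's deltas into the running per-key state.
    val state = deltas.updateStateByKey[Int] { (newValues: Seq[Int], current: Option[Int]) =>
      Some(current.getOrElse(0) + newValues.sum)
    }

    state.print()
    ssc.start()
    ssc.awaitTermination()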
>> This can currently be simulated with UNION or (OUTER) JOINs, but that is
>> very inefficient as it requires copying and re-caching the entire dataset
>> and unpersisting the original one. There are also the aforementioned
>> problems with unbounded logical plans (physical plans are fine).
>>
>> These two together, checkpointing and updating cached DataFrames, would
>> give fault-tolerant, efficient updating of DataFrames, meaning streaming
>> apps could take advantage of the compact columnar representation and
>> Tungsten optimisations.
>>
>> I'm not quite sure whether something like this can be achieved by other
>> means or has been investigated before, hence why I'm looking for feedback
>> here.
>>
>> While one could use external data stores, they would add an IO penalty,
>> plus most of what's available at the moment is either HDFS (extremely
>> inefficient for updates) or key-value stores that have a 5-10x space
>> overhead over columnar formats.
>>
>> Thanks,
>> Cristian
>>
>>
>> On 12 November 2015 at 03:31, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Thanks for the email. Can you explain what the difference is between this
>>> and existing formats such as Parquet/ORC?
>>>
>>>
>>> On Wed, Nov 11, 2015 at 4:59 AM, Cristian O <cristian.b.op...@googlemail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I was wondering if there's any planned support for local disk columnar
>>>> storage.
>>>>
>>>> This could be an extension of the in-memory columnar store, or possibly
>>>> something similar to the recently added local checkpointing for RDDs.
>>>>
>>>> This could also have the added benefit of enabling iterative usage of
>>>> DataFrames by pruning the query plan through local checkpoints.
>>>>
>>>> A further enhancement would be to add update support to the columnar
>>>> format (in the immutable, copy-on-write sense of course), by maintaining
>>>> references to unchanged row blocks and only copying and mutating the
>>>> ones that have changed.
>>>>
>>>> A use case here is streaming and merging updates into a large dataset
>>>> that can be efficiently stored internally in a columnar format, rather
>>>> than accessing a more inefficient external data store like HDFS or
>>>> Cassandra.
>>>>
>>>> Thanks,
>>>> Cristian
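A minimal sketch of the explicit save-and-read-back workaround described under point 1 above, assuming Spark 1.5-era APIs; the paths, the checkpointToParquet helper name and the unionAll update step are illustrative only, not an actual checkpoint API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{DataFrame, SQLContext}

    val sc = new SparkContext(new SparkConf().setAppName("df-checkpoint-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // "Checkpoint" a DataFrame by writing it to Parquet and reading it back.
    // The round trip materialises the data and replaces the accumulated logical
    // plan with a plain Parquet scan, at the cost of an explicit write and read.
    def checkpointToParquet(df: DataFrame, path: String): DataFrame = {
      df.write.mode("overwrite").parquet(path)
      sqlContext.read.parquet(path)
    }

    // Iterative use: periodically re-assign to the pruned DataFrame so the plan
    // does not grow without bound across iterations (cf. SPARK-11596).
    var bigDf = sc.parallelize(1 to 1000000).map(i => (i, 1)).toDF("k", "v")
    for (iter <- 1 to 50) {
      val deltaDf = sc.parallelize(Seq(iter)).map(i => (i, 1)).toDF("k", "v")
      bigDf = bigDf.unionAll(deltaDf)  // stand-in for the real merge/update step
      if (iter % 10 == 0) {
        bigDf = checkpointToParquet(bigDf, s"/tmp/df-ckpt/$iter")
      }
    }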