Thanks Reynold and Cheng. It does seem like quite a bit of heavy lifting to
have a schema per row. For now I will settle for building a union schema of
all the schema versions and complaining about any incompatibilities :-)
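
Roughly the sketch I have in mind, in Scala (unionOfVersions is just a
made-up helper for illustration, not an existing Spark API):

import org.apache.spark.sql.types._

// Merge several schema versions into one union schema, failing loudly when
// the same column appears with conflicting types across versions.
def unionOfVersions(versions: Seq[StructType]): StructType = {
  val merged = scala.collection.mutable.LinkedHashMap.empty[String, StructField]
  for (schema <- versions; field <- schema.fields) {
    merged.get(field.name) match {
      case None =>
        // A column missing from some versions must be nullable in the union.
        merged += field.name -> field.copy(nullable = true)
      case Some(existing) if existing.dataType == field.dataType =>
        () // same type in every version, nothing to do
      case Some(existing) =>
        sys.error(s"Column '${field.name}' has incompatible types: " +
          s"${existing.dataType} vs ${field.dataType}")
    }
  }
  StructType(merged.values.toSeq)
}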

Looking forward to doing great things with the API!

Thanks,
Aniket

On Thu Jan 29 2015 at 01:09:15 Reynold Xin <r...@databricks.com> wrote:

> It's an interesting idea, but there are major challenges with per row
> schema.
>
> 1. Performance: the query optimizer and execution engine rely on
> assumptions about schema and data to generate optimized query plans.
> Having to re-reason about the schema for each row can substantially slow
> down the engine, both because of lost optimization opportunities and
> because of the overhead of carrying schema information with every row.
>
> 2. Data model: a per-row schema is fundamentally a different data model.
> The current relational model has gone through 40 years of research and has
> very well-defined semantics. I don't think there are well-defined
> semantics for a per-row schema data model. For example, what are the
> semantics of a UDF operating on a data cell whose schema is incompatible
> with the UDF's expected input? Should we coerce or convert the data type?
> If so, would that lead to conflicts with other coercion rules? We need to
> answer questions like these in order to have a robust data model.
>
>
>
>
>
> On Wed, Jan 28, 2015 at 11:26 AM, Cheng Lian <lian.cs....@gmail.com>
> wrote:
>
>> Hi Aniket,
>>
>> In general, the schema of all rows in a single table must be the same.
>> This is a basic assumption made by Spark SQL. Schema union does make
>> sense, and we're planning to support it for Parquet. But, as you've
>> mentioned, it doesn't help when the type of a column differs between
>> versions. Also, you need to reload the data source table after a schema
>> change.
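>>
>> For reference, once that lands, reading files written under several
>> schema versions would roughly look like this (sketch only: it assumes a
>> Spark release that ships the Parquet mergeSchema read option, and the
>> path is hypothetical):
>>
>> val df = sqlContext.read
>>   .option("mergeSchema", "true")   // merge the schemas of all Parquet files
>>   .parquet("/data/events")         // files written under several versions
>> df.printSchema()                   // union of the columns across all files
>>
>> Note that this only unions column sets; a column whose type changed
>> between versions would still fail, which is exactly the limitation you
>> mentioned.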
>>
>> Cheng
>>
>>
>> On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
>>
>>> I saw the talk on Spark data sources and, looking at the interfaces, it
>>> seems that the schema needs to be provided upfront. This works for many
>>> data sources, but I need to integrate a system that supports schema
>>> evolution by allowing users to change the schema without affecting
>>> existing rows. Basically, each row contains a schema hint (id and
>>> version), which allows developers to evolve the schema over time and
>>> perform migrations at will. Since the schema needs to be specified
>>> upfront in the data source API, one possible approach would be to build
>>> a union of all schema versions and populate row values appropriately.
>>> This works when columns have been added or deleted but not when types
>>> have changed. I was wondering whether it is possible to change the API
>>> to provide a schema for each row instead of expecting the data source
>>> to provide the schema upfront?
>>>
>>> Thanks,
>>> Aniket
>>>
>>>
>>
>>
>
