Thanks Reynold and Cheng. It does seem like quite a bit of heavy lifting to have a schema per row. For now I will settle for building a union schema of all the schema versions and complaining about any incompatibilities :-)
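
For reference, a minimal sketch of that union-schema approach in Scala (the helper name and error handling are illustrative, not a Spark API, and imports assume the org.apache.spark.sql.types package layout): it marks every column nullable and fails on type conflicts.

import org.apache.spark.sql.types.{StructField, StructType}

// Illustrative helper (not part of Spark): union several schema versions,
// marking every column nullable (older rows may lack newer columns) and
// failing when the same column name maps to conflicting types.
def unionOfSchemaVersions(versions: Seq[StructType]): StructType = {
  val merged = scala.collection.mutable.LinkedHashMap.empty[String, StructField]
  for (schema <- versions; field <- schema.fields) {
    merged.get(field.name) match {
      case None =>
        merged += field.name -> field.copy(nullable = true)
      case Some(existing) if existing.dataType == field.dataType =>
        // Same name and type seen in an earlier version: keep the existing field.
      case Some(existing) =>
        sys.error(s"Incompatible types for column '${field.name}': " +
          s"${existing.dataType} vs ${field.dataType}")
    }
  }
  StructType(merged.values.toSeq)
}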
Looking forward to doing great things with the API!

Thanks,
Aniket

On Thu Jan 29 2015 at 01:09:15 Reynold Xin <r...@databricks.com> wrote:

> It's an interesting idea, but there are major challenges with per-row
> schema.
>
> 1. Performance: the query optimizer and execution engine rely on assumptions
> about schema and data to generate optimized query plans. Having to re-reason
> about the schema for each row can substantially slow down the engine, both
> because of lost optimizations and because of the overhead of carrying schema
> information with each row.
>
> 2. Data model: per-row schema is fundamentally a different data model. The
> current relational model has gone through 40 years of research and has very
> well defined semantics. I don't think there are well defined semantics for a
> per-row schema data model. For example, what are the semantics of a UDF
> operating on a data cell whose schema is incompatible? Should we coerce or
> convert the data type? If so, will that lead to conflicting semantics with
> some other rules? We need to answer questions like this in order to have a
> robust data model.
>
> On Wed, Jan 28, 2015 at 11:26 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hi Aniket,
>>
>> In general the schema of all rows in a single table must be the same. This
>> is a basic assumption made by Spark SQL. Schema union does make sense, and
>> we're planning to support this for Parquet. But as you've mentioned, it
>> doesn't help if the types of different versions of a column differ from
>> each other. Also, you need to reload the data source table after schema
>> changes happen.
>>
>> Cheng
>>
>> On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
>>
>>> I saw the talk on Spark data sources and, looking at the interfaces, it
>>> seems that the schema needs to be provided upfront. This works for many
>>> data sources, but I have a situation in which I need to integrate a system
>>> that supports schema evolution by allowing users to change the schema
>>> without affecting existing rows. Basically, each row contains a schema
>>> hint (id and version), and this allows developers to evolve the schema
>>> over time and perform migration at will. Since the schema needs to be
>>> specified upfront in the data source API, one possible way would be to
>>> build a union of all schema versions and handle populating row values
>>> appropriately. This works when columns have been added or deleted in the
>>> schema but doesn't work if types have changed. I was wondering if it is
>>> possible to change the API to provide the schema for each row instead of
>>> expecting the data source to provide the schema upfront?
>>>
>>> Thanks,
>>> Aniket
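
For context on the upfront-schema requirement discussed in the quoted messages, a rough sketch of what such a relation could look like against the data source API (class and field names are made up, imports assume the Spark 1.3 package layout, and buildScan is stubbed):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Illustrative relation: the data source API asks for one fixed schema per
// relation, so the union of all schema versions is declared here and every
// Row produced by buildScan() must conform to it.
class EvolvingSourceRelation(
    override val sqlContext: SQLContext,
    unionSchema: StructType) extends BaseRelation with TableScan {

  override def schema: StructType = unionSchema

  // A real implementation would read the underlying rows, look at each row's
  // schema hint (id and version), and pad/reorder it with nulls to match the
  // union schema. Stubbed here with an empty RDD.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}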