dsv2 remaining work

Reynold Xin Wed, 12 Dec 2018 16:58:55 -0800

Unfortunately I can't make it to the DSv2 sync today, so I'm sending an email 
with my thoughts instead. I spent a few hours thinking about this. It's evident 
that progress has been slow because this is an important API, and people coming 
from different perspectives have very different requirements and weigh the 
priorities very differently (e.g. an issue that is super important to one person 
might not be as important to another, and people just talk past each other 
arguing about why someone ignored a broader issue in a PR or proposal).


I think the only real way to make progress is to decouple the efforts into 
major areas, and make progress somewhat independently. Of course, some care is 
needed to take care of

Here's one attempt at listing some of the remaining big rocks:

1. Basic write API -- with the current SaveMode.

2. Add Overwrite (or Replace) logical plan, and the associated API in Table.

3. Add APIs for per-table metadata operations (note that I'm not calling it a 
catalog API here). Create/drop/alter table goes here. We also need to figure 
out how to do this for the file system sources in which there is no underlying 
catalog. One idea is to treat the file system as a catalog (with arbitrary 
levels of databases). To do that, it'd be great if the identifier for a table 
were not a fixed 2- or 3-part name, but just a string array.

4. Remove SaveMode. This is blocked on at least 1 + 2, and potentially 3.

5. Design a stable, fast, smaller-surface row format to replace the existing 
InternalRow (and all the internal data types), which is internal and unstable. 
This can be further decoupled into the design for each data type.

The above are the big ones I can think of. I probably missed some, but a lot of 
other smaller things can be improved on later.
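
To make the string-array idea in item 3 a bit more concrete, here is a minimal 
sketch in Java. Everything in it (TableIdent, of, fromPath, namespace, name) is 
a hypothetical name made up for illustration, not an existing Spark API, and 
splitting a path on "/" is just one assumed way of treating the file system as 
a catalog with arbitrary levels of databases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: a table identifier that is a list of name parts
// instead of a fixed 2- or 3-part name. Not an actual Spark class.
final class TableIdent {
    private final List<String> parts;

    private TableIdent(List<String> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("identifier needs at least one part");
        }
        // Defensive copy so the identifier stays immutable.
        this.parts = Collections.unmodifiableList(new ArrayList<>(parts));
    }

    // Catalog-style identifier, e.g. of("db", "events") or of("prod", "db", "events").
    static TableIdent of(String... parts) {
        return new TableIdent(Arrays.asList(parts));
    }

    // Assumption: each directory in the path is one namespace level and the
    // last path segment is the table name. Empty segments are dropped.
    static TableIdent fromPath(String path) {
        List<String> segments = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return new TableIdent(segments);
    }

    // All parts except the last one: the (possibly empty) namespace.
    List<String> namespace() {
        return parts.subList(0, parts.size() - 1);
    }

    // The last part: the table name itself.
    String name() {
        return parts.get(parts.size() - 1);
    }
}
```

With this shape, TableIdent.of("db", "events") and 
TableIdent.fromPath("/data/warehouse/events") go through the same type, and the 
namespace can have zero, one, or many levels.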
