Hi Ryan,

Thanks for the reply; responses are inline.

>
>    - How do we keep track of the replaced column? Does it remain in the
>    schema? Either we would need to keep the old schemas or implement a new
>    “hidden” column state
>
I don't think this is the case; the function metadata provides all the
state that is needed, but perhaps I'm missing an edge case. In case it
helps, I've included a more complete example at the bottom of the e-mail [1].

>
>    - Column predicates would need to be rewritten for older data files
>    based on the default value for the replacement column
>
Is this for pruning or at data scan time? At pruning time I agree, but that
needs to be the case when changing a type regardless of the approach. At
scan time I think the same reasoning applies, but it is likely harder to
push down, so in most cases the predicate would be evaluated after
converting the old data to the new type. One advantage of using a new
column is that it allows any column to have its type changed (whereas with
the implicit approach there are restrictions on columns used in existing
transforms).
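
As a concrete illustration of the pruning case, here is a rough Python
sketch (the Predicate shape, field IDs, and helper are invented for this
example and are not Iceberg APIs) of rewriting an equality predicate on the
promoted string column so it can be checked against the long bounds stored
for older files:

from collections import namedtuple

# Hypothetical predicate shape, just for this sketch.
Predicate = namedtuple("Predicate", ["op", "column_id", "literal"])

def rewrite_for_old_file(pred):
    # Rewrite col_1 = "<string>" (new field-id 2) into a predicate on the
    # old long column (field-id 1) by applying the inverse of to_string to
    # the literal. Pruning only needs to be conservative, so returning None
    # simply means "keep the file".
    try:
        old_literal = int(pred.literal)
    except ValueError:
        return None
    return Predicate(pred.op, 1, old_literal)

# col_1 = "42" on new data maps to the old column's value 42 for pruning.
print(rewrite_for_old_file(Predicate("eq", 2, "42")))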

>
>    - This would require some dynamic default code that doesn’t exist but
>    would amount to projecting the original column and casting it — there’s not
>    much of a functional difference besides needing more complex projection
>
I'm not sure I fully understand this statement. I interpret it as saying
that the code required for either approach is a similar amount of effort,
possibly slightly more for the explicit approach described in my original
email. I agree with this, at least for whichever type promotions are
allowed next. If we are considering many more type promotions, I think the
effort would start to even out. The benefit of the explicit approach is
that it makes things simpler to reason about, adds flexibility, and helps
provide a sanity check that a type change was actually requested rather
than the result of some missed edge case.

> I also don’t agree with the expanded definition of type promotion. Type
> promotion exposes a way to implicitly cast older data to the new type. That
> doesn’t allow you to choose the string format you want for a date,


I think this depends on the motivating use-case for the feature and
the eventual scope of changing types:
* If this is for CDC, then other RDBMS systems allow arbitrary conversion of
the old column to the new column type (e.g. Postgres supports ALTER [
COLUMN ] column_name [ SET DATA ] TYPE data_type [ COLLATE collation ] [
USING expression ]). While I don't anticipate getting anywhere near that
complex immediately (maybe ever), I think there are still advantages today
to explicit modeling, and it provides a natural extension point for
supporting the more complex use-cases.
*  Will "bytes to string" be a candidate? There are two reasonable
approaches for the underlying data: either hex-encoding the old value or
doing an in-place conversion that assumes all data is valid UTF-8.
*  There is already an existing proposal
<https://github.com/apache/iceberg/issues/9065> to promote from
Long->Timestamp [2], which assumes milliseconds. This seems like an
arbitrary choice, since one could reasonably have other time granularities
stored in Long values (a parameterized strawman is sketched after this
list).
*  I'd argue that once a schema is going from "any type"->"string",
something was fairly wrong with the data modelling initially; providing
more tools to help users fix these kinds of issues seems beneficial in the
long run (again, not something that needs to be done now, but laying the
groundwork is useful).
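
To make the ambiguity concrete, here is a rough strawman of how a
parameterized conversion could state the time granularity (and, say,
overflow behavior) explicitly. The "to_timestamp" function name and the
"options", "unit", and "on-overflow" keys are invented for illustration and
are not proposed spec syntax:

# Hypothetical strawman only: the function name and option keys below are
# illustrative, not proposed spec syntax.
{"name": "event_ts", "field-id": 9, "type": "timestamp", "initial-default": {
    "function_name": "to_timestamp",
    "input_argument": {"column_id": 3, "column_type": "long"},
    "options": {"unit": "micros", "on-overflow": "fail"}
}}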

I'll also add that modelling transforms explicitly can be decoupled from
using a new ID for the translated field, but that fits less nicely because
it requires adding an additional attribute to schema fields just for type
conversion, rather than reusing the existing semantics of
"initial default".

> it’s a simple and portable translation that should be clearly defined by
> the format.


My intent was that any functions defined for conversion would be part of
the specification (not allowing arbitrary code instantiation), just as
partition transform functions are defined today.
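
In other words, readers would only ever dispatch to a closed, spec-defined
set of conversion functions, roughly along these lines (a Python sketch;
the registry and its contents are purely illustrative, not proposed spec
text):

# Illustrative only: conversion functions would be enumerated by the spec
# (just as partition transforms are today), never supplied as user code.
SPEC_CONVERSIONS = {
    "to_string": lambda value: None if value is None else str(value),
}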

Lastly, I'm happy to write up a more formal doc to carry on the
conversation if the trade-offs discussed here seem worthwhile. That might
help make the conversation easier to follow, but I don't want to write
something up if people don't think it is worth pursuing.

Thanks,
Micah


[1]  More concrete example

Schema before type alteration:

{"name": "col_1", "field-id": 1, "type": long}

Schema after type alteration ("alter col_1 set type string"):

{"name": "col_1", "field-id": 2, "initial-default": {
   "function_name": to_string
   "input_argument": {
       "column_id": 1,
       "column_type": long
   }
}"}

While this trivial example seems to make things even more verbose, it is
important to note that schema changes are generally assumed to be
infrequent, and that most schemas have many more columns. There is also a
computational improvement at read time: readers do not have to reparse old
schemas to determine a column's type under a historic schema.
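
To show how read-time resolution could work, here is a rough Python sketch
tied to the example above. None of the helpers are real Iceberg APIs:
"field" stands for the col_1 entry in the current schema, "data_file" for
an old or new data file, and read_column, column_ids, and the converters
table are all invented for illustration:

def project_column(field, data_file):
    # Rough sketch only; "field" comes from the *current* schema.
    if field.field_id in data_file.column_ids():
        # File written after the promotion: read field-id 2 directly.
        return data_file.read_column(field.field_id)
    default = field.initial_default
    if isinstance(default, dict) and "function_name" in default:
        # File written before the promotion: read the original column by
        # its old ID (1) and apply the spec-defined conversion (to_string).
        converters = {"to_string": lambda v: None if v is None else str(v)}
        convert = converters[default["function_name"]]
        source = default["input_argument"]
        return [convert(v)
                for v in data_file.read_column(source["column_id"])]
    # Otherwise fall back to normal initial-default / null handling.
    return default

The point being that the old field ID, its type, and the conversion to
apply are all carried on the current schema, so no historic schema needs to
be reparsed.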

[2] https://github.com/apache/iceberg/issues/9065





On Mon, Aug 19, 2024 at 1:09 PM Ryan Blue <b...@databricks.com.invalid>
wrote:

> I don’t think that type promotion by replacing a column is a good
> direction to head. Right now we have a fairly narrow problem of not having
> the original type information for stats. That’s a problem with a fairly
> simple solution in the long term and it doesn’t require the added
> complexity of replacing a column:
>
>    - How do we keep track of the replaced column? Does it remain in the
>    schema? Either we would need to keep the old schemas or implement a new
>    “hidden” column state
>    - Column predicates would need to be rewritten for older data files
>    based on the default value for the replacement column
>    - This would require some dynamic default code that doesn’t exist but
>    would amount to projecting the original column and casting it — there’s not
>    much of a functional difference besides needing more complex projection
>
> I also don’t agree with the expanded definition of type promotion. Type
> promotion exposes a way to implicitly cast older data to the new type. That
> doesn’t allow you to choose the string format you want for a date, it’s a
> simple and portable translation that should be clearly defined by the
> format.
>
> I think it makes sense to go with the current way that schemas work and
> continue to use field IDs to identify columns.
>
> Ryan
>
> On Mon, Aug 19, 2024 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> I think continuing to define type promotion as something that happens
>> implicitly from the reader perspective has a few issues:
>>
>> 1.  It makes it difficult to reason about all additional features that
>> might require stable types to interpret.  Examples of existing filters:
>> partition statistics file, existing partition data in manifests, existing
>> statistics values.  Some potential future features/transforms like bloom
>> filters in manifest files and  default values (e.g. moving from bytes to
>> strings).
>> 2.  It lacks flexibility in handling non-obvious transforms (e.g. date to
>> string, could have many possible formats)
>> 3.  Some of the type promotions can overflow, and clients might want to
>> handle this overflow in a variety of ways (fail on read, cap to the
>> largest allowed value, etc.).
>>
>> Instead, my preference would be to handle new promotions as follows:
>>
>> 1. Make any new type promotions require a new field ID.  This means that
>> type promotion is effectively dropping a field and adding a new one with
>> the same name. This is nice because it relies on already defined logic for
>> dropping a column and what is/isn't allowed.
>> 2.  Modelling the transformation explicitly as an initial default
>> converting a column from one type to another. E.g., a strawman sample
>> of a JSON model for long->string promotion would look like:
>>
>> "{
>>    "function_name": to_string
>>    "input_argument": {
>>        "column_id": 1
>>        "column_type": long
>>    }
>> }"
>>
>> This allows leveraging the existing on-going work of default values, and
>> provides a path forward to:
>> 1.  Allows using old statistics/partition information to the greatest
>> extent possible as an optimization, but by default would be correct if
>> readers choose not to handle this (the only thing that is necessary for
>> correct results is correct column projection resolution).
>> 2.  Add additional configuration to functions to handle potential
>> ambiguities or features the client might want (different date/numeric
>> formats, how to handle overflow).
>> 3.  Effectively makes resolution of the metadata constant time
>> (technically it would be linear in the number of promotions), instead of
>> requiring parsing/keeping old schemas for metadata about only a few fields.
>>
>> Thanks,
>> Micah
>>
>>
>>
>>
>> On Fri, Aug 16, 2024 at 4:00 PM Ryan Blue <b...@apache.org> wrote:
>>
>>> I’ve recently been working on updating the spec for new types and type
>>> promotion cases in v3.
>>>
>>> I was talking to Micah and he pointed out an issue with type promotion:
>>> the upper and lower bounds for data file columns that are kept in Avro
>>> manifests don’t have any information about the type that was used to encode
>>> the bounds.
>>>
>>> For example, when writing to a table with a float column, 4: f, the
>>> manifest’s lower_bounds and upper_bounds maps will have an entry with
>>> the type ID (4) as the key and a 4-byte encoded float for the value. If
>>> column f were later promoted to double, those maps aren’t changed. The
>>> way we currently detect that the type was promoted is to check the binary
>>> value and read it as a float if there are 4 bytes instead of 8. This
>>> prevents us from adding int to double type promotion because when there
>>> are 4 bytes we would not know whether the value was originally an int
>>> or a float.
>>>
>>> Several of the type promotion cases from my previous email hit this
>>> problem. Date/time types to string, int and long to string, and long to
>>> timestamp are all affected. I think the best path forward is to add fewer
>>> type promotion cases to v3 and support only these new cases:
>>>
>>>    - int and long to string
>>>    - date to timestamp
>>>    - null/unknown to any
>>>    - any to variant (if supported by the Variant spec)
>>>
>>> That list would allow us to keep using the current strategy and not add
>>> new metadata to track the type to our manifests. My rationale for not
>>> adding new information to track the bound types at the time that the data
>>> file metadata is created is that it would inflate the size of manifests and
>>> push out the timeline for getting v3 done. Many of us would like to get v3
>>> released to get the timestamp_ns and variant types out. And if we can get
>>> at least some of the promotion cases out that’s better.
>>>
>>> To address type promotion in the long term, I think that we should
>>> consider moving to Parquet manifests. This has been suggested a few times
>>> so that we can project just the lower and upper bounds that are needed for
>>> scan planning. That would also fix type promotion because the manifest file
>>> schema would include full type information for the stats columns. Given the
>>> complexity of releasing Parquet manifests, I think it makes more sense to
>>> get a few promotion cases done now in v3 and follow up with the rest in v4.
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Blue
>>>
>>
>
> --
> Ryan Blue
> Databricks
>
