I don’t think that type promotion by replacing a column is a good direction
to head in. Right now we have a fairly narrow problem of not having the
original type information for stats. That’s a problem with a fairly simple
long-term solution, and it doesn’t require the added complexity of replacing
a column:

   - How do we keep track of the replaced column? Does it remain in the
   schema? Either we would need to keep the old schemas or implement a new
   “hidden” column state
   - Column predicates would need to be rewritten for older data files
   based on the default value for the replacement column
   - This would require some dynamic default code that doesn’t exist today,
   but it would amount to projecting the original column and casting it;
   there’s not much of a functional difference besides needing more complex
   projection (a rough sketch follows this list)
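
To make that concrete, here is a rough, hypothetical sketch (not Iceberg
code; none of these names exist in the project) of what that dynamic default
would boil down to on the read path, i.e. a projection plus a cast:

    import java.util.Map;

    // Hypothetical illustration only: reading a value from an old data file
    // whose column was "replaced" by a promoted copy. The "dynamic default"
    // reduces to projecting the original column and casting it.
    class ReplacedColumnSketch {
      static Object readPromotedValue(Map<Integer, Object> row, int oldFieldId) {
        Object original = row.get(oldFieldId);  // project the original column
        if (original instanceof Integer) {      // cast to the promoted type (long here)
          return ((Integer) original).longValue();
        }
        return original;
      }
    }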

I also don’t agree with the expanded definition of type promotion. Type
promotion exposes a way to implicitly cast older data to the new type. That
doesn’t allow you to choose the string format you want for a date, it’s a
simple and portable translation that should be clearly defined by the
format.
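
As a tiny illustration of the distinction (plain Java, just for
illustration): widening an int to a long has exactly one well-defined
result, while rendering a date as a string depends on a chosen format, which
is why it doesn’t fit this definition of promotion.

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    class PromotionIsACast {
      public static void main(String[] args) {
        int original = 34;
        long widened = original;   // int -> long: one unambiguous result

        // date -> string has no single answer; it depends on a format choice
        LocalDate d = LocalDate.of(2024, 8, 19);
        System.out.println(d.format(DateTimeFormatter.ISO_LOCAL_DATE));          // 2024-08-19
        System.out.println(d.format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))); // 08/19/2024
      }
    }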

I think it makes sense to go with the current way that schemas work and
continue to use field IDs to identify columns.

Ryan

On Mon, Aug 19, 2024 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I think continuing to define type promotion as something that happens
> implicitly from the reader perspective has a few issues:
>
> 1.  It makes it difficult to reason about all additional features that
> might require stable types to interpret.  Examples of existing features
> affected: the partition statistics file, existing partition data in
> manifests, and existing statistics values.  Some potential future
> features/transforms would be affected too, like bloom filters in manifest
> files and default values (e.g. moving from bytes to strings).
> 2.  It lacks flexibility in handling non-obvious transforms (e.g. date to
> string, which could have many possible formats).
> 3.  Some of the type promotions can overflow, and clients might want to
> handle this overflow in a variety of ways (fail on read, cap to the largest
> allowed value, etc.).
>
> Instead, my preference would be to handle new promotions as follows:
>
> 1. Make any new type promotions require a new field ID.  This means that
> type promotion is effectively dropping a field and adding a new one with
> the same name. This is nice because it relies on already defined logic for
> dropping a column and what is/isn't allowed.
> 2.  Model the transformation explicitly as an initial default that converts
> a column from one type to another.  E.g. a strawman sample of a JSON model
> for long->string promotion would look like:
>
> {
>    "function_name": "to_string",
>    "input_argument": {
>        "column_id": 1,
>        "column_type": "long"
>    }
> }
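>
> To illustrate the kind of per-function configuration mentioned in the list
> below, the same strawman could carry optional keys (the spellings here are
> purely illustrative, not a proposal):
>
> {
>    "function_name": "to_string",
>    "input_argument": {
>        "column_id": 1,
>        "column_type": "long"
>    },
>    "options": {
>        "on_overflow": "fail"
>    }
> }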
>
> This allows leveraging the existing ongoing work on default values, and
> provides a path forward to:
> 1.  Use old statistics/partition information to the greatest extent
> possible as an optimization, while remaining correct by default if readers
> choose not to handle this (the only thing necessary for correct results is
> correct column projection resolution).
> 2.  Add additional configuration to functions to handle potential
> ambiguities or features the client might want (different date/numeric
> formats, how to handle overflow).
> 3.  Effectively make resolution of the metadata constant time (technically
> it would be linear in the number of promotions), instead of requiring
> parsing/keeping old schemas for metadata about only a few fields.
>
> Thanks,
> Micah
>
>
>
>
> On Fri, Aug 16, 2024 at 4:00 PM Ryan Blue <b...@apache.org> wrote:
>
>> I’ve recently been working on updating the spec for new types and type
>> promotion cases in v3.
>>
>> I was talking to Micah and he pointed out an issue with type promotion:
>> the upper and lower bounds for data file columns that are kept in Avro
>> manifests don’t have any information about the type that was used to encode
>> the bounds.
>>
>> For example, when writing to a table with a float column, 4: f, the
>> manifest’s lower_bounds and upper_bounds maps will have an entry with
>> the field ID (4) as the key and a 4-byte encoded float for the value. If
>> column f were later promoted to double, those maps aren’t changed. The
>> way we currently detect that the type was promoted is to check the binary
>> value and read it as a float if there are 4 bytes instead of 8. This
>> prevents us from adding int to double type promotion because when there
>> are 4 bytes we would not know whether the value was originally an int or
>> a float.
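>>
>> As a rough sketch (not the exact library code), that length-based check
>> amounts to something like this:
>>
>>     import java.nio.ByteBuffer;
>>     import java.nio.ByteOrder;
>>
>>     class BoundDecoder {
>>       // For a column that is now double, the stored bound may still be a
>>       // 4-byte float written before promotion; detect that by length.
>>       // Assumes the spec's little-endian single-value encoding.
>>       static double readDoubleBound(ByteBuffer bound) {
>>         ByteBuffer buf = bound.duplicate().order(ByteOrder.LITTLE_ENDIAN);
>>         if (buf.remaining() == 4) {
>>           return buf.getFloat();   // written when the column was a float
>>         }
>>         return buf.getDouble();    // written after promotion to double
>>       }
>>     }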
>>
>> Several of the type promotion cases from my previous email hit this
>> problem. Date/time types to string, int and long to string, and long to
>> timestamp are all affected. I think the best path forward is to add fewer
>> type promotion cases to v3 and support only these new cases:
>>
>>    - int and long to string
>>    - date to timestamp
>>    - null/unknown to any
>>    - any to variant (if supported by the Variant spec)
>>
>> That list would allow us to keep using the current strategy and avoid
>> adding new metadata to our manifests to track the type. My rationale for
>> not adding new information to track the bound types at the time that the
>> data file metadata is created is that it would inflate the size of
>> manifests and push out the timeline for getting v3 done. Many of us would
>> like to get v3 released to get the timestamp_ns and variant types out. And
>> if we can get at least some of the promotion cases out, that's better.
>>
>> To address type promotion in the long term, I think that we should
>> consider moving to Parquet manifests. This has been suggested a few times
>> so that we can project just the lower and upper bounds that are needed for
>> scan planning. That would also fix type promotion because the manifest file
>> schema would include full type information for the stats columns. Given the
>> complexity of releasing Parquet manifests, I think it makes more sense to
>> get a few promotion cases done now in v3 and follow up with the rest in v4.
>>
>> Ryan
>>
>> --
>> Ryan Blue
>>
>

-- 
Ryan Blue
Databricks
