I think continuing to define type promotion as something that happens implicitly from the reader's perspective has a few issues:
1. It makes it difficult to reason about additional features that might require stable types to interpret. Examples of existing features: partition statistics files, existing partition data in manifests, and existing statistics values. Some potential future features/transforms: bloom filters in manifest files and default values (e.g. moving from bytes to strings).
2. It lacks flexibility in handling non-obvious transforms (e.g. date to string, which could have many possible formats).
3. Some of the type promotions can overflow, and clients might want to handle this overflow in a variety of ways (fail on read, cap to the largest allowed value, etc.).

Instead, my preference would be to handle new promotions as follows:

1. Make any new type promotion require a new field ID. This means that type promotion is effectively dropping a field and adding a new one with the same name. This is nice because it relies on the already-defined logic for dropping a column and what is/isn't allowed.
2. Model the transformation explicitly as an initial default converting a column from one type to another. For example, a strawman JSON model of a long-to-string promotion would look like:

   {
     "function_name": "to_string",
     "input_argument": {
       "column_id": 1,
       "column_type": "long"
     }
   }

This leverages the existing ongoing work on default values and provides a path forward that:

1. Allows using old statistics/partition information to the greatest extent possible as an optimization, while remaining correct by default if readers choose not to handle this (the only thing necessary for correct results is correct column projection resolution).
2. Allows adding configuration to functions to handle potential ambiguities or features the client might want (different date/numeric formats, how to handle overflow).
3. Effectively makes resolution of the metadata constant time (technically, linear in the number of promotions), instead of requiring parsing/keeping old schemas for metadata about only a few fields.

Thanks,
Micah

On Fri, Aug 16, 2024 at 4:00 PM Ryan Blue <b...@apache.org> wrote:

> I've recently been working on updating the spec for new types and type
> promotion cases in v3.
>
> I was talking to Micah and he pointed out an issue with type promotion:
> the upper and lower bounds for data file columns that are kept in Avro
> manifests don't have any information about the type that was used to
> encode the bounds.
>
> For example, when writing to a table with a float column, 4: f, the
> manifest's lower_bounds and upper_bounds maps will have an entry with the
> type ID (4) as the key and a 4-byte encoded float for the value. If column
> f were later promoted to double, those maps aren't changed. The way we
> currently detect that the type was promoted is to check the binary value
> and read it as a float if there are 4 bytes instead of 8. This prevents us
> from adding int to double type promotion because when there are 4 bytes
> we would not know whether the value was originally an int or a float.
>
> Several of the type promotion cases from my previous email hit this
> problem. Date/time types to string, int and long to string, and long to
> timestamp are all affected. I think the best path forward is to add fewer
> type promotion cases to v3 and support only these new cases:
>
> - int and long to string
> - date to timestamp
> - null/unknown to any
> - any to variant (if supported by the Variant spec)
>
> That list would allow us to keep using the current strategy and not add
> new metadata to track the type to our manifests. My rationale for not
> adding new information to track the bound types at the time that the data
> file metadata is created is that it would inflate the size of manifests
> and push out the timeline for getting v3 done.
> Many of us would like to get v3 released to get the timestamp_ns and
> variant types out. And if we can get at least some of the promotion cases
> out that's better.
>
> To address type promotion in the long term, I think that we should
> consider moving to Parquet manifests. This has been suggested a few times
> so that we can project just the lower and upper bounds that are needed for
> scan planning. That would also fix type promotion because the manifest
> file schema would include full type information for the stats columns.
> Given the complexity of releasing Parquet manifests, I think it makes more
> sense to get a few promotion cases done now in v3 and follow up with the
> rest in v4.
>
> Ryan
>
> --
> Ryan Blue
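To make the ambiguity Ryan describes concrete, here is a minimal Python sketch of the byte-length detection strategy for a promoted double column. The function and parameter names are illustrative, not the actual Iceberg implementation; it only shows why 4-byte bound values stop being self-describing once int-to-double promotion is allowed, since the same 4 bytes decode to different values as an int versus a float.

```python
import struct

def decode_double_bound(value: bytes) -> float:
    """Decode a lower/upper bound for a column whose current type is double,
    using byte length to detect promotion (illustrative sketch only)."""
    if len(value) == 4:
        # 4 bytes: assume the value was written as a float before promotion.
        # If int -> double promotion were allowed, this branch would be
        # ambiguous: 4 bytes could also be an original int.
        return struct.unpack("<f", value)[0]
    return struct.unpack("<d", value)[0]

# A float bound written before promotion decodes correctly:
four_bytes = struct.pack("<f", 1.5)
print(decode_double_bound(four_bytes))            # 1.5

# But the same 4 bytes read as a little-endian int give an unrelated
# number, which is why the length heuristic cannot disambiguate:
print(struct.unpack("<i", four_bytes)[0])
```

The sketch also shows why tracking the written type explicitly (or requiring a new field ID per promotion, as proposed above) removes the need for this kind of guessing.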