I don't think that type promotion by replacing a column is a good direction to head. Right now we have a fairly narrow problem: we don't have the original type information for stats. That problem has a simple long-term solution, and it doesn't require the added complexity of replacing a column:
- How do we keep track of the replaced column? Does it remain in the schema? Either we would need to keep the old schemas or implement a new "hidden" column state
- Column predicates would need to be rewritten for older data files based on the default value for the replacement column
- This would require some dynamic default code that doesn't exist, but it would amount to projecting the original column and casting it; there's not much of a functional difference besides needing more complex projection

I also don't agree with the expanded definition of type promotion. Type promotion exposes a way to implicitly cast older data to the new type. That doesn't allow you to choose the string format you want for a date; it's a simple and portable translation that should be clearly defined by the format.

I think it makes sense to go with the current way that schemas work and continue to use field IDs to identify columns.

Ryan

On Mon, Aug 19, 2024 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I think continuing to define type promotion as something that happens
> implicitly from the reader perspective has a few issues:
>
> 1. It makes it difficult to reason about all additional features that
> might require stable types to interpret. Examples of existing features:
> the partition statistics file, existing partition data in manifests, and
> existing statistics values. Some potential future features/transforms,
> like bloom filters in manifest files and default values, are also
> affected (e.g. moving from bytes to strings).
> 2. It lacks flexibility in handling non-obvious transforms (e.g. date to
> string, which could have many possible formats).
> 3. Some of the type promotions can overflow, and clients might want to
> handle this overflow in a variety of ways (fail on read, cap to the
> largest allowed value, etc.).
>
> Instead, my preference would be to handle new promotions as follows:
>
> 1. Make any new type promotion require a new field ID. This means that
> type promotion is effectively dropping a field and adding a new one with
> the same name. This is nice because it relies on already-defined logic
> for dropping a column and what is/isn't allowed.
> 2. Model the transformation explicitly as an initial default converting
> one column to another. E.g., a strawman sample of a JSON model for a
> long -> string promotion would look like:
>
> {
>   "function_name": "to_string",
>   "input_argument": {
>     "column_id": 1,
>     "column_type": "long"
>   }
> }
>
> This leverages the existing, ongoing work on default values and provides
> a path forward to:
> 1. Allow using old statistics/partition information to the greatest
> extent possible as an optimization, while remaining correct by default
> if readers choose not to handle this (the only thing necessary for
> correct results is correct column projection resolution).
> 2. Add additional configuration to functions to handle potential
> ambiguities or features the client might want (different date/numeric
> formats, how to handle overflow).
> 3. Effectively make resolution of the metadata constant time
> (technically, linear in the number of promotions) instead of requiring
> parsing/keeping old schemas for metadata about only a few fields.
>
> Thanks,
> Micah
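[To make the model Micah proposes above concrete, here is a minimal sketch of how a reader might resolve a field promoted this way. The Java class and method names are hypothetical, not an existing Iceberg API; the sketch assumes the promotion function is attached to the new field much like an initial default.]

// Hypothetical sketch only; these classes are not part of Iceberg today.
// A promotion function recorded on the new field names the dropped source
// field and the conversion to apply, similar to an initial default.
class PromotionFunction {
  final int sourceFieldId;    // e.g. 1, the old long column
  final String functionName;  // e.g. "to_string"

  PromotionFunction(int sourceFieldId, String functionName) {
    this.sourceFieldId = sourceFieldId;
    this.functionName = functionName;
  }
}

class PromotedFieldResolver {
  // For a data file written before the promotion, the new field ID is
  // absent, so the reader projects the source column by its field ID and
  // applies the declared function instead of returning null.
  static Object resolve(Object sourceValue, PromotionFunction fn) {
    switch (fn.functionName) {
      case "to_string":
        return String.valueOf(sourceValue);
      default:
        throw new UnsupportedOperationException(
            "Unknown promotion function: " + fn.functionName);
    }
  }
}

[Format options or overflow behavior, Micah's points 2 and 3, would presumably hang off the function definition as additional optional arguments.]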
>
> On Fri, Aug 16, 2024 at 4:00 PM Ryan Blue <b...@apache.org> wrote:
>
>> I've recently been working on updating the spec for new types and type
>> promotion cases in v3.
>>
>> I was talking to Micah and he pointed out an issue with type promotion:
>> the upper and lower bounds for data file columns that are kept in Avro
>> manifests don't have any information about the type that was used to
>> encode the bounds.
>>
>> For example, when writing to a table with a float column, 4: f, the
>> manifest's lower_bounds and upper_bounds maps will have an entry with
>> the type ID (4) as the key and a 4-byte encoded float for the value. If
>> column f were later promoted to double, those maps aren't changed. The
>> way we currently detect that the type was promoted is to check the
>> binary value and read it as a float if there are 4 bytes instead of 8.
>> This prevents us from adding int to double type promotion because when
>> there are 4 bytes we would not know whether the value was originally an
>> int or a float.
>>
>> Several of the type promotion cases from my previous email hit this
>> problem. Date/time types to string, int and long to string, and long to
>> timestamp are all affected. I think the best path forward is to add
>> fewer type promotion cases to v3 and support only these new cases:
>>
>> - int and long to string
>> - date to timestamp
>> - null/unknown to any
>> - any to variant (if supported by the Variant spec)
>>
>> That list would allow us to keep using the current strategy and not add
>> new metadata to track the type in our manifests. My rationale for not
>> adding new information to track the bound types at the time the data
>> file metadata is created is that it would inflate the size of manifests
>> and push out the timeline for getting v3 done. Many of us would like to
>> get v3 released to get the timestamp_ns and variant types out, and if
>> we can get at least some of the promotion cases out, that's better.
>>
>> To address type promotion in the long term, I think that we should
>> consider moving to Parquet manifests. This has been suggested a few
>> times so that we can project just the lower and upper bounds that are
>> needed for scan planning. That would also fix type promotion because
>> the manifest file schema would include full type information for the
>> stats columns. Given the complexity of releasing Parquet manifests, I
>> think it makes more sense to get a few promotion cases done now in v3
>> and follow up with the rest in v4.
>>
>> Ryan
>>
>> --
>> Ryan Blue
>>
>

--
Ryan Blue
Databricks
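[As an illustration of the encoding issue described in the quoted message above: bounds in Avro manifests are stored as untyped bytes, so the length is the only clue to the original type. Below is a minimal sketch of that heuristic and why int-to-double promotion would break it. The Java class and method are illustrative, not the actual manifest-reading code, and it assumes the little-endian single-value encoding the spec uses for bounds.]

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BoundDecoding {
  // Current heuristic: infer the original type of a bound from its length.
  // This works for float -> double (4 bytes must be a float), but once
  // int -> double is allowed, a 4-byte bound is ambiguous: it could be an
  // int or a float, and the two decodings give different values.
  static double readPromotedDoubleBound(ByteBuffer bound) {
    ByteBuffer buf = bound.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    if (buf.remaining() == 4) {
      return buf.getFloat();  // assumes the 4 bytes were written as a float
    }
    return buf.getDouble();   // 8 bytes: written after the promotion
  }
}

[Recording the type alongside the bounds, as a Parquet manifest schema would, removes the need for this guess.]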