Hi Micah,

If we go with the approach that type promotion results in a change in the field-id, what happens when a certain field has been changed multiple times? Does that mean we end up tracking the lineage of the field's change history?
Thanks,
Gang

On Tue, Aug 20, 2024 at 7:34 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for the reply, responses inline.
>
>> - How do we keep track of the replaced column? Does it remain in the schema? Either we would need to keep the old schemas or implement a new “hidden” column state
>
> I don't think this is the case; the function metadata provides all the state that is needed, but perhaps I'm missing an edge case. In case it helps, I've included a more complete example at the bottom of the e-mail [1].
>
>> - Column predicates would need to be rewritten for older data files based on the default value for the replacement column
>
> Is this for pruning or at data scan time? At pruning time I agree, but this needs to be the case for changing the type regardless of approach? At scan time, I think the same reasoning applies but is likely harder to push down, so in most cases it would be done after converting the old data to the new type. One advantage of using a new column is that it allows any column to have its type changed (whereas with the implicit approach we have restrictions on columns used in existing transforms).
>
>> - This would require some dynamic default code that doesn’t exist but would amount to projecting the original column and casting it — there’s not much of a functional difference besides needing more complex projection
>
> I'm not sure I fully understand this statement exactly. I am interpreting it as saying the code required for either approach is a similar amount of effort, and possibly slightly more for the explicit approach described in my original email. I agree with this, at least for whichever type promotion is allowed next. If we are considering many more type promotions, I think the effort would start to even out. The benefit here is that it makes things simpler to reason about, adds flexibility, and helps provide a sanity check on whether type changes were actually requested instead of being some sort of missed edge-case.
>
>> I also don’t agree with the expanded definition of type promotion. Type promotion exposes a way to implicitly cast older data to the new type. That doesn’t allow you to choose the string format you want for a date,
>
> I think this depends on the motivating use-case for the feature and the eventual scope of changing types:
> * If this is for CDC, then other RDBMS systems allow arbitrary conversion of the old column to the new column type (e.g. postgres supports ALTER [ COLUMN ] column_name [ SET DATA ] TYPE data_type [ COLLATE collation ] [ USING expression ]). While I don't anticipate getting anywhere that complex immediately (maybe ever), I think there are still advantages today to using explicit modeling, and it allows for a natural extension point to supporting the more complex use-cases.
> * Will "bytes to string" be a candidate? There are two reasonable approaches for the underlying data: either using a hex-encoded value or doing an in-place conversion assuming all data is UTF-8.
> * There is already an existing proposal <https://github.com/apache/iceberg/issues/9065> to promote from Long->Timestamp [2] which assumes milliseconds. This seems like an arbitrary choice, where one could reasonably have other time granularities stored in Long values.
> * I'd argue that once a schema is going from "any type"->"string", something was fairly wrong with the data modelling initially; providing more tools to help users fix these types of issues seems beneficial in the long run (again, not something that needs to be done now, but laying the ground-work is useful).
>
> I'll also add that being explicit about transforms can be decoupled from using a new ID for the translated field, but it fits less nicely because it requires adding an additional attribute to schema fields just for type conversion, rather than reusing the semantics of "initial default".
>
>> it’s a simple and portable translation that should be clearly defined by the format.
>
> My intent was that any functions defined for conversion would be part of the specification (not to allow for arbitrary code instantiation), just as partition transform functions are defined today.
>
> Last, I'm happy to write up a more formal doc to carry on the conversation, if the trade-offs discussed here seem worthwhile. This might help make the conversation easier to follow, but I don't want to write something up if people don't think it is worth pursuing.
>
> Thanks,
> Micah
>
> [1] More concrete example
>
> Schema before type alteration:
>
>   {"name": "col_1", "field-id": 1, "type": "long"}
>
> Schema after type alteration ("alter col_1 set type string"):
>
>   {"name": "col_1", "field-id": 2, "initial-default": {
>       "function_name": "to_string",
>       "input_argument": {
>           "column_id": 1,
>           "column_type": "long"
>       }
>   }}
>
> While this trivial example seems to make things even more verbose, it is important to note that schema changes are generally assumed to be infrequent, and that most schemas have many more columns. There is still a computational improvement at read time in not having to reparse old schemas to determine a column's type for a historic schema.
>
> [2] https://github.com/apache/iceberg/issues/9065
>
> On Mon, Aug 19, 2024 at 1:09 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>
>> I don’t think that type promotion by replacing a column is a good direction to head. Right now we have a fairly narrow problem of not having the original type information for stats. That’s a problem with a fairly simple solution in the long term, and it doesn’t require the added complexity of replacing a column:
>>
>> - How do we keep track of the replaced column? Does it remain in the schema? Either we would need to keep the old schemas or implement a new “hidden” column state
>> - Column predicates would need to be rewritten for older data files based on the default value for the replacement column
>> - This would require some dynamic default code that doesn’t exist but would amount to projecting the original column and casting it — there’s not much of a functional difference besides needing more complex projection
>>
>> I also don’t agree with the expanded definition of type promotion. Type promotion exposes a way to implicitly cast older data to the new type. That doesn’t allow you to choose the string format you want for a date; it’s a simple and portable translation that should be clearly defined by the format.
>>
>> I think it makes sense to go with the current way that schemas work and continue to use field IDs to identify columns.
>>
>> Ryan
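For illustration, a rough sketch (hypothetical names, not an actual Iceberg API) of the "project the original column and cast it" resolution both emails describe, applied to example [1] above: a reader asked for field-id 2 reads it directly from newer files and falls back to field-id 1 plus the spec-defined conversion for files written before the change.

    # Rough sketch only -- hypothetical names and objects.
    def to_string(value):
        # A spec-defined conversion function (long -> string in example [1]).
        return None if value is None else str(value)

    def project_col_1(file_field_ids, record):
        # Resolve a read of field-id 2 ("col_1" as string) against one data file.
        if 2 in file_field_ids:
            # File written after the type change: read the string column directly.
            return record.get(2)
        # Older file: project the original column (field-id 1, a long) and apply
        # the conversion named by the new field's initial-default.
        return to_string(record.get(1))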
>>
>> On Mon, Aug 19, 2024 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>>> I think continuing to define type promotion as something that happens implicitly from the reader perspective has a few issues:
>>>
>>> 1. It makes it difficult to reason about all the additional features that might require stable types to interpret. Existing examples: the partition statistics file, existing partition data in manifests, and existing statistics values. There are also potential future features/transforms like bloom filters in manifest files and default values (e.g. moving from bytes to strings).
>>> 2. It lacks flexibility in handling non-obvious transforms (e.g. date to string, which could have many possible formats).
>>> 3. Some of the type promotions can overflow, and clients might want to handle this overflow in a variety of ways (fail on read, cap to the largest allowed value, etc.).
>>>
>>> Instead, my preference would be to handle new promotions as follows:
>>>
>>> 1. Make any new type promotions require a new field ID. This means that type promotion is effectively dropping a field and adding a new one with the same name. This is nice because it relies on already defined logic for dropping a column and what is/isn't allowed.
>>> 2. Model the transformation explicitly as an initial default converting one column to another. E.g. a strawman sample of a JSON model for long->string promotion would look like:
>>>
>>>   {
>>>     "function_name": "to_string",
>>>     "input_argument": {
>>>       "column_id": 1,
>>>       "column_type": "long"
>>>     }
>>>   }
>>>
>>> This allows leveraging the existing on-going work on default values, and provides a path forward to:
>>> 1. Using old statistics/partition information to the greatest extent possible as an optimization, while by default being correct if readers choose not to handle this (the only thing necessary for correct results is correct column projection resolution).
>>> 2. Adding additional configuration to functions to handle potential ambiguities or features the client might want (different date/numeric formats, how to handle overflow).
>>> 3. Making resolution of the metadata effectively constant time (technically linear in the number of promotions), instead of requiring parsing/keeping old schemas for metadata about only a few fields.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Fri, Aug 16, 2024 at 4:00 PM Ryan Blue <b...@apache.org> wrote:
>>>
>>>> I’ve recently been working on updating the spec for new types and type promotion cases in v3.
>>>>
>>>> I was talking to Micah and he pointed out an issue with type promotion: the upper and lower bounds for data file columns that are kept in Avro manifests don’t have any information about the type that was used to encode the bounds.
>>>>
>>>> For example, when writing to a table with a float column, 4: f, the manifest’s lower_bounds and upper_bounds maps will have an entry with the field ID (4) as the key and a 4-byte encoded float for the value. If column f were later promoted to double, those maps aren’t changed. The way we currently detect that the type was promoted is to check the binary value and read it as a float if there are 4 bytes instead of 8. This prevents us from adding int to double type promotion because when there are 4 bytes we would not know whether the value was originally an int or a float.
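As an aside, a minimal sketch (illustrative only, not Iceberg code; little-endian encoding assumed) of the byte-length check described above, and of why allowing int to double would make a 4-byte bound ambiguous:

    import struct

    def read_double_bound(serialized: bytes) -> float:
        # The column is double today; under float -> double promotion only,
        # a 4-byte bound can only have been written as a float.
        if len(serialized) == 4:
            return struct.unpack("<f", serialized)[0]
        return struct.unpack("<d", serialized)[0]

    # If int -> double were also allowed, 4 bytes would be ambiguous:
    four = struct.pack("<i", 1067030938)   # the int 1067030938 ...
    struct.unpack("<f", four)[0]           # ... is also the float ~1.2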
>>>> Several of the type promotion cases from my previous email hit this problem. Date/time types to string, int and long to string, and long to timestamp are all affected. I think the best path forward is to add fewer type promotion cases to v3 and support only these new cases:
>>>>
>>>> - int and long to string
>>>> - date to timestamp
>>>> - null/unknown to any
>>>> - any to variant (if supported by the Variant spec)
>>>>
>>>> That list would allow us to keep using the current strategy and not add new metadata to our manifests to track the type. My rationale for not adding new information to track the bound types at the time that the data file metadata is created is that it would inflate the size of manifests and push out the timeline for getting v3 done. Many of us would like to get v3 released to get the timestamp_ns and variant types out, and if we can get at least some of the promotion cases out, that’s better.
>>>>
>>>> To address type promotion in the long term, I think that we should consider moving to Parquet manifests. This has been suggested a few times so that we can project just the lower and upper bounds that are needed for scan planning. That would also fix type promotion because the manifest file schema would include full type information for the stats columns. Given the complexity of releasing Parquet manifests, I think it makes more sense to get a few promotion cases done now in v3 and follow up with the rest in v4.
>>>>
>>>> Ryan
>>>>
>>>> --
>>>> Ryan Blue
>>
>> --
>> Ryan Blue
>> Databricks
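As a closing illustration of the "date to timestamp" case above: once a reader knows an old bound was written as a date (for example because it is only 4 bytes under a column that is now a timestamp), it can still promote the bound for pruning. A minimal sketch, assuming the natural mapping of a date to midnight of that day; the helper name is hypothetical:

    # Iceberg dates are days from 1970-01-01; timestamps are microseconds from
    # 1970-01-01 00:00:00, so a date is assumed to promote to midnight of its day.
    MICROS_PER_DAY = 86_400 * 1_000_000

    def promote_date_bound(days_from_epoch: int) -> int:
        # Multiplication is order-preserving, so old lower/upper bounds remain
        # valid bounds after conversion to timestamp.
        return days_from_epoch * MICROS_PER_DAY

    # e.g. a file whose old date bounds were 19_000..19_030 days can be pruned
    # against a timestamp predicate by comparing it to promote_date_bound(19_000)
    # and promote_date_bound(19_030).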