Thanks Dan,

I'm pretty strongly opposed to the idea of assigning new field ids as part
> of type promotion.  I understand what we're trying to accomplish, but I
> just don't think that's the right mechanism to achieve it.  The field id
> specifically identifies the field and shouldn't change as
> attributes change (name, type, comment, etc.).


This seems to be the consensus.  I'm OK with this, but I really think we
should be restricting type promotions to a smaller set (more on this
below).  It still feels like it might be useful to explicitly track type
promotions in the current schema rather than having to refer to and parse
historical schemas at scan time.

Historically, we've only allowed obvious and safe promotion (e.g. int to
> long).  I think some of the more "practical" cases like long to timestamp
> have made this more confusing because they seem arbitrary in terms of how
> we define the promotion (e.g. always interpret as ms) and may be confusing
> the situation because we want to accommodate typical use cases where people
> are migrating data without rewriting.


Long (as ms) to timestamp does seem arbitrary, and as far as I can tell few
if any systems support it as a type conversion.

FWIW, I did a quick survey of other technologies and I found the following
(I might have missed something or maybe there are more docs):

Redshift [1] - Appears to only support widening varchar columns (and as far
as I can tell does not support any other type promotions).

Snowflake [2] - Only supports changes that are metadata-only or strict type
widening (no conversion to string), with the exception that precision can
be narrowed if all data currently fits into the column.

BigQuery [3] - Only supports some numeric conversions (mostly widening) and
widening of parameterized types (no conversion to string).

Hive (by default) [4] - Supports anything to string, and widening of
numeric types.

Delta Lake [5] - Numeric type widening and date to timestampNTZ (no
conversion to string).

Postgres - Implicitly supports any conversion that has an "assignment cast"
(I couldn't find a canonical list of built-in assignment casts, but integer
to string seems to be included) or arbitrary conversions via a USING
expression.  As a side note, IIUC, any column type change rewrites the
entire table eagerly in Postgres.

I'm not convinced we want to make this more flexible or configurable in
> terms of how you can promote or what the representation is (e.g. different
> formats of date to string) as that introduces a lot of complexity and in
> most cases is safer to require a rewrite when making complex
> transformations on table.  I feel like those situations are better use
> cases for views and we shouldn't try to accommodate everything at the table
> level.


Where do you see the line between views and the table level?  Views can be
used to accomplish all type promotions, but there is certainly value in
having some built in, and in supporting data migrations directly in the
table format.  What is the rubric we will use to decide when to add a new
promotion rule?

IMO, based on the analysis above (and also for simplicity), I think
built-in promotions should be limited to either metadata-only operations or
widening that is purely a padding operation.
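
To make "padding only" concrete, here is a minimal sketch (Python; it
assumes the spec's little-endian single-value encodings for bounds, and the
values are made up for illustration):

import struct

# An int column's lower bound, serialized as Iceberg single-value binary
# (4-byte little-endian).
int_bound = struct.pack("<i", -42)

# int -> long is "padding only": the long encoding of the same value is
# just the sign-extended 8-byte little-endian form, so old bounds stay
# unambiguous (readers already infer the old type from the 4-byte length).
long_bound = struct.pack("<q", struct.unpack("<i", int_bound)[0])
assert struct.unpack("<q", long_bound)[0] == -42

# int -> string is not padding only: the representation changes entirely,
# and nothing in the stored bytes says whether they hold UTF-8 text or a
# stale numeric bound.
string_bound = str(-42).encode("utf-8")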

Therefore, FWIW, I'm -0.0 on supporting promotion to string.  I'd like to
validate that the restrictions we are putting in place actually make it
usable for its intended purpose, and that we can actually identify all the
ancillary features that need to be special-cased.  So far we have discussed
stats and partition transforms; it seems sort orders, Puffin files, and
default values also need to be examined (and possibly other features?).  I
also think that we still run into problems (perhaps minor) that "Parquet as
manifests" doesn't solve here if we ever need to account for transitive
promotions (e.g. Integer->Decimal->String).
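
As a toy illustration of the transitive case (hypothetical, since
int->decimal is not a supported promotion today): without explicit lineage,
a reader cannot tell which string an old value should surface as.

from decimal import Decimal

raw = 5  # value (or bound) written back when the field was an int

direct = str(raw)  # int -> string gives "5"
via_decimal = str(Decimal(raw).quantize(Decimal("0.01")))  # int -> decimal(9,2) -> string gives "5.00"
assert direct != via_decimal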

Also, FWIW, I'm -1.0 on being able to promote long (interpreted as
milliseconds) to timestamp.  This seems like an esoteric and arbitrary
decision for something core to the table format.
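
For example (a rough sketch, value made up), the same long is a plausible
epoch value in several units, and nothing in the table records which one
was intended:

from datetime import datetime, timezone

raw = 1724172711  # a long column; the unit is not recorded anywhere

as_seconds = datetime.fromtimestamp(raw, tz=timezone.utc)        # 2024-08-20
as_millis = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)  # 1970-01-20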

If we want to make backfilling easier for more drastic type changes, then I
think treating them as explicitly adding a new column with a default
derived from an old column makes more sense to me (I imagine this wouldn't
be supported with SET COLUMN TYPE DDL, but rather with ADD COLUMN, given
the consensus that implicitly changing field IDs is not a good idea).
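
A rough sketch of what that could look like in metadata (purely
hypothetical: the expression-valued default below is not existing Iceberg
metadata, it just extends Fokko's earlier example, written here as a Python
dict):

new_column = {
    "name": "col_1_str",        # hypothetical new column added via ADD COLUMN
    "field-id": 4,              # freshly assigned id; col_1 keeps its own id
    "type": "string",
    "initial-default": {        # hypothetical expression-valued default
        "function_name": "to_string",
        "input_argument": {"column_id": 1, "column_type": "int"},
    },
}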

Thanks,
Micah

[1] https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html
[2] https://docs.snowflake.com/en/sql-reference/sql/alter-table-column
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_column_set_data_type_statement
[4] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions
[5] https://learn.microsoft.com/en-us/azure/databricks/delta/type-widening


On Tue, Aug 20, 2024 at 12:59 PM Daniel Weeks <dwe...@apache.org> wrote:

>
> I'm pretty strongly opposed to the idea of assigning new field ids as part
> of type promotion.  I understand what we're trying to accomplish, but I
> just don't think that's the right mechanism to achieve it.  The field id
> specifically identifies the field and shouldn't change as
> attributes change (name, type, comment, etc.).
>
> Isolating this issue to the most basic level, we always have a file with a
> source physical/logical representation and a target representation defined
> by the current table schema.  If we enumerate the specific transforms
> between the source and the target in the spec, there is no need to have a
> way to represent those in at the metadata level.  Historically, we've only
> allowed obvious and safe promotion (e.g. int to long).  I think some of the
> more "practical" cases like long to timestamp have made this more confusing
> because they seem arbitrary in terms of how we define the promotion (e.g.
> always interpret as ms) and may be confusing the situation because we want
> to accommodate typical use cases where people are migrating data without
> rewriting.
>
> I'm not convinced we want to make this more flexible or configurable in
> terms of how you can promote or what the representation is (e.g. different
> formats of date to string) as that introduces a lot of complexity and in
> most cases is safer to require a rewrite when making complex
> transformations on table.  I feel like those situations are better use
> cases for views and we shouldn't try to accommodate everything at the table
> level.
>
> The issue here is that we don't have appropriate fidelity in the metadata
> to map the stats to the current schema, so we can either introduce more
> metadata to close the gap or make partial progress until there's a better
> option (like moving the parquet metadata with explicit type information at
> the stats level).
>
> In terms of keeping complexity low, I'd lean more towards restricting
> evolution (like with incompatible transforms) than trying to accommodate
> those use cases with a more complicated representation.
>
> Best,
> -Dan
>
>
> On Tue, Aug 20, 2024 at 10:14 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Fokko,
>>
>>
>>> In this case, we still need to keep the schemas. As an example:
>>
>> The example you gave is close to what I was imagining (if we get to
>> details I might have a slightly different organization).  This might be
>> semantic, but I don't see this as keeping "schemas", since all data is
>> present in the current schema, and not in historical schemas (i.e. schemas
>> with different IDs).
>>
>> My main concern is that this can explode in complexity, especially when
>>> we end up with cycles: date → timestamp → string → datetime. With custom
>>> formatting is it also easy to go: timestamp → string → timestamptz.
>>
>>
>> To be clear I'm not advocating for introducing  all of these elements,
>> but I think leaving the door open for them is useful.  On the topic of
>> complexity, here is an example of where having explicit lineage modelled is
>> useful:
>>
>> long->timestamp->string (timestamp to string is not proposed yet but has
>> come up in private conversations).
>>
>> I assume we want timestamp->string to produce a human readable
>> timestamp?  If we don't store explicit lineage what will the process be to
>> ensure we get "2024-08-20T09:51:51-07:00" instead of "1724172711000" if we
>> encounter a file written when the data was a "long"
>>
>>
>> I agree, and this is also mentioned in the proposal. But each granularity
>>> has different types, eg. timestamp, timestamp_ns. So this is not really
>>> a problem.
>>
>>
>> Sorry I don't think I've seen the proposal (
>> https://github.com/apache/iceberg/issues/9065 is light on details and
>> maybe my mailing list searches are failing to turn up anything on dev@).
>> Given that TIMESTAMP_NS didn't exist prior to V3, I'd expect at least some
>> people have been storing this data as long and might want to convert it to
>> a proper timestamp column (assuming the long is in MILLISECONDs appears to
>> preclude this)?  Similarly, it seems like we would be ruling out casting
>> "seconds".
>>
>> I think that's fair <https://github.com/apache/iceberg/pull/5151>, but
>>> believe we should be opinionated about how we store data. Would we be able
>>> to write hex-encoded strings?
>>
>>
>> The issue here is that binary->string (utf-8) can fail and should ideally
>> be handled.  hex-encoded strings cannot fail.  Right now I believe there is
>> again ambiguity if we would apply bytes->string for default values, since
>> we don't have enough metadata at the moment to distinguish between these
>> two values in the JSON serialized initial value (currently bytes are
>> written as hex encoded values, so if we read a file that did not have the
>> existing column present, it would surface the hex encoded bytes and not
>> Avro's/in place cast interpretation of the data).
>>
>> I'm hesitant to introduce this complexity since I think for example the
>>> formatting or parsing of custom formats should be done in the engine and
>>> not the format.
>>
>>
>> Again, I'm not suggesting we need to introduce complexity.  Given the
>> examples raised above I think there is already an issue with transforms in
>> scope.  The proposed solution of new field IDs and explicit lineage solves
>> short term problems, and if we wish gives more flexibility in the long run.
>>
>> Thanks,
>> Micah
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Aug 20, 2024 at 4:32 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Yes, I was thinking it would be a recursive structure that tracked each
>>>> change.  Cleanup could be achieved by also tracking schema ID of the last
>>>> time the field was present along with the schema ID of the written data
>>>> files in manifests (as discussed on the other threads), and cleaning up
>>>> lineage when no data files were written in the schema.
>>>>
>>>
>>> In this case, we still need to keep the schemas. As an example:
>>>
>>> {"name": "col_1", "field-id": 1, "type": int}
>>>
>>> Schema after type alteration ("alter col_1 set type long"):
>>>
>>> {"name": "col_1", "field-id": 2, "initial-default": {
>>>    "function_name": to_long?
>>>    "input_argument": {
>>>        "column_id": 1,
>>>        "column_type": int
>>>    }
>>> }
>>>
>>> Schema after type alteration ("alter col_1 set type string"):
>>>
>>> {"name": "col_1", "field-id": 3, "initial-default": {
>>>    "function_name": to_string
>>>    "input_argument": {
>>>        "column_id": 2,
>>>        "column_type": long
>>>    }
>>> }
>>>
>>> If I understand correctly, you're suggesting:
>>>
>>> {"name": "col_1", "field-id": 3, "initial-default": {
>>>    "function_name": to_string
>>>    "input_argument": {
>>>        "column_id": 2,
>>>        "column_type": {
>>>            "function_name": to_long?
>>>            "input_argument": {
>>>                "column_id": 1,
>>>                "column_type": int
>>>            }
>>>        }
>>>    }
>>> }
>>>
>>> My main concern is that this can explode in complexity, especially when
>>> we end up with cycles: date → timestamp → string → datetime. With custom
>>> formatting is it also easy to go: timestamp → string → timestamptz.
>>>
>>> *  There is already an existing proposal
>>>> <https://github.com/apache/iceberg/issues/9065> to promote from
>>>> Long->Timestamp [2] which assumes milliseconds.  This seems like an
>>>> arbitrary choice, where one could reasonably have other time granularities
>>>> stored in Long values.
>>>
>>>
>>> I agree, and this is also mentioned in the proposal. But each
>>> granularity has different types, eg. timestamp, timestamp_ns. So this
>>> is not really a problem.
>>>
>>> *  Will "bytes to string" be a candidate? There are two reasonable
>>>> approaches for underlying data either using hex-encoded value or doing an
>>>> in-place conversion assuming all data is UTF-8?
>>>
>>>
>>> I think that's fair <https://github.com/apache/iceberg/pull/5151>, but
>>> believe we should be opinionated about how we store data. Would we be able
>>> to write hex-encoded strings?
>>>
>>> I'd argue that once a schema is going from "any type"->"string",
>>>> something  was fairly wrong with data modelling initially, providing more
>>>> tools to help users fix these types of issues seems beneficial in the long
>>>> run (again not something that needs to be done now but laying the
>>>> ground-work is useful).
>>>
>>>
>>> Similar to the point above we should be opinionated about this. For
>>> example, historically we've been parsing dates strictly, as an example, see
>>> DateTimeUtil
>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java>.
>>>
>>>
>>> I'm hesitant to introduce this complexity since I think for example the
>>> formatting or parsing of custom formats should be done in the engine and
>>> not the format.
>>>
>>> Kind regards,
>>> Fokko Driesprong
>>>
>>>
>>>
>>> Op di 20 aug 2024 om 05:58 schreef Micah Kornfield <
>>> emkornfi...@gmail.com>:
>>>
>>>> Hi Xiangjin,
>>>>
>>>>  Could you elaborate a bit more on how the Parquet manifest would fix
>>>>> the type promotion problem? If the lower and upper bounds are still a map
>>>>> of <int, binary>, I don't think we can perform column pruning on that, and
>>>>> the type information of the stat column is still missing.
>>>>
>>>>
>>>> I think the idea with Parquet files is one would no longer use a map to
>>>> track these statistics but instead have a column per field-id/statistics
>>>> pair.  So for each column in the original schema that would wanted to have
>>>> either:
>>>>
>>>> struct col1_stats {
>>>>    int min_value;
>>>>    int max_value;
>>>>    long null_count,
>>>>    etc.
>>>> }
>>>>
>>>> or a struct per statistic with each column:
>>>>
>>>> struct min_stats {
>>>>    int col1;
>>>>    string col2;
>>>>   etc
>>>> }
>>>>
>>>> This is similar to how partition values are stored today in Avro.  And
>>>> I don't think there is anything stopping from doing this in Avro either,
>>>> except it is potentially less useful because you can't save much by
>>>> selecting only columns in Avro.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Aug 19, 2024 at 7:50 PM xianjin <xian...@apache.org> wrote:
>>>>
>>>>> Hey Ryan,
>>>>>
>>>>> Thanks for the reply, it clears most things up. Some responses inline:
>>>>>
>>>>> > This ends up being a little different because we can detect more
>>>>> cases when the bounds must have been strings — any time when the length of
>>>>> the upper and lower bound is different. Because strings tend to have 
>>>>> longer
>>>>> values (we truncate to 16 chars by default) and are not typically a fixed
>>>>> number of bytes, we don’t expect very many cases in which you’d have to
>>>>> discard the string bounds.
>>>>>
>>>>> > That said, if we want to be safe we could also wait to introduce int
>>>>> and long to string promotion, or put it behind a flag to opt in.
>>>>>
>>>>> Thanks for the clarification. I would prefer to put it behind a flag
>>>>> to opt in, considering the complexity it involves and the potential
>>>>> interests from users.
>>>>>
>>>>> > We’ve discussed this topic before and the consensus has been to
>>>>> allow type promotion if there is no partition field that would be 
>>>>> affected.
>>>>> Here’s the wording I’ve added in the PR:
>>>>>
>>>>> Type promotion is not allowed for a field that is referenced by
>>>>> source-id or source-ids of a partition field if the partition
>>>>> transform would produce a different value after promoting the type.
>>>>>
>>>>> > I think that covers the cases.
>>>>>
>>>>> Yes, I think it's indeed a good solution.
>>>>>
>>>>>> > There might be an easy/light way to add this new metadata: we can
>>>>>> persist schema_id in the DataFile.
>>>>>
>>>>> > Fokko already covered some of the challenges here and I agree with
>>>>>> those. I think that we want to be able to discard older schemas or not 
>>>>>> send
>>>>>> them to clients. And I think that since we can fix this in the next rev 
>>>>>> of
>>>>>> the spec, adding metadata just for this purpose is not a good strategy. 
>>>>>> I’d
>>>>>> much rather work on the long-term fix than add metadata and complicated
>>>>>> logic to pass schemas through and detect the type.
>>>>>
>>>>> > I think it might be worthy as we can always rewrite old data files
>>>>>> to use the latest schema.
>>>>>
>>>>> > I agree, but this doesn’t solve the problem with the current
>>>>>> manifest files. This is something I would introduce when we add Parquet
>>>>>> manifest files.
>>>>>
>>>>>
>>>>>  Could you elaborate a bit more on how the Parquet manifest would fix
>>>>> the type promotion problem? If the lower and upper bounds are still a map
>>>>> of <int, binary>, I don't think we can perform column pruning on that, and
>>>>> the type information of the stat column is still missing.
>>>>>
>>>>> On Tue, Aug 20, 2024 at 1:02 AM Ryan Blue <b...@databricks.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> If the reader logic depends on the length of data in bytes, will this
>>>>>> prevent us from adding any type promotions to string?
>>>>>>
>>>>>> This ends up being a little different because we can detect more
>>>>>> cases when the bounds must have been strings — any time when the length 
>>>>>> of
>>>>>> the upper and lower bound is different. Because strings tend to have 
>>>>>> longer
>>>>>> values (we truncate to 16 chars by default) and are not typically a fixed
>>>>>> number of bytes, we don’t expect very many cases in which you’d have to
>>>>>> discard the string bounds.
>>>>>>
>>>>>> That said, if we want to be safe we could also wait to introduce int
>>>>>> and long to string promotion, or put it behind a flag to opt in.
>>>>>>
>>>>>> There might be an easy/light way to add this new metadata: we can
>>>>>> persist schema_id in the DataFile.
>>>>>>
>>>>>> Fokko already covered some of the challenges here and I agree with
>>>>>> those. I think that we want to be able to discard older schemas or not 
>>>>>> send
>>>>>> them to clients. And I think that since we can fix this in the next rev 
>>>>>> of
>>>>>> the spec, adding metadata just for this purpose is not a good strategy. 
>>>>>> I’d
>>>>>> much rather work on the long-term fix than add metadata and complicated
>>>>>> logic to pass schemas through and detect the type.
>>>>>>
>>>>>> I think there’s also another aspect to consider: whether the new type
>>>>>> promotion is compatible with partition transforms
>>>>>>
>>>>>> We’ve discussed this topic before and the consensus has been to allow
>>>>>> type promotion if there is no partition field that would be affected.
>>>>>> Here’s the wording I’ve added in the PR:
>>>>>>
>>>>>> Type promotion is not allowed for a field that is referenced by
>>>>>> source-id or source-ids of a partition field if the partition
>>>>>> transform would produce a different value after promoting the type.
>>>>>>
>>>>>> I think that covers the cases.
>>>>>>
>>>>>> I think it might be worthy as we can always rewrite old data files to
>>>>>> use the latest schema.
>>>>>>
>>>>>> I agree, but this doesn’t solve the problem with the current manifest
>>>>>> files. This is something I would introduce when we add Parquet manifest
>>>>>> files.
>>>>>>
>>>>>> On Mon, Aug 19, 2024 at 9:26 AM Amogh Jahagirdar <2am...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hey all,
>>>>>>>
>>>>>>> > There might be an easy/light way to add this new metadata: we can
>>>>>>> persist schema_id in the DataFile. It still adds some extra size to the
>>>>>>> manifest file but should be negligible?
>>>>>>>
>>>>>>> I do think it's probably negligible in terms of the size (at least
>>>>>>> in terms of the value that we get out of having that field) but I think 
>>>>>>> the
>>>>>>> main downside of that approach is that it's still an additional spec 
>>>>>>> change
>>>>>>> which if we consider going down the route of Parquet manifests then that
>>>>>>> field largely becomes moot/throw-away work at that point since the type
>>>>>>> information would exist in Parquet.
>>>>>>>
>>>>>>> > And I think there’s also another aspect to consider: whether the
>>>>>>> new type promotion is compatible with partition transforms. Currently 
>>>>>>> all
>>>>>>> the partition transforms produce the same result for promoted types: 
>>>>>>> int ->
>>>>>>> long, float -> double. If we are adding a new type promotion, current
>>>>>>> partition transforms will produce different results for type promotion 
>>>>>>> such
>>>>>>> as int/long -> string, so the partition() of DataFile will not hold for
>>>>>>> promoted types. One possible way to fix that would be evolving the
>>>>>>> PartitionSpec with a new one?
>>>>>>>
>>>>>>> Yes, an important part of type promotion is validation that whatever
>>>>>>> evolution is being attempted can actually happen if the column being
>>>>>>> evolved is part of a partition transform! I was working on an
>>>>>>> implementation for this and so far it's just a strict check that the 
>>>>>>> field
>>>>>>> being promoted is not part of the partition spec. The proposed way of
>>>>>>> evolving the spec sounds plausible specifically for int/long -> string,
>>>>>>> although I'd need to double check/think through implications.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Amogh Jahagirdar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 19, 2024 at 8:43 AM Xianjin YE <xian...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Fokko,
>>>>>>>>
>>>>>>>> > Distribute all the schemas to the executors, and we have to do
>>>>>>>> the lookup and comparison there.
>>>>>>>>
>>>>>>>> I don’t think this would be a problem: the schema id in the
>>>>>>>> DataFile should be only used in driver’s planning phase to determine 
>>>>>>>> the
>>>>>>>> lower/upper bounds, so no extra schema except the current one should be
>>>>>>>> distributed to the executor. Even if all schemas are required, they 
>>>>>>>> can be
>>>>>>>> retrieved from SerializableTable’s lazyTable() method.
>>>>>>>>
>>>>>>>> > Not being able to prune old schema until they are not used
>>>>>>>> anymore (including all historical snapshots).
>>>>>>>>
>>>>>>>> That’s indeed a problem. However we haven’t add the ability to
>>>>>>>> prune unused schemas yet(which I would like to add too after
>>>>>>>> RemoveUnusedSpecs), we can consider that when implementing. BTW, I 
>>>>>>>> think it
>>>>>>>> might be worthy as we can always rewrite old data files to use the 
>>>>>>>> latest
>>>>>>>> schema.
>>>>>>>>
>>>>>>>> On Aug 19, 2024, at 22:19, Fokko Driesprong <fo...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks Ryan for bringing this up, that's an
>>>>>>>> interesting problem, let me think about this.
>>>>>>>>
>>>>>>>> we can persist schema_id in the DataFile
>>>>>>>>
>>>>>>>>
>>>>>>>> This was also my first thought. The two drawbacks are:
>>>>>>>>
>>>>>>>>    - Distribute all the schemas to the executors, and we have to
>>>>>>>>    do the lookup and comparison there.
>>>>>>>>    - Not being able to prune old schema until they are not used
>>>>>>>>    anymore (including all historical snapshots).
>>>>>>>>
>>>>>>>> If we are adding new type promotion, current partition transforms
>>>>>>>>> will produce different result for type promotion such as int/long ->
>>>>>>>>> string, so the partition() of DataFile will not hold for promoted 
>>>>>>>>> types.
>>>>>>>>> One possible way to fix that would be evolving the PartitionSpec with 
>>>>>>>>> a new
>>>>>>>>> one?
>>>>>>>>
>>>>>>>>
>>>>>>>> That's a good call! Currently, we create partition spec evaluators
>>>>>>>> based on the partition-spec-id. Evolving the partition spec would fix 
>>>>>>>> it.
>>>>>>>> When we decide to include the schema-id, we would be able to create the
>>>>>>>> evaluator based on the (partition-spec-id, schema-id) tuple when 
>>>>>>>> evaluating
>>>>>>>> the partitions.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Fokko
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Op ma 19 aug 2024 om 15:59 schreef Xianjin YE <xian...@apache.org>:
>>>>>>>>
>>>>>>>>> Thanks Ryan for bringing this up.
>>>>>>>>>
>>>>>>>>> > int and long to string
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Could you elaborate a bit on how we can support type promotion for
>>>>>>>>> `int` and `long` to `string` if the upper and lower bounds are already
>>>>>>>>> encoded in 4/8 bytes binary? It seems that we cannot add promotions to
>>>>>>>>> string as Piotr pointed out?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > My rationale for not adding new information to track the bound
>>>>>>>>> types at the time that the data file metadata is created is that it 
>>>>>>>>> would
>>>>>>>>> inflate the size of manifests and push out the timeline for getting 
>>>>>>>>> v3 done.
>>>>>>>>>
>>>>>>>>> There might be an easy/light way to add this new metadata: we can
>>>>>>>>> persist schema_id in the DataFile. It still adds some extra size to 
>>>>>>>>> the
>>>>>>>>> manifest file but should be negligible?
>>>>>>>>>
>>>>>>>>> And I think there’s also another aspect to consider: whether the
>>>>>>>>> new type promotion is compatible with partition transforms. Currently 
>>>>>>>>> all
>>>>>>>>> the partition transforms produce the same result for promoted types: 
>>>>>>>>> int ->
>>>>>>>>> long, float -> double. If we are adding new type promotion, current
>>>>>>>>> partition transforms will produce different result for type promotion 
>>>>>>>>> such
>>>>>>>>> as int/long -> string, so the partition() of DataFile will not hold 
>>>>>>>>> for
>>>>>>>>> promoted types. One possible way to fix that would be evolving the
>>>>>>>>> PartitionSpec with a new one?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> I’ve recently been working on updating the spec for new types and
>>>>>>>>> type promotion cases in v3.
>>>>>>>>> I was talking to Micah and he pointed out an issue with type
>>>>>>>>> promotion: the upper and lower bounds for data file columns that are 
>>>>>>>>> kept
>>>>>>>>> in Avro manifests don’t have any information about the type that was 
>>>>>>>>> used
>>>>>>>>> to encode the bounds.
>>>>>>>>> For example, when writing to a table with a float column, 4: f,
>>>>>>>>> the manifest’s lower_bounds and upper_bounds maps will have an entry 
>>>>>>>>> with
>>>>>>>>> the type ID (4) as the key and a 4-byte encoded float for the value. 
>>>>>>>>> If
>>>>>>>>> column f were later promoted to double, those maps aren’t changed. 
>>>>>>>>> The way
>>>>>>>>> we currently detect that the type was promoted is to check the binary 
>>>>>>>>> value
>>>>>>>>> and read it as a float if there are 4 bytes instead of 8. This 
>>>>>>>>> prevents us
>>>>>>>>> from adding int to double type promotion because when there are 4 
>>>>>>>>> bytes we
>>>>>>>>> would not know whether the value was originally an int or a float.
>>>>>>>>> Several of the type promotion cases from my previous email hit
>>>>>>>>> this problem. Date/time types to string, int and long to string, and 
>>>>>>>>> long
>>>>>>>>> to timestamp are all affected. I think the best path forward is to add
>>>>>>>>> fewer type promotion cases to v3 and support only these new cases:
>>>>>>>>> • int and long to string
>>>>>>>>> • date to timestamp
>>>>>>>>> • null/unknown to any
>>>>>>>>> • any to variant (if supported by the Variant spec)
>>>>>>>>> That list would allow us to keep using the current strategy and
>>>>>>>>> not add new metadata to track the type to our manifests. My rationale 
>>>>>>>>> for
>>>>>>>>> not adding new information to track the bound types at the time that 
>>>>>>>>> the
>>>>>>>>> data file metadata is created is that it would inflate the size of
>>>>>>>>> manifests and push out the timeline for getting v3 done. Many of us 
>>>>>>>>> would
>>>>>>>>> like to get v3 released to get the timestamp_ns and variant types 
>>>>>>>>> out. And
>>>>>>>>> if we can get at least some of the promotion cases out that’s better.
>>>>>>>>> To address type promotion in the long term, I think that we should
>>>>>>>>> consider moving to Parquet manifests. This has been suggested a few 
>>>>>>>>> times
>>>>>>>>> so that we can project just the lower and upper bounds that are 
>>>>>>>>> needed for
>>>>>>>>> scan planning. That would also fix type promotion because the 
>>>>>>>>> manifest file
>>>>>>>>> schema would include full type information for the stats columns. 
>>>>>>>>> Given the
>>>>>>>>> complexity of releasing Parquet manifests, I think it makes more 
>>>>>>>>> sense to
>>>>>>>>> get a few promotion cases done now in v3 and follow up with the rest 
>>>>>>>>> in v4.
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>> Ryan Blue
>>>>>> Databricks
>>>>>>
>>>>>
