Thanks Ryan for bringing this up, that's an interesting problem, let me
think about this.

we can persist schema_id in the DataFile


This was also my first thought. The two drawbacks are:

   - Distribute all the schemas to the executors, and we have to do the
   lookup and comparison there.
   - Not being able to prune old schema until they are not used anymore
   (including all historical snapshots).

If we are adding new type promotion, current partition transforms will
> produce different result for type promotion such as int/long -> string, so
> the partition() of DataFile will not hold for promoted types. One possible
> way to fix that would be evolving the PartitionSpec with a new one?


That's a good call! Currently, we create partition spec evaluators based on
the partition-spec-id. Evolving the partition spec would fix it. When we
decide to include the schema-id, we would be able to create the evaluator
based on the (partition-spec-id, schema-id) tuple when evaluating the
partitions.

Kind regards,
Fokko



Op ma 19 aug 2024 om 15:59 schreef Xianjin YE <xian...@apache.org>:

> Thanks Ryan for bringing this up.
>
> > int and long to string
>
>
> Could you elaborate a bit on how we can support type promotion for `int`
> and `long` to `string` if the upper and lower bounds are already encoded in
> 4/8 bytes binary? It seems that we cannot add promotions to string as Piotr
> pointed out?
>
>
> > My rationale for not adding new information to track the bound types at
> the time that the data file metadata is created is that it would inflate
> the size of manifests and push out the timeline for getting v3 done.
>
> There might be an easy/light way to add this new metadata: we can persist
> schema_id in the DataFile. It still adds some extra size to the manifest
> file but should be negligible?
>
> And I think there’s also another aspect to consider: whether the new type
> promotion is compatible with partition transforms. Currently all the
> partition transforms produce the same result for promoted types: int ->
> long, float -> double. If we are adding new type promotion, current
> partition transforms will produce different result for type promotion such
> as int/long -> string, so the partition() of DataFile will not hold for
> promoted types. One possible way to fix that would be evolving the
> PartitionSpec with a new one?
>
>
> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
>
> I’ve recently been working on updating the spec for new types and type
> promotion cases in v3.
> I was talking to Micah and he pointed out an issue with type promotion:
> the upper and lower bounds for data file columns that are kept in Avro
> manifests don’t have any information about the type that was used to encode
> the bounds.
> For example, when writing to a table with a float column, 4: f, the
> manifest’s lower_bounds and upper_bounds maps will have an entry with the
> type ID (4) as the key and a 4-byte encoded float for the value. If column
> f were later promoted to double, those maps aren’t changed. The way we
> currently detect that the type was promoted is to check the binary value
> and read it as a float if there are 4 bytes instead of 8. This prevents us
> from adding int to double type promotion because when there are 4 bytes we
> would not know whether the value was originally an int or a float.
> Several of the type promotion cases from my previous email hit this
> problem. Date/time types to string, int and long to string, and long to
> timestamp are all affected. I think the best path forward is to add fewer
> type promotion cases to v3 and support only these new cases:
> • int and long to string
> • date to timestamp
> • null/unknown to any
> • any to variant (if supported by the Variant spec)
> That list would allow us to keep using the current strategy and not add
> new metadata to track the type to our manifests. My rationale for not
> adding new information to track the bound types at the time that the data
> file metadata is created is that it would inflate the size of manifests and
> push out the timeline for getting v3 done. Many of us would like to get v3
> released to get the timestamp_ns and variant types out. And if we can get
> at least some of the promotion cases out that’s better.
> To address type promotion in the long term, I think that we should
> consider moving to Parquet manifests. This has been suggested a few times
> so that we can project just the lower and upper bounds that are needed for
> scan planning. That would also fix type promotion because the manifest file
> schema would include full type information for the stats columns. Given the
> complexity of releasing Parquet manifests, I think it makes more sense to
> get a few promotion cases done now in v3 and follow up with the rest in v4.
> Ryan
>
> --
> Ryan Blue
>
>
>

Reply via email to