Re: Type promotion in v3

2024-08-30 Thread Micah Kornfield
> At the same time, I also agree with Micah's point that we should make sure we analyze the implications of this particular int/long -> string promotion on sort order(s), puffin and default values and any special casing this may introduce for those capabilities.
One other area I forgot wa

Re: Type promotion in v3

2024-08-26 Thread Amogh Jahagirdar
Just caught up on this discussion, thanks all for the insights! I largely agree with Ryan's point that read-time transformation is different from a normal type promotion even though the read-time transformation may seem logically equivalent. The compelling part of leveraging the upper/lower bounds

Re: Type promotion in v3

2024-08-22 Thread Ryan Blue
Thanks for the discussion, everyone. I think the back and forth between Fokko and Micah helped me understand Micah's position more clearly. I can see how some of the challenges that I raised would be solved, like moving the previous field into the metadata of the transformation. I agree with a lot o

Re: Type promotion in v3

2024-08-20 Thread Micah Kornfield
Thanks Dan,
> I'm pretty strongly opposed to the idea of assigning new field ids as part of type promotion. I understand what we're trying to accomplish, but I just don't think that's the right mechanism to achieve it. The field id specifically identifies the field and shouldn't change as

Re: Type promotion in v3

2024-08-20 Thread Daniel Weeks
I'm pretty strongly opposed to the idea of assigning new field ids as part of type promotion. I understand what we're trying to accomplish, but I just don't think that's the right mechanism to achieve it. The field id specifically identifies the field and shouldn't change as attributes change (na

Re: Type promotion in v3

2024-08-20 Thread Micah Kornfield
Hi Fokko,
> In this case, we still need to keep the schemas. As an example:
The example you gave is close to what I was imagining (if we get to details I might have a slightly different organization). This might be semantic, but I don't see this as keeping "schemas", since all data is present i

Re: Type promotion in v3

2024-08-20 Thread Xianjin YE
Hi Micah,
> I think the idea with Parquet files is one would no longer use a map to track these statistics but instead have a column per field-id/statistics pair.
> ….
> This is similar to how partition values are stored today in Avro. And I don’t think there is anything stopping from doi

Re: Type promotion in v3

2024-08-20 Thread Fokko Driesprong
> Yes, I was thinking it would be a recursive structure that tracked each change. Cleanup could be achieved by also tracking schema ID of the last time the field was present along with the schema ID of the written data files in manifests (as discussed on the other threads), and cleaning up

Re: Type promotion in v3

2024-08-19 Thread Micah Kornfield
Hi Xianjin,
> Could you elaborate a bit more on how the Parquet manifest would fix the type promotion problem? If the lower and upper bounds are still a map of , I don't think we can perform column pruning on that, and the type information of the stat column is still missing.
I think the i

Re: Type promotion in v3

2024-08-19 Thread Micah Kornfield
> If we go with the approach that type promotion results in a change in the field-id, what happens when a certain field has been changed multiple times? Does it mean that we end up with tracking the lineage of field change history?
Yes, I was thinking it would be a recursive structure tha

Re: Type promotion in v3

2024-08-19 Thread xianjin
Hey Ryan, Thanks for the reply, it clears most things up. Some responses inline:
> This ends up being a little different because we can detect more cases when the bounds must have been strings: any time when the length of the upper and lower bound is different. Because strings tend to have longe

Re: Type promotion in v3

2024-08-19 Thread Gang Wu
Hi Micah, If we go with the approach that type promotion results in a change in the field-id, what happens when a certain field has been changed multiple times? Does it mean that we end up with tracking the lineage of field change history? Thanks, Gang On Tue, Aug 20, 2024 at 7:34 AM Micah Kornf

Re: Type promotion in v3

2024-08-19 Thread Micah Kornfield
Hi Ryan, Thanks for the reply, responses inline
> - How do we keep track of the replaced column? Does it remain in the schema? Either we would need to keep the old schemas or implement a new “hidden” column state
I don't think this is the case, the function metadata provides al

Re: Type promotion in v3

2024-08-19 Thread Ryan Blue
I don’t think that type promotion by replacing a column is a good direction to head. Right now we have a fairly narrow problem of not having the original type information for stats. That’s a problem with a fairly simple solution in the long term and it doesn’t require the added complexity of replac

Re: Type promotion in v3

2024-08-19 Thread Ryan Blue
> If the reader logic depends on the length of data in bytes, will this prevent us from adding any type promotions to string?
This ends up being a little different because we can detect more cases when the bounds must have been strings: any time when the length of the upper and lower bound is diffe
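The length-based detection mentioned in this message can be sketched roughly as below. This is only an illustration under stated assumptions, not Iceberg's reader logic: the function name `bounds_must_be_strings` and the `fixed_width` parameter are hypothetical, and the sketch assumes that `int` and `long` bounds always serialize to exactly 4 and 8 bytes while string bounds are raw UTF-8 bytes. The check against `fixed_width` is an added assumption beyond the length-mismatch case described in the thread.

```python
def bounds_must_be_strings(lower: bytes, upper: bytes, fixed_width: int) -> bool:
    """Return True only when the bounds cannot be fixed-width numeric values.

    fixed_width is the assumed serialized size of the numeric type
    (4 for int, 8 for long). A False result means "ambiguous", not "numeric".
    """
    # Case from the thread: bounds of different lengths cannot both be
    # fixed-width numbers, so they must have been written as strings.
    if len(lower) != len(upper):
        return True
    # Added assumption: equal lengths that differ from the numeric width
    # also rule out the numeric interpretation.
    return len(lower) != fixed_width


# Different lengths: must be strings.
print(bounds_must_be_strings(b"12", b"1234", fixed_width=4))
# Both exactly 4 bytes: still ambiguous, so the check cannot decide.
print(bounds_must_be_strings(b"1234", b"5678", fixed_width=4))
```

The second call shows the limit of the heuristic: when both bounds happen to match the numeric width, length alone cannot tell the two encodings apart.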

Re: Type promotion in v3

2024-08-19 Thread Micah Kornfield
I think continuing to define type promotion as something that happens implicitly from the reader perspective has a few issues: 1. It makes it difficult to reason about all additional features that might require stable types to interpret. Examples of existing filters: partition statistics file, e

Re: Type promotion in v3

2024-08-19 Thread Amogh Jahagirdar
Hey all,
> There might be an easy/light way to add this new metadata: we can persist schema_id in the DataFile. It still adds some extra size to the manifest file but should be negligible?
I do think it's probably negligible in terms of the size (at least in terms of the value that we get out of

Re: Type promotion in v3

2024-08-19 Thread Xianjin YE
Hey Fokko,
> Distribute all the schemas to the executors, and we have to do the lookup and comparison there.
I don’t think this would be a problem: the schema id in the DataFile should only be used in the driver’s planning phase to determine the lower/upper bounds, so no extra schema except the
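The idea of consulting a per-file schema id during planning could look roughly like the sketch below. Everything here is hypothetical: the `DataFile` class, its fields, the `schemas` lookup table and `decode_lower_bound` are illustrative names, not Iceberg APIs. The sketch assumes `int` and `long` bounds are 4- and 8-byte little-endian values and string bounds are UTF-8 bytes.

```python
import struct
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical manifest entry that also records the writer's schema id."""
    path: str
    schema_id: int
    lower_bounds: dict  # field id -> serialized lower bound (bytes)


def decode_lower_bound(data_file, field_id, schemas):
    """Decode a bound using the type the field had when the file was written."""
    written_type = schemas[data_file.schema_id][field_id]
    raw = data_file.lower_bounds[field_id]
    if written_type == "int":
        return written_type, struct.unpack("<i", raw)[0]
    if written_type == "long":
        return written_type, struct.unpack("<q", raw)[0]
    if written_type == "string":
        return written_type, raw.decode("utf-8")
    raise ValueError(f"unsupported type: {written_type}")


# Hypothetical history: field 1 was an int in schema 1 and a string in schema 2.
schemas = {1: {1: "int"}, 2: {1: "string"}}
old_file = DataFile("old.parquet", 1, {1: struct.pack("<i", 5)})
new_file = DataFile("new.parquet", 2, {1: b"5"})

print(decode_lower_bound(old_file, 1, schemas))
print(decode_lower_bound(new_file, 1, schemas))
```

In this sketch the lookup happens once per file wherever planning runs, so only the planner needs the historical schemas. Whether decoded numeric bounds remain usable after promotion to string is a separate question, since string ordering differs from numeric ordering.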

Re: Type promotion in v3

2024-08-19 Thread Fokko Driesprong
Thanks Ryan for bringing this up, that's an interesting problem, let me think about this.
> we can persist schema_id in the DataFile
This was also my first thought. The two drawbacks are:
- Distribute all the schemas to the executors, and we have to do the lookup and comparison there.
-

Re: Type promotion in v3

2024-08-19 Thread Xianjin YE
Thanks Ryan for bringing this up.
> int and long to string
Could you elaborate a bit on how we can support type promotion for `int` and `long` to `string` if the upper and lower bounds are already encoded in 4/8 bytes binary? It seems that we cannot add promotions to string as Piotr pointed o

Re: Type promotion in v3

2024-08-19 Thread Yujiang Zhong
Hi Ryan, I don't understand how the Parquet format manifests would resolve the type promotion issue here. Could you please provide more detailed information to help me understand it? Thank you.
> That would also fix type promotion because the manifest file schema would include full type info

Re: Type promotion in v3

2024-08-19 Thread Piotr Findeisen
Hi, Lack of type information in lower/upper bounds is definitely an interesting problem. For example, the 4 bytes \x31\x32\x33\x34 can be interpreted as the string "1234" or as the integer value 875770417 (stored little-endian). If the reader logic depends on the length of data in bytes, will this preve
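The ambiguity in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; it assumes the integer is a signed 4-byte little-endian value and the string is UTF-8, as the message describes.

```python
raw = b"\x31\x32\x33\x34"

# Interpreted as UTF-8 text, the bytes are the characters '1', '2', '3', '4'.
as_string = raw.decode("utf-8")

# Interpreted as a signed 4-byte little-endian integer, the same bytes
# are 0x34333231.
as_int = int.from_bytes(raw, byteorder="little", signed=True)

print(as_string)  # 1234
print(as_int)     # 875770417
```

Both readings are valid for the same four bytes, so a reader needs the original type from somewhere other than the bound itself.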