>
> At the same time, I also agree with Micah's point that we should make sure
> we analyze the implications of this particular int/long -> string promotion
> on sort order(s), puffin and default values and any special casing this may
> introduce for those capabilities.
One other area I forgot wa
Just caught up on this discussion, thanks all for the insights!
I largely agree with Ryan's point that read-time transformation is
different from a normal type promotion even though the read-time
transformation may seem logically equivalent. The compelling part of
leveraging the upper/lower bounds
Thanks for the discussion, everyone. I think the back and forth between
Fokko and Micah helped me understand Micah's position more clearly. I can see
how some of the challenges that I raised would be solved, like moving the
previous field into the metadata of the transformation.
I agree with a lot o
Thanks Dan,
I'm pretty strongly opposed to the idea of assigning new field ids as part
> of type promotion. I understand what we're trying to accomplish, but I
> just don't think that's the right mechanism to achieve it. The field id
> specifically identifies the field and shouldn't change as
>
I'm pretty strongly opposed to the idea of assigning new field ids as part
of type promotion. I understand what we're trying to accomplish, but I
just don't think that's the right mechanism to achieve it. The field id
specifically identifies the field and shouldn't change as
attributes change (na
Hi Fokko,
> In this case, we still need to keep the schemas. As an example:
The example you gave is close to what I was imagining (if we get to details
I might have a slightly different organization). This might be semantic,
but I don't see this as keeping "schemas", since all data is present i
Hi Micah,
> I think the idea with Parquet files is one would no longer use a map to track
> these statistics but instead have a column per field-id/statistics pair.
> ….
> This is similar to how partition values are stored today in Avro. And I
> don’t think there is anything stopping from doi
>
> Yes, I was thinking it would be a recursive structure that tracked each
> change. Cleanup could be achieved by also tracking schema ID of the last
> time the field was present along with the schema ID of the written data
> files in manifests (as discussed on the other threads), and cleaning up
Hi Xiangjin,
> Could you elaborate a bit more on how the Parquet manifest would fix the
> type promotion problem? If the lower and upper bounds are still a map of
> , I don't think we can perform column pruning on that, and the
> type information of the stat column is still missing.
I think the i
>
> If we go with the approach that type promotion results in a change in the
> field-id, what happens when a certain field has been changed
> multiple times? Does it mean that we end up with tracking the lineage of
> field change history?
Yes, I was thinking it would be a recursive structure that tracked each change.
Hey Ryan,
Thanks for the reply, it clears most things up. Some responses inline:
> This ends up being a little different because we can detect more cases
> when the bounds must have been strings — any time when the length of the
> upper and lower bound is different. Because strings tend to have longe
Hi Micah,
If we go with the approach that type promotion results in a change in the
field-id, what happens when a certain field has been changed
multiple times? Does it mean that we end up with tracking the lineage of
field change history?
Thanks,
Gang
On Tue, Aug 20, 2024 at 7:34 AM Micah Kornf
Hi Ryan,
Thanks for the reply, responses inline
>
>- How do we keep track of the replaced column? Does it remain in the
>schema? Either we would need to keep the old schemas or implement a new
>“hidden” column state
>
> I don't think this is the case, the function metadata provides al
I don’t think that type promotion by replacing a column is a good direction
to head. Right now we have a fairly narrow problem of not having the
original type information for stats. That’s a problem with a fairly simple
solution in the long term and it doesn’t require the added complexity of
replac
If the reader logic depends on the length of data in bytes, will this
prevent us from adding any type promotions to string?
This ends up being a little different because we can detect more cases when
the bounds must have been strings — any time when the length of the upper
and lower bound is different.
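The length-based check described above can be sketched as follows. This is a minimal sketch with hypothetical helper names, not the Iceberg API; it only assumes that bounds arrive as raw bytes, that an int bound is always 4 bytes, and that a long bound is always 8:

```python
# Sketch, assuming serialized bounds are raw bytes (hypothetical helpers,
# not the Iceberg API). Since int/long bounds are fixed-width, two bounds
# of differing lengths can only have been written as strings.

def bounds_must_be_string(lower: bytes, upper: bytes) -> bool:
    """Differing lengths rule out a fixed-width int/long encoding."""
    return len(lower) != len(upper)

def bounds_ambiguous(lower: bytes, upper: bytes, promoted_from: str) -> bool:
    """Equal-length bounds matching the old type's width remain ambiguous:
    they could be the original int/long, or strings of that same length."""
    width = {"int": 4, "long": 8}[promoted_from]
    return len(lower) == len(upper) == width
```

This also shows the limit of the heuristic: equal-length bounds of exactly 4 (or 8) bytes cannot be disambiguated by length alone.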
I think continuing to define type promotion as something that happens
implicitly from the reader perspective has a few issues:
1. It makes it difficult to reason about all additional features that
might require stable types to interpret. Examples of existing filters:
partition statistics file, e
Hey all,
> There might be an easy/light way to add this new metadata: we can persist
> schema_id in the DataFile. It still adds some extra size to the manifest
> file but should be negligible?
I do think it's probably negligible in terms of the size (at least in terms
of the value that we get out of
Hey Fokko,
> Distribute all the schemas to the executors, and we have to do the lookup and
> comparison there.
I don’t think this would be a problem: the schema id in the DataFile should
only be used in the driver’s planning phase to determine the lower/upper bounds, so no
extra schema except the
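A minimal sketch of that idea (hypothetical structures, not the actual Iceberg metadata classes): during planning, the driver uses the DataFile's schema_id to recover the type the field had when the file was written, and decodes the bounds accordingly:

```python
import struct

# schema_id -> {field_id: type at write time}; hypothetical shapes, for
# illustration only. Field 1 was an int in schema 1, then promoted to
# string in schema 2.
schemas = {
    1: {1: "int"},
    2: {1: "string"},
}

def decode_lower_bound(file_schema_id: int, field_id: int, raw: bytes):
    """Decode a serialized lower bound using the writer-time type."""
    written_type = schemas[file_schema_id][field_id]
    if written_type == "int":
        return struct.unpack("<i", raw)[0]  # 4-byte little-endian int
    return raw.decode("utf-8")              # string bound

# The same bytes decode differently depending on the writer's schema:
decode_lower_bound(1, 1, b"1234")  # int written under schema 1
decode_lower_bound(2, 1, b"1234")  # string written under schema 2
```

Since this lookup happens on the driver during planning, executors never need the schema map, which is the point being made above.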
Thanks Ryan for bringing this up, that's an interesting problem, let me
think about this.
we can persist schema_id in the DataFile
This was also my first thought. The two drawbacks are:
- Distribute all the schemas to the executors, and we have to do the
lookup and comparison there.
-
Thanks Ryan for bringing this up.
> int and long to string
Could you elaborate a bit on how we can support type promotion for `int` and
`long` to `string` if the upper and lower bounds are already encoded as 4- or
8-byte binary values? It seems that we cannot add promotions to string, as
Piotr pointed out.
Hi Ryan,
I don't understand how the Parquet format manifests would resolve the type
promotion issue here. Could you please provide more detailed information to
help me understand it? Thank you.
> That would also fix type promotion because the manifest file schema would
> include full type info
Hi,
Lack of type information in the lower/upper bounds is definitely an
interesting problem.
For example, the 4-byte value \x31\x32\x33\x34 can be interpreted as the
string "1234" or as the integer value 875770417 (stored little-endian).
If the reader logic depends on the length of data in bytes, will this
prevent us from adding any type promotions to string?
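The ambiguity in Piotr's example can be checked directly with the Python standard library: the same 4 bytes decode cleanly as both a string and a little-endian 32-bit integer.

```python
import struct

raw = b"\x31\x32\x33\x34"

as_string = raw.decode("utf-8")       # interpreted as a UTF-8 string bound
as_int = struct.unpack("<i", raw)[0]  # interpreted as little-endian int32

print(as_string)  # 1234
print(as_int)     # 875770417
```

Nothing in the serialized bytes themselves distinguishes the two interpretations; only the field's type at write time does.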