Hi,

The lack of type information in the lower/upper bounds is definitely an interesting problem. For example, the 4-byte value \x31\x32\x33\x34 can be interpreted as the string "1234" or as the integer 875770417 (stored little-endian). If the reader logic depends on the length of the data in bytes, will this prevent us from adding any type promotions *to* string?
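To make the ambiguity concrete, here is a minimal Python sketch. It assumes bounds are stored as raw little-endian values for int/float/double and as plain UTF-8 bytes for strings, which is my understanding of the single-value serialization. The helper read_bound_as_double is a hypothetical name for illustration, not an Iceberg API; it only mimics the length-based check described in the quoted mail below.

```python
import struct

# The same 4 bytes, with no type tag attached (as in the manifest bounds maps).
raw = b"\x31\x32\x33\x34"

as_string = raw.decode("utf-8")          # read as a UTF-8 string
as_int = struct.unpack("<i", raw)[0]     # read as a little-endian 32-bit int
as_float = struct.unpack("<f", raw)[0]   # read as a little-endian 32-bit float

print(repr(as_string))  # '1234'
print(as_int)           # 875770417
print(as_float)         # some unrelated float; all three decodings are "valid"


def read_bound_as_double(bound: bytes) -> float:
    """Hypothetical length-based check for a column promoted float -> double.

    4 bytes are assumed to be an old float bound, 8 bytes a double bound.
    This only works while no other 4-byte or 8-byte source type can be
    promoted to double.
    """
    if len(bound) == 4:
        return struct.unpack("<f", bound)[0]
    if len(bound) == 8:
        return struct.unpack("<d", bound)[0]
    raise ValueError(f"unexpected bound length: {len(bound)}")
```

With only the length to go on, a string column that was promoted from int cannot tell a 4-byte int bound apart from a 4-character string bound, which is what prompts the question above.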
Best,
Piotr

On Sat, 17 Aug 2024 at 01:00, Ryan Blue <b...@apache.org> wrote:

> I’ve recently been working on updating the spec for new types and type
> promotion cases in v3.
>
> I was talking to Micah and he pointed out an issue with type promotion:
> the upper and lower bounds for data file columns that are kept in Avro
> manifests don’t have any information about the type that was used to encode
> the bounds.
>
> For example, when writing to a table with a float column, 4: f, the
> manifest’s lower_bounds and upper_bounds maps will have an entry with the
> type ID (4) as the key and a 4-byte encoded float for the value. If column
> f were later promoted to double, those maps aren’t changed. The way we
> currently detect that the type was promoted is to check the binary value
> and read it as a float if there are 4 bytes instead of 8. This prevents us
> from adding int to double type promotion because when there are 4 bytes
> we would not know whether the value was originally an int or a float.
>
> Several of the type promotion cases from my previous email hit this
> problem. Date/time types to string, int and long to string, and long to
> timestamp are all affected. I think the best path forward is to add fewer
> type promotion cases to v3 and support only these new cases:
>
> - int and long to string
> - date to timestamp
> - null/unknown to any
> - any to variant (if supported by the Variant spec)
>
> That list would allow us to keep using the current strategy and not add
> new metadata to track the type to our manifests. My rationale for not
> adding new information to track the bound types at the time that the data
> file metadata is created is that it would inflate the size of manifests and
> push out the timeline for getting v3 done. Many of us would like to get v3
> released to get the timestamp_ns and variant types out. And if we can get
> at least some of the promotion cases out that’s better.
>
> To address type promotion in the long term, I think that we should
> consider moving to Parquet manifests. This has been suggested a few times
> so that we can project just the lower and upper bounds that are needed for
> scan planning. That would also fix type promotion because the manifest file
> schema would include full type information for the stats columns. Given the
> complexity of releasing Parquet manifests, I think it makes more sense to
> get a few promotion cases done now in v3 and follow up with the rest in v4.
>
> Ryan
>
> --
> Ryan Blue