I don't think proposals 1 and 2 make sense. I don't think many users are reading 2 GB strings or lists with 2 billion objects in them. Saying we just don't support that pattern seems fine for now. I also believe the string and list types have better cross-language support than the large variants.
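To make the limit concrete: the Arrow STRING and LIST types store offsets as signed 32-bit integers, so the cumulative length in a single array cannot exceed 2^31 - 1. Here is a rough sketch of the kind of check proposal 3 implies. It is plain Python, not the actual ORC adapter code, and the helper name `fits_int32_offsets` is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit-offset Arrow array can hold


def fits_int32_offsets(lengths):
    """Return True if the running total of value lengths stays within int32.

    `lengths` is an iterable of per-value sizes (bytes for strings,
    element counts for lists). Hypothetical helper, for illustration only.
    """
    total = 0
    for n in lengths:
        total += n
        if total > INT32_MAX:
            return False  # a reader would have to report an error here
    return True


if __name__ == "__main__":
    print(fits_int32_offsets([10, 20, 30]))      # small column: fits
    print(fits_int32_offsets([2**30, 2**30]))    # total is 2**31: does not fit
```

Only data that fails a check like this would need the large variants, which is why refusing that case seems acceptable to me for now.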
On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yzhou7...@gmail.com> wrote:
> Hi,
>
> While finishing the ORC writer in C++ I found that the ORC reader treats
> certain types in rather awkward ways. Hence I filed this Jira ticket:
> https://issues.apache.org/jira/browse/ARROW-11117
>
> After starting to work on ORC tickets mostly filed by myself I began to
> worry that the type mappings in the ORC reader might already be used by
> users of Arrow. I wonder whether we should grandfather the issues or
> gradually switch to a new type mapping.
>
> Here are my proposed changes:
> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
> instead of STRING type since it is large.
> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
> instead of LIST type since it is large.
> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
> list of structs with hardcoded field names as long as
> the offsets fit into int32. Otherwise we shouldn't return OK.
>
> Thanks,
> Ying