It probably makes sense to make this option configurable. I think it is OK to change the default to use Maps. My guess is the initial ORC implementation predated having a Map type in the specification.
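
A rough sketch of what a configurable mapping could look like, under stated assumptions: this is plain Python for illustration only, not the Arrow C++ ORC adapter or any existing Arrow API. The names `OrcReadOptions`, `maps_as_arrow_map`, `arrow_type_for_orc_map`, and `build_map_offsets` are hypothetical, and the returned type strings are descriptive labels rather than real Arrow type objects. The int32 check mirrors the condition in proposal 3 of the quoted thread.

```python
from dataclasses import dataclass
from typing import Iterable, List

INT32_MAX = 2**31 - 1  # largest offset a 32-bit offset buffer can hold


@dataclass(frozen=True)
class OrcReadOptions:
    """Hypothetical reader options; not part of any real Arrow API."""
    # Assumed default: read ORC MAP columns as Arrow MAP.
    maps_as_arrow_map: bool = True


def arrow_type_for_orc_map(key_type: str, value_type: str,
                           options: OrcReadOptions) -> str:
    """Return a descriptive label for the Arrow type an ORC MAP maps to."""
    if options.maps_as_arrow_map:
        return f"map<{key_type}, {value_type}>"
    # Legacy behaviour: list of structs with hardcoded field names.
    return f"list<struct<key: {key_type}, value: {value_type}>>"


def build_map_offsets(entry_counts: Iterable[int]) -> List[int]:
    """Accumulate per-row entry counts into offsets, failing past int32."""
    offsets = [0]
    for count in entry_counts:
        next_offset = offsets[-1] + count
        if next_offset > INT32_MAX:
            # Corresponds to "otherwise we shouldn't return OK".
            raise OverflowError("map offsets do not fit into int32")
        offsets.append(next_offset)
    return offsets
```

With the assumed default, `arrow_type_for_orc_map("string", "int64", OrcReadOptions())` yields the map label; passing `OrcReadOptions(maps_as_arrow_map=False)` keeps the list-of-structs mapping for anyone who depends on it.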

On Thu, Jan 28, 2021 at 9:28 AM Ying Zhou <yzhou7...@gmail.com> wrote:

> Hi,
>
> Really thanks Deepak!
>
> I really want to edit the ORC reader to read ORC MAPs as Arrow MAPs now
> and it’s not a serious hassle to do so. Is there anyone who needs the
> read-ORC-maps-as-lists-of-structs functionality? If not I will do it likely
> in my current PR.
>
> Ying
>
>
> On Jan 19, 2021, at 8:45 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> >
> > Hi Ying,
> >
> > I can help review/merge any ORC C++ contributions.
> >
> >
> > On Thu, Jan 14, 2021 at 6:57 PM Ying Zhou <yzhou7...@gmail.com> wrote:
> >
> >> Well, I haven’t found any. Thankfully ORC does work and I can figure out
> >> how it works by testing using simple examples. However I have never
> >> managed to contact the ORC community at all. They have never responded
> >> to any of my emails to d...@orc.apache.org. I do want to add write
> >> Snappy support (which was actually already done 2 years ago by someone
> >> else but due to lack of unit testing it was never merged into master.
> >> I can write the tests.) and maybe Decimal256 to ORC C++ if they are
> >> willing to review and merge them. If anyone has successfully contacted
> >> the ORC community please let me know how.
> >>
> >> Best,
> >> Ying
> >>
> >>> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>>
> >>> Hi Ying,
> >>>
> >>> Is there a semantic description of the ORC data types somewhere?
> >>> I've read through https://orc.apache.org/docs/types.html and
> >>> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> >>> to explain the intent and constraints of each of the data types.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> On Mon, 11 Jan 2021 21:15:05 -0500
> >>> Ying Zhou <yzhou7...@gmail.com> wrote:
> >>>> Thanks! What about 3?
> >>>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs
> >>>> with fields of the structs named ‘key’ and ‘value’?
> >>>>
> >>>>
> >>>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> >>>>>
> >>>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
> >>>>> reading 2gb strings or lists with 2B objects in them. Saying we just
> >>>>> don't support that pattern seems fine for now. I also believe the
> >>>>> string and list types have better cross-language support than the
> >>>>> large variants.
> >>>>>
> >>>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yzhou7...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> While finishing the ORC writer in C++ I found that the ORC reader
> >>>>>> treats certain types in rather awkward ways. Hence I filed this Jira
> >>>>>> ticket: https://issues.apache.org/jira/browse/ARROW-11117
> >>>>>>
> >>>>>> After starting to work on ORC tickets mostly filed by myself I began
> >>>>>> to worry that the type mappings in the ORC reader might already be
> >>>>>> used by users of Arrow. I wonder whether we should grandfather the
> >>>>>> issues or gradually switch to a new type mapping.
> >>>>>>
> >>>>>> Here are my proposed changes:
> >>>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING
> >>>>>> type instead of STRING type since it is large.
> >>>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
> >>>>>> instead of LIST type since it is large.
> >>>>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead
> >>>>>> of list of structs with hardcoded field names as long as
> >>>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ying
> >
> > --
> > regards,
> > Deepak Majeti