Re: [C++] Shall we modify the ORC reader?

Ying Zhou Thu, 14 Jan 2021 15:57:46 -0800

Well, I haven’t found any. Thankfully ORC does work, and I can figure out how it 
works by testing with simple examples. However, I have never managed to contact 
the ORC community: they have never responded to any of my emails to 
d...@orc.apache.org. I do want to add Snappy write support to ORC C++ (this was 
actually done 2 years ago by someone else, but it was never merged into master 
due to a lack of unit tests; I can write the tests) and maybe Decimal256, if 
they are willing to review and merge them. If anyone has successfully contacted 
the ORC community, please let me know how.


Best,
Ying

> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou <anto...@python.org> wrote:
> 
> 
> Hi Ying,
> 
> Is there a semantic description of the ORC data types somewhere?
> I've read through https://orc.apache.org/docs/types.html and
> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> to explain the intent and constraints of each of the data types.
> 
> Regards
> 
> Antoine.
> 
> 
> 
> 
> On Mon, 11 Jan 2021 21:15:05 -0500
> Ying Zhou <yzhou7...@gmail.com> wrote:
>> Thanks! What about 3? 
>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs with 
>> fields of the structs named ‘key’ and ‘value’?
>> 
>> 
>> 
>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> 
>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
>>> reading 2gb strings or lists with 2B objects in them. Saying we just don't
>>> support that pattern seems fine for now. I also believe the string and list
>>> types have better cross-language support than the large variants.
>>> 
>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yzhou7...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> While finishing the ORC writer in C++ I found that the ORC reader treats
>>>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>>>> https://issues.apache.org/jira/browse/ARROW-11117
>>>> 
>>>> After starting to work on ORC tickets mostly filed by myself I began to
>>>> worry that the type mappings in the ORC reader might already be used by
>>>> users of Arrow. I wonder whether we should grandfather the issues or
>>>> gradually switch to a new type mapping.
>>>> 
>>>> Here are my proposed changes:
>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
>>>> instead of STRING type since it is large.
>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>>>> instead of LIST type since it is large.
>>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
>>>> list of structs with hardcoded field names as long as
>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
>>>> 
>>>> Thanks,
>>>> Ying  
>> 
>> 
> 
> 
>
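
Proposal 3 in the quoted message makes the MAP conversion conditional on the
offsets fitting into int32. A minimal sketch of that check follows, assuming
the reader has 64-bit offsets in hand and needs the 32-bit offsets used by
Arrow's MAP and LIST layouts. NarrowOffsets is a hypothetical helper written
for illustration, not an existing Arrow or ORC function, and it needs C++17
for std::optional.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical helper (not part of Arrow or ORC): narrows 64-bit offsets to
// 32-bit offsets, or returns std::nullopt if any offset does not fit.
std::optional<std::vector<int32_t>> NarrowOffsets(
    const std::vector<int64_t>& offsets) {
  constexpr int64_t kMaxOffset = std::numeric_limits<int32_t>::max();
  std::vector<int32_t> narrowed;
  narrowed.reserve(offsets.size());
  for (int64_t offset : offsets) {
    if (offset < 0 || offset > kMaxOffset) {
      // A real reader would return a non-OK status at this point.
      return std::nullopt;
    }
    narrowed.push_back(static_cast<int32_t>(offset));
  }
  return narrowed;
}
```

A caller would build the map array only when the result holds a value, and
report an error otherwise, which matches "Otherwise we shouldn't return OK."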
