[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

Antoine Pitrou (Jira) Fri, 15 Jul 2022 12:17:08 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567364#comment-17567364
 ]


Antoine Pitrou commented on ARROW-16000:
----------------------------------------

I would want to know first if there's actual contention due to the Python GIL 
and/or interpreter overhead.

In C++ the basic building block is {{TransformInputStream}}: 
https://arrow.apache.org/docs/cpp/api/io.html#transforming-input-wrapper . It 
should be easy for a normally skilled C++ developer to use it to wrap their 
transcoding library of choice (some might want to use ICU, others libiconv, 
etc.).
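
For Latin-1 specifically, the transcoding step itself is small enough to write by hand. The following is a minimal sketch using only the standard library; the function name {{Latin1ToUtf8}} is made up for illustration and is not part of Arrow, and hooking it into a {{TransformInputStream}} callback is not shown here.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not an Arrow API): convert ISO-8859-1 bytes to UTF-8.
// Each Latin-1 byte equals the Unicode code point of the same value, so bytes
// below 0x80 are copied as-is and the rest become two-byte UTF-8 sequences.
inline std::string Latin1ToUtf8(std::string_view input) {
  std::string out;
  out.reserve(input.size() * 2);  // worst case: every byte expands to two
  for (unsigned char byte : input) {
    if (byte < 0x80) {
      out.push_back(static_cast<char>(byte));
    } else {
      out.push_back(static_cast<char>(0xC0 | (byte >> 6)));    // lead byte
      out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));  // continuation
    }
  }
  return out;
}
```

Because each input byte is converted independently, a chunk-wise transform needs no state carried across chunk boundaries. Multi-byte source encodings do not have that property, which is where a library such as ICU or libiconv becomes useful.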

I think it would be ideal if we offered an optional header-only helper that would wrap 
ICU in a {{TransformInputStream}}, without actually requiring ICU to be present 
when compiling Arrow. Perhaps this can be done through templates?
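
To illustrate the template idea, here is a rough sketch under stated assumptions. {{MakeChunkTransform}}, the {{Convert}} method and {{AsciiOnlyTranscoder}} are all hypothetical names, and plain {{std::string}} stands in for Arrow buffers. The point is only that the header never names ICU; the backend type is supplied by whoever instantiates the template.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical header-only adapter: the transcoding backend is a template
// parameter, so this code compiles without any transcoding library present.
// The backend is assumed to expose `std::string Convert(const std::string&)`.
template <typename Transcoder>
std::function<std::string(const std::string&)> MakeChunkTransform(
    Transcoder transcoder) {
  return [t = std::move(transcoder)](const std::string& chunk) mutable {
    return t.Convert(chunk);
  };
}

// Stand-in backend for illustration only: replaces non-ASCII bytes with '?'.
// A real backend would wrap ICU, libiconv, or similar in user code.
struct AsciiOnlyTranscoder {
  std::string Convert(const std::string& chunk) {
    std::string out = chunk;
    for (char& c : out) {
      if (static_cast<unsigned char>(c) >= 0x80) c = '?';
    }
    return out;
  }
};
```

A user who has ICU available would write their own backend type with the same {{Convert}} shape and pass it to the adapter, so the ICU dependency stays in their translation unit.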

Also, datasets would perhaps need to grow a dedicated configuration option to wrap 
all input streams.


> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
