[ https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567364#comment-17567364 ]
Antoine Pitrou commented on ARROW-16000:
----------------------------------------

I would first want to know whether there is actual contention due to the Python GIL and/or interpreter overhead.

In C++ the basic building block is {{TransformInputStream}}: https://arrow.apache.org/docs/cpp/api/io.html#transforming-input-wrapper . It should be easy for a normally skilled C++ developer to use it to wrap their transcoding library of choice (some might want to use ICU, others libiconv, etc.).

I think it would be ideal if we offered an optional header-only helper that wraps ICU in a {{TransformInputStream}}, without actually requiring ICU to be present when compiling Arrow. Perhaps this could be done through templates?

Also, datasets perhaps need to grow a dedicated configuration option to wrap all input streams.

> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with
> Latin-1 encoding. I had a look through the docs for the Dataset API and I
> don't think this is currently supported.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
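
For illustration only, here is a minimal sketch of the kind of transcoding step the comment suggests wrapping in a {{TransformInputStream}}. It covers just the Latin-1 to UTF-8 case from the issue title and uses neither ICU nor libiconv. The function name {{Latin1ToUtf8}} is hypothetical and not part of Arrow; the glue that would copy data between Arrow buffers and this function is deliberately left out.

```cpp
#include <string>

// Hypothetical helper (not an Arrow API): transcode ISO-8859-1 (Latin-1)
// bytes to UTF-8. Every Latin-1 byte maps to the Unicode code point with
// the same value, so bytes below 0x80 are copied unchanged and bytes from
// 0x80 to 0xFF become a two-byte UTF-8 sequence.
std::string Latin1ToUtf8(const std::string& input) {
  std::string out;
  out.reserve(input.size() * 2);  // worst case: every byte expands to two
  for (unsigned char c : input) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      // Lead byte 110000xx carries the top two bits of the code point,
      // continuation byte 10xxxxxx carries the low six bits.
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}
```

Because Latin-1 is a single-byte encoding, a chunk boundary can never split a character, so a stateless per-chunk function like this one is enough for this particular case. Multi-byte source encodings would need to carry state between chunks, which is where a library such as ICU or libiconv would come in.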