[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

Joost Hoozemans (Jira) Wed, 20 Jul 2022 06:03:19 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569009#comment-17569009
 ]


Joost Hoozemans commented on ARROW-16000:
-----------------------------------------

Thanks everyone for the advice. What makes CsvFragmentScanOptions the preferred 
place over csv.ReadOptions? Right now CsvFragmentScanOptions doesn't directly 
store any properties itself; it only carries a csv.ConvertOptions and a 
csv.ReadOptions. Compression and encoding also sound like properties of a whole 
file, not of a fragment (although I don't know whether that is what Fragment 
means here).

Would it make sense, as a first attempt, for me to add a TransformInputStream 
to CsvFragmentScanOptions or ReadOptions? Then I can create one in Python the 
same way read_csv does (with MakeTransformInputStream, with a callback into the 
encode/decode functions of a Python library), and we can see whether there is a 
performance problem. Later we could add functionality that creates a 
TransformInputStream in C++ with a callback to some external library, as 
Antoine suggested.
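
To make the callback idea concrete, here is a minimal sketch of the kind of 
transform such a callback could apply to each chunk read from the file. It uses 
only the Python standard library and is not Arrow's API: make_transcoder and 
transcode_chunks are hypothetical names for illustration, and the assumption is 
that the stream wrapper hands the callback successive byte chunks.

```python
import codecs


def make_transcoder(src_encoding, dest_encoding="utf-8"):
    """Return a callback that re-encodes successive byte chunks.

    Hypothetical helper: incremental codecs keep state between calls, so a
    multi-byte character split across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(src_encoding)()
    encoder = codecs.getincrementalencoder(dest_encoding)()

    def transform(chunk, final=False):
        # Decode from the source encoding, then re-encode for the CSV reader.
        text = decoder.decode(chunk, final)
        return encoder.encode(text, final)

    return transform


def transcode_chunks(chunks, src_encoding):
    """Yield the chunks re-encoded as UTF-8 (assumed target encoding)."""
    transform = make_transcoder(src_encoding)
    for chunk in chunks:
        yield transform(chunk)
    # Flush any bytes still buffered by the decoder at end of input.
    yield transform(b"", final=True)
```

For Latin-1 every byte is one character, so the buffering only matters for 
multi-byte source encodings, but going through the incremental codecs keeps the 
callback generic. The per-chunk Python call is also where I would expect any 
performance problem to show up.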

> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Assignee: Joost Hoozemans
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
