[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569152#comment-17569152
 ] 

Weston Pace commented on ARROW-16000:
-------------------------------------

For more context: a {{fragment}} is a term introduced by the datasets API.  
The goal of the datasets API is to read data from a collection of 
independently scannable fragments (in practice, a fragment usually equals a 
file).

The scanning process has its own set of options, ScanOptions (e.g. 
use_threads, batch size, projection), which are independent of the file 
format.  Each file reader has its own set of options (e.g. ReadOptions) and 
is completely unaware of any dataset scanner.

Now, pretend you want to scan a collection of CSV files with a custom delimiter 
(e.g. |).  It doesn't make sense for delimiter to be a property of scan options 
because it is specific to CSV.

As a result, we have ScanOptions::fragment_scan_options.  This is an interface 
that each format implements, and an instance of it can be attached to the scan 
to carry the format-specific settings.

So, to read a CSV file with a custom delimiter, you just create ParseOptions 
with the correct delimiter.  To read a dataset of CSV files with a custom 
delimiter you first create scan options for the scan itself, and then a 
ParseOptions with the custom delimiter, and then link them via 
ScanOptions::fragment_scan_options.
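
To make the layering concrete, here is a minimal toy model of the idea in 
plain Python.  It is only a sketch and does not use Arrow: the class names 
mirror the ones above, but the fields, the read_csv helper and the scan 
function are simplified stand-ins invented for illustration, and fragments 
are in-memory strings rather than files.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, List, Optional


class FragmentScanOptions:
    """Base type for format-specific options (toy stand-in, not Arrow's class)."""


@dataclass
class CsvFragmentScanOptions(FragmentScanOptions):
    # CSV-only setting; it has no meaning for other formats.
    delimiter: str = ","


@dataclass
class ScanOptions:
    # Format-independent setting for the scan itself.
    batch_size: int = 1024
    # Optional slot that carries the format-specific settings.
    fragment_scan_options: Optional[FragmentScanOptions] = None


def read_csv(text: str, delimiter: str = ",") -> List[List[str]]:
    """Stand-alone reader: it knows nothing about scans or datasets."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def scan(fragments: List[str], options: ScanOptions) -> Iterator[List[List[str]]]:
    """Read each fragment independently and yield its rows in batches."""
    fso = options.fragment_scan_options
    # Only the CSV-specific options object knows about delimiters.
    delimiter = fso.delimiter if isinstance(fso, CsvFragmentScanOptions) else ","
    for fragment in fragments:
        rows = read_csv(fragment, delimiter=delimiter)
        for start in range(0, len(rows), options.batch_size):
            yield rows[start:start + options.batch_size]


# Two pipe-delimited "files" scanned as one dataset.
fragments = ["a|b\n1|2\n", "a|b\n3|4\n"]
options = ScanOptions(
    batch_size=2,
    fragment_scan_options=CsvFragmentScanOptions(delimiter="|"),
)
batches = list(scan(fragments, options))
```

The point of the sketch is that ScanOptions never gains a delimiter field: 
the scan only forwards whatever the format-specific object holds to the 
reader, so other formats can plug in their own options the same way.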

> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Assignee: Joost Hoozemans
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
