The simplest way to do this sort of paging today would be to create
multiple files and then you could read as few or as many files as you want.
This approach also works regardless of format.
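
For example (a rough sketch in Python, assuming the table has already been
written out as several feather files named part-0.feather, part-1.feather,
and so on), each server can open only the files it is responsible for:

    import pyarrow.dataset as ds

    # This worker handles the second half of a four-file layout.
    files = ["data/part-2.feather", "data/part-3.feather"]
    dataset = ds.dataset(files, format="feather")
    table = dataset.to_table()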

With parquet/orc you can create multiple row groups / stripes within a
single file, and then partition amongst those.
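
For example (rough sketch, assuming a parquet file that was written with
several row groups, e.g. via pq.write_table(..., row_group_size=...)), each
worker can read only the row groups assigned to it:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data/table.parquet")
    print(pf.num_row_groups)  # how many row groups are available
    # Read only this worker's share of the row groups.
    table = pf.read_row_groups([2, 3])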

Scan-based skipping should be possible for parquet and orc; I think support
was recently added to the parquet reader for this.  However, there is not
yet support for this in datasets.

> or simply wants to load the last 500 records.

A word of caution here.  Row-based skipping in the file reader is good for
partitioning but not always good for a "last 500 records".  This is because
skipping at the file level (in the scan) happens before any filters are
applied.  So, for example, if a user asks for "the last 500 records with
filter x > 0" then that might not be the same thing as "the last 500
records in my file".  That sort of LIMIT operation often has to be applied
in memory and then only pushed down into the scan if there are no filters
(and no sorts).
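
In other words, something like this (rough sketch, assuming a pyarrow
dataset and a column named "x") is usually what the user actually means:

    import pyarrow.dataset as ds
    import pyarrow.compute as pc

    dataset = ds.dataset("data/", format="feather")
    # The filter can be pushed into the scan, but the "last 500" has to
    # be applied in memory, after the filter has run.
    filtered = dataset.to_table(filter=pc.field("x") > 0)
    last_500 = filtered.slice(max(filtered.num_rows - 500, 0))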

-Weston

On Fri, Jun 2, 2023 at 5:45 AM Wenbo Hu <huwenbo1...@gmail.com> wrote:

> Hi,
>     I'm trying to implement a data management system in Python with
> Arrow Flight. The well-designed dataset API, together with filesystem
> support, makes data management even simpler.
>     But I'm facing a situation: reading a range from a dataset.
> Consider a dataset stored in feather format with 1 million rows on a
> remote file system (e.g. s3): a client connects to multiple flight
> servers to load data in parallel (e.g. 2 servers, one do_get from the
> head to the middle, the other do_get from the middle to the end), or
> simply wants to load the last 500 records.
>     At this point, the server needs to skip reading the leading
> records, for reasons of network bandwidth and memory limitations,
> rather than transferring them, loading them into memory, and then
> discarding them.
>     I think modern storage formats may have an advantage over csv in
> determining the position of a specific range of records in the file,
> since csv has to move line by line without indexes. Also, I found
> fragment-related APIs in the dataset, but not much documentation on
> them (maybe more related to partitioning?).
>     Here is the proposal: add "limit and offset" to ScannerOption, to
> the available compute functions, and to Acero, since it is a very
> common operation in SQL as well.
>     But I realize that implementing only "limit and offset" compute
> functions would have little effect on my situation, since the arrow
> compute functions accept arrays/scalars as input, by which point the
> loading has already happened. "Limit and offset" in the ScannerOption
> of the dataset may need a dedicated implementation rather than
> directly calling compute to filter. Furthermore, Acero may also
> benefit from this feature for the scan sink.
>     Or are there any other ideas for this situation?
> --
> ---------------------
> Best Regards,
> Wenbo Hu,
>
