[jira] [Commented] (ARROW-5318) [Python] pyarrow hdfs reader overrequests

Rok Mihevc (Jira) Tue, 10 Jan 2023 23:51:30 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662339#comment-17662339
 ]


Rok Mihevc commented on ARROW-5318:
-----------------------------------

This issue has been migrated to [issue 
#21780|https://github.com/apache/arrow/issues/21780] on GitHub. Please see the 
[migration documentation|https://github.com/apache/arrow/issues/14542] for 
further details.

> [Python] pyarrow hdfs reader overrequests  
> -------------------------------------------
>
>                 Key: ARROW-5318
>                 URL: https://issues.apache.org/jira/browse/ARROW-5318
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: Ivan Dimitrov
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> I am using pyarrow's HDFS filesystem interface (pa.hdfs.connect). When I read 
> n bytes, I often see 0% to 300% more data than requested sent over the 
> network. My suspicion is that pyarrow is reading ahead.
> The pyarrow Parquet reader doesn't have this behavior, and I am looking for a 
> way to turn off read-ahead for the general HDFS interface.
> I am running on Ubuntu 14.04 with Python 2.7. This issue is present in 
> pyarrow 0.10 - 0.13 (the newest released version).
> I have been using Wireshark to track the packets passed on the network.
> I suspect read-ahead because the first read takes much longer than the 
> second read.
>  
> The regular pyarrow reader:
> {code:python}
> import pyarrow as pa
> # hostname is the HDFS namenode host, defined elsewhere
> fs = pa.hdfs.connect(hostname, driver='libhdfs')
> file_path = 'dataset/train/piece0000'
> f = fs.open(file_path)
> f.seek(0)
> n_bytes = 3000000
> f.read(n_bytes)  # more than n_bytes is observed on the network
> {code}
>  
> Parquet code without the same issue:
> {code:python}
> import pyarrow.parquet as pq
> parquet_path = 'dataset/train/parquet/part-22e3'
> pf = fs.open(parquet_path)
> pqf = pq.ParquetFile(pf)
> # Read a single column from the first row group
> data = pqf.read_row_group(0, columns=['col_name'])
> {code}
>  
>  
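
The timing comparison described in the report (first read versus second read) can be sketched roughly as follows. This is a minimal illustration, not code from the issue: time_reads is a hypothetical helper, the snippet assumes Python 3 (the reporter used Python 2.7), and an in-memory io.BytesIO stands in for the handle returned by fs.open so that it runs without a cluster.

```python
import io
import time


def time_reads(f, n_bytes, n_reads=2):
    """Read n_bytes from f n_reads times in a row.

    Returns a list of (elapsed_seconds, bytes_returned), one entry per read.
    """
    results = []
    for _ in range(n_reads):
        start = time.perf_counter()
        data = f.read(n_bytes)
        elapsed = time.perf_counter() - start
        results.append((elapsed, len(data)))
    return results


# Stand-in for the HDFS file handle (assumption: any object with .read works).
f = io.BytesIO(b"\0" * 10_000_000)
timings = time_reads(f, 3_000_000)
for i, (elapsed, size) in enumerate(timings, start=1):
    print("read %d: %d bytes in %.6f s" % (i, size, elapsed))
```

With a real HDFS handle, a markedly slower first read would be consistent with read-ahead but would not prove it; a packet capture is still needed to measure how many bytes actually cross the network.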



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
