[ https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662339#comment-17662339 ]
Rok Mihevc commented on ARROW-5318:
-----------------------------------

This issue has been migrated to [issue #21780|https://github.com/apache/arrow/issues/21780] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Python] pyarrow hdfs reader overrequests
> -----------------------------------------
>
>                 Key: ARROW-5318
>                 URL: https://issues.apache.org/jira/browse/ARROW-5318
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: Ivan Dimitrov
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes,
> I often get 0%-300% more data sent over the network. My suspicion is that
> pyarrow is reading ahead.
> The pyarrow parquet reader doesn't have this behavior, and I am looking for a
> way to turn off read ahead for the general HDFS interface.
> I am running on Ubuntu 14.04. This issue is present in pyarrow 0.10 - 0.13
> (newest released version). I am on Python 2.7.
> I have been using Wireshark to track the packets passed on the network.
> I suspect it is read ahead since the time for the first read is much greater
> than the time for the second read.
>
> The regular pyarrow reader:
> {code:python}
> import pyarrow as pa
>
> # hostname is the HDFS namenode host (placeholder, not defined here)
> fs = pa.hdfs.connect(hostname, driver='libhdfs')
> file_path = 'dataset/train/piece0000'
> f = fs.open(file_path)
> f.seek(0)
> n_bytes = 3000000
> f.read(n_bytes)
> {code}
>
> Parquet code without the same issue:
> {code:python}
> import pyarrow.parquet as pq
>
> parquet_path = 'dataset/train/parquet/part-22e3'
> pf = fs.open(parquet_path)
> pqf = pq.ParquetFile(pf)
> data = pqf.read_row_group(0, columns=['col_name'])
> {code}
>
>
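
The read-ahead suspicion in the description can be illustrated without an HDFS cluster. The sketch below is not pyarrow's or libhdfs's implementation; `CountingSource`, `ReadAheadReader`, `overrequest_ratio`, and the 4,000,000-byte block size are all hypothetical names and values chosen for illustration. It only shows how a reader that fetches whole blocks would transfer more bytes than requested on the first read and serve a following read from its cache.

```python
import io


class CountingSource:
    """Stand-in for a remote file: counts the bytes actually fetched."""

    def __init__(self, payload):
        self._buf = io.BytesIO(payload)
        self.bytes_fetched = 0

    def fetch(self, offset, length):
        self._buf.seek(offset)
        chunk = self._buf.read(length)
        self.bytes_fetched += len(chunk)
        return chunk


class ReadAheadReader:
    """Hypothetical reader that always fetches whole read-ahead blocks."""

    def __init__(self, source, block_size):
        self._source = source
        self._block_size = block_size
        self._cache = b""
        self._cache_start = 0
        self._pos = 0

    def read(self, n):
        end = self._pos + n
        cache_end = self._cache_start + len(self._cache)
        if self._pos < self._cache_start or end > cache_end:
            # Cache miss: fetch from the current position,
            # rounded up to a whole number of blocks.
            blocks = -(-n // self._block_size)  # ceiling division
            self._cache = self._source.fetch(self._pos, blocks * self._block_size)
            self._cache_start = self._pos
        start = self._pos - self._cache_start
        data = self._cache[start:start + n]
        self._pos += len(data)
        return data


def overrequest_ratio(requested, fetched):
    """Fraction of extra bytes transferred relative to bytes requested."""
    return (fetched - requested) / requested


if __name__ == "__main__":
    source = CountingSource(bytes(8000000))
    reader = ReadAheadReader(source, block_size=4000000)  # assumed block size

    reader.read(3000000)                # first read triggers a block fetch
    first = source.bytes_fetched
    reader.read(1000000)                # second read is served from the cache
    second = source.bytes_fetched - first

    print("fetched on first read:", first)
    print("fetched on second read:", second)
    print("overrequest ratio:", overrequest_ratio(3000000, first))
```

Under these assumptions, the first read pays for the whole block and the second read costs no network transfer, which would match the timing difference reported above. Whether the real HDFS client behaves this way has to be confirmed with the packet capture, not with this model.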