[ https://issues.apache.org/jira/browse/HADOOP-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012186#comment-18012186 ]
Steve Loughran commented on HADOOP-19641:
-----------------------------------------

Are you using the openFile seek policy as suggested? Parquet will tell you when it's a parquet file, and its read policy is common: 8-byte footer, real footer, row groups.

> ABFS: [ReadAheadV2] First Read should bypass ReadBufferManager
> --------------------------------------------------------------
>
>                 Key: HADOOP-19641
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19641
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>   Affects Versions: 3.4.1
>            Reporter: Anuj Modi
>            Assignee: Anuj Modi
>            Priority: Major
>              Labels: Performance
>
> We have observed across multiple workload runs that when we start
> reading data from an input stream, the first read to reach the stream has
> to be performed synchronously even if we trigger a prefetch request for that
> particular offset. Most of the time we end up doing the extra work of checking
> whether the prefetch was triggered, removing the prefetch from the pending
> queue, and then doing a direct remote read in the workload thread itself.
> To avoid all this overhead, we will always bypass read-ahead for the very
> first read of each input stream and trigger read-aheads from the second read
> onwards.
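
The behaviour proposed in the issue description can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the ABFS implementation: `RemoteReader`, `FirstReadBypassStream` and `prefetchOffsets` are hypothetical names introduced here, the "prefetch" is only recorded rather than executed, and every read still goes to the remote store synchronously so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the remote store; not a Hadoop interface. */
interface RemoteReader {
    byte[] readRemote(long offset, int length);
}

/** Hypothetical stream that skips read-ahead on its very first read. */
class FirstReadBypassStream {
    private final RemoteReader remote;
    private boolean firstReadDone = false;
    private long position = 0;

    // Offsets for which a prefetch would have been queued.
    // Recorded only for illustration; nothing is actually prefetched.
    final List<Long> prefetchOffsets = new ArrayList<>();

    FirstReadBypassStream(RemoteReader remote) {
        this.remote = remote;
    }

    byte[] read(int length) {
        if (!firstReadDone) {
            // First read: go straight to the remote store, queue nothing.
            firstReadDone = true;
        } else {
            // Second read onwards: record a prefetch for the range
            // that immediately follows the current read.
            prefetchOffsets.add(position + length);
        }
        byte[] data = remote.readRemote(position, length);
        position += data.length;
        return data;
    }
}
```

In this sketch the first call to `read` never touches the prefetch queue, so there is nothing to check or cancel before the direct remote read; only later calls register read-ahead work.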