[jira] [Commented] (HADOOP-19364) S3A Analytics-Accelerator: Add IoStatistics support

Steve Loughran (Jira) Fri, 10 Jan 2025 09:21:04 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912046#comment-17912046
 ]


Steve Loughran commented on HADOOP-19364:
-----------------------------------------

This will be very useful for testing and for production diagnostics. I would 
suggest adding the initial values as soon as possible and expanding them as 
the work progresses.

These are an effective way to expose state (is this parsed as an analytics 
file?) and performance (successful prefetch count), both of which can be 
aggregated across entire jobs.


Proposed duration trackers for all remote/slow operations:
* Time to fetch footer
* Time to parse footer
* If a whole file is being read: time to GET it.
* Duration of prefetches.
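
As a rough illustration of what a duration tracker for the operations above records, here is a minimal, stdlib-only sketch. It is not the Hadoop IOStatistics API: the class DurationStats, its methods, and the statistic names are all hypothetical and exist only for this example. It counts invocations and accumulates elapsed time per named operation, including when the operation fails.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical stand-in for a statistics store; NOT the Hadoop IOStatistics API. */
class DurationStats {
  // Illustrative statistic names (assumptions, not real Hadoop statistic keys).
  static final String FOOTER_FETCH = "footer_fetch";
  static final String FOOTER_PARSE = "footer_parse";

  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> nanos = new ConcurrentHashMap<>();

  /** Run the operation; record one invocation and its elapsed time, even on failure. */
  <T> T track(String name, Supplier<T> operation) {
    long start = System.nanoTime();
    try {
      return operation.get();
    } finally {
      long elapsed = System.nanoTime() - start;
      counts.computeIfAbsent(name, k -> new LongAdder()).increment();
      nanos.computeIfAbsent(name, k -> new LongAdder()).add(elapsed);
    }
  }

  /** Number of times the named operation was tracked; 0 if never. */
  long count(String name) {
    LongAdder adder = counts.get(name);
    return adder == null ? 0L : adder.sum();
  }

  /** Total elapsed nanoseconds across all tracked invocations; 0 if never. */
  long totalNanos(String name) {
    LongAdder adder = nanos.get(name);
    return adder == null ? 0L : adder.sum();
  }
}
```

A caller would wrap each slow call, for example `stats.track(DurationStats.FOOTER_FETCH, () -> fetchFooter())`, where `fetchFooter()` is a placeholder for whatever performs the remote read.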

Counters
* number of reads satisfied by prefetched blocks
* parse errors (for aggregation)
* unsatisfied reads
* prefetch blocks discarded without being used
* number of prefetch blocks used more than once
* maybe: number of adjacent blocks? This could hint that sizes are too low.
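
To make the aggregation point concrete, the sketch below shows how per-stream counters like those listed above could be summed into a job-wide total. It uses only the Java standard library. The class PrefetchCounters and the counter names are hypothetical, chosen for this example rather than taken from Hadoop or analytics-accelerator-s3.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-stream counter set; NOT the Hadoop IOStatistics API. */
class PrefetchCounters {
  // Illustrative counter names (assumptions, not real statistic keys).
  static final String READS_FROM_PREFETCH = "reads_satisfied_by_prefetch";
  static final String UNSATISFIED_READS = "unsatisfied_reads";
  static final String BLOCKS_DISCARDED_UNUSED = "prefetch_blocks_discarded_unused";

  private final Map<String, Long> counters = new TreeMap<>();

  /** Add one to the named counter, creating it if needed. */
  void increment(String name) {
    counters.merge(name, 1L, Long::sum);
  }

  /** Current value of the named counter; 0 if it was never incremented. */
  long get(String name) {
    return counters.getOrDefault(name, 0L);
  }

  /** Sum another stream's counters into this one, e.g. to build a job-wide total. */
  void aggregate(PrefetchCounters other) {
    other.counters.forEach((name, value) -> counters.merge(name, value, Long::sum));
  }
}
```

Because counters only ever grow and combine by addition, totals from many streams can be merged in any order, which is what makes them convenient for filesystem-level or job-level reporting.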


Gauges (using 0/1 as true/false)
* is analytics file with supported format (parquet,...)?
* successfully parsed?
* has parquet v3 footer?


Really counters, but using these stat groups supports aggregation across fs/job
* actual footer size
* file size
* structure of file (columns, rowgroups, sizes?)


> S3A Analytics-Accelerator: Add IoStatistics support
> ---------------------------------------------------
>
>                 Key: HADOOP-19364
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19364
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Ahmar Suhail
>            Priority: Major
>
> S3A provides InputStream statistics: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/statistics/S3AInputStreamStatistics.java]
> This helps track things like how many bytes were read from a stream etc. 
>  
> The current integration does not implement statistics. To start off 
> with, we should identify which of these statistics make sense for us to track 
> in the new stream. Some examples are:
>  
> 1/ bytesRead
> 2/ readOperationStarted
> 3/ initiateGetRequest
>  
> Some of these (1 and 2) are more straightforward and should not require any 
> changes to analytics-accelerator-s3, but tracking GET requests will require 
> such changes. 
> We should also add tests that make assertions on these statistics. See 
> ITestS3APrefetchingInputStream for an example of how to do this. 
> And see https://issues.apache.org/jira/browse/HADOOP-18190 for how this was 
> done on the prefetching stream, and PR: 
> https://github.com/apache/hadoop/pull/4458



