Hi, I have started the voting thread for this FLIP as well after modifying according to Roman's suggestions. If someone has some other suggestions, please don't hesitate to raise them here regardless of the ongoing vote :)
Best, Piotrek śr., 22 lis 2023 o 12:37 Piotr Nowojski <pnowoj...@apache.org> napisał(a): > Ok, good point about dropping the aggregation method from the API. We can > always add it in the future if it will ever be needed. > > Thanks for your input! > > Best, > Piotrek > > śr., 22 lis 2023 o 01:53 Roman Khachatryan <ro...@apache.org> napisał(a): > >> Thanks for clarifying, I see your points (although reporting metrics as >> spans still seems counter intuitive to me). >> >> As for the aggregation, I'm concerned that it might be unnecessarily >> ambiguous: where the aggregation is performed (JM/TM); across what >> (tasks/time); and which aggregation should be used. >> >> How about dropping it from the API and always using min, max, sum, avg? I >> think we're interested in these aggregations for all the metrics, and >> there >> is no penalty for reporting all of them because it's only for >> initialization. >> >> Regards, >> Roman >> >> On Mon, Nov 20, 2023, 8:42 AM Piotr Nowojski <pnowoj...@apache.org> >> wrote: >> >> > Hi Roman! >> > >> > > 1. why the existing MetricGroup interface can't be used? It already >> had >> > > methods to add metrics and spans ... >> > >> > That's because of the need to: >> > a) associate the spans to specifically Job's initialisation >> > b) we need to logically aggregate the span's attributes across subtasks. >> > >> > `MetricGroup` doesn't have such capabilities and it's too generic an >> > interface to introduce things like that IMO. >> > >> > Additionally for metrics: >> > c) reporting initialization measurements as metrics is a flawed concept >> as >> > described in the FLIP's-384 motivation >> > Additionally for spans: >> > d) as discussed in the FLIP's-384 thread, we don't want to report >> separate >> > spans on the TMs. At least not right now >> > >> > Also having a specialized, dedicated for initialization metrics class to >> > collect those numbers, makes the interfaces >> > more lean and more specialized. >> > >> > > 2. IIUC, based on these numbers, we're going to report span(s). >> Shouldn't >> > > the backend report them as spans? >> > >> > As discussed in the FLIP's-384, initially we don't want to report spans >> on >> > TMs. Later, optionally reporting >> > individual subtask's checkpoint/recovery spans on the JM looks like a >> > logical follow up. >> > >> > > 3. How is the implementation supposed to infer that some metric is a >> part >> > > of initialization (and make the corresponding RPC to JM?). Should the >> > > interfaces be more explicit about that? >> > >> > This FLIP proposes [1] to add >> > `CustomInitializationMetrics >> > KeyedStateBackendParameters#getCustomInitializationMetrics()` >> > accessor to the `KeyedStateBackendParameters` argument that's passed to >> > `createKeyedStateBackend(...)` >> > method. That's pretty explicit I would say :) >> > >> > > 4. What do you think about using histogram or percentiles instead of >> > > min/max/sum/avg? That would be more informative >> > >> > I would prefer to start with the simplest min/max/sum/avg, and let's >> see in >> > which direction (if any) we need to evolve >> > that. Alternative to percentiles is previously mentioned to report >> > separately each subtask's initialisation/checkpointing span. >> > >> > Best, >> > Piotrek >> > >> > [1] >> > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans#FLIP386:SupportaddingcustommetricsinRecoverySpans-PublicInterfaces >> > >> > czw., 16 lis 2023 o 15:45 Roman Khachatryan <ro...@apache.org> >> napisał(a): >> > >> > > Thanks for the proposal, >> > > >> > > Can you please explain: >> > > 1. why the existing MetricGroup interface can't be used? It already >> had >> > > methods to add metrics and spans ... >> > > >> > > 2. IIUC, based on these numbers, we're going to report span(s). >> Shouldn't >> > > the backend report them as spans? >> > > >> > > 3. How is the implementation supposed to infer that some metric is a >> part >> > > of initialization (and make the corresponding RPC to JM?). Should the >> > > interfaces be more explicit about that? >> > > >> > > 4. What do you think about using histogram or percentiles instead of >> > > min/max/sum/avg? That would be more informative >> > > >> > > I like the idea of introducing parameter objects for backend creation. >> > > >> > > Regards, >> > > Roman >> > > >> > > On Tue, Nov 7, 2023, 1:20 PM Piotr Nowojski <pnowoj...@apache.org> >> > wrote: >> > > >> > > > (Fixing topic) >> > > > >> > > > wt., 7 lis 2023 o 09:40 Piotr Nowojski <pnowoj...@apache.org> >> > > napisał(a): >> > > > >> > > > > Hi all! >> > > > > >> > > > > I would like to start a discussion on a follow up of FLIP-384: >> > > Introduce >> > > > > TraceReporter and use it to create checkpointing and recovery >> traces >> > > [1]: >> > > > > >> > > > > *FLIP-386: Support adding custom metrics in Recovery Spans [2]* >> > > > > >> > > > > This FLIP adds a functionality that will allow state backends to >> > attach >> > > > > custom metrics to the recovery/initialization traces. This >> requires >> > > > changes >> > > > > to the `@PublicEvolving` `StateBackend` API, and it will be >> initially >> > > > used >> > > > > in `RocksDBIncrementalRestoreOperation` to measure how long does >> it >> > > take >> > > > to >> > > > > download remote files and separately how long does it take to load >> > > those >> > > > > files into the local RocksDB instance. >> > > > > >> > > > > Please let me know what you think! >> > > > > >> > > > > Best, >> > > > > Piotr Nowojski >> > > > > >> > > > > [1] >> > > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces >> > > > > [2] >> > > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans >> > > > > >> > > > > >> > > > >> > > >> > >> >