Surya Hebbar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21683 )

Change subject: IMPALA-13304: Include aggregate instance-level metrics within 
experimental profile(V2)
......................................................................


Patch Set 8:

Also, thank you for prompting me regarding the missing events case.

Please consider this example, to see some limitations and complications arising 
from the current aggregation mechanism(i.e. 'labels', 'label_idxs', 
'timestamps').

The events are streamed from instances in this exact order.

Instance 1 -
"Open Started" - 2280ms
"Open Finished" - 3270ms
"Last Batch Returned" - 3470ms
"Closed" - 4100ms

Instance 2 -
"Open Started" - 1280ms
"Open Finished" - 1330ms
"First Batch Requested" - 1370ms
"Closed" - 3100ms

Result of aggregation -

labels -
"Open Started" - 0, "Open Finished" - 1, "Last Batch Returned" - 2, "Closed" - 
3, "First Batch Requested" - 4

label_idxs -
0 1 2 3
0 1 4 3

timestamps -
2280 3270 3470 4100
1280 1330 1370 3100

Note : The resulting 'labels', 'timestamps' and 'label_idxs' will be consistent 
as 'SortEvents' is being called before delivering events.

The proper alignment of timestamps (with zeros substituted) is -
(possible to generate using 'label_idxs')
2280 3270 3470 4100 0
1280 1330 0    3100 1370

The proper order and alignment of timestamps(with zeros substituted) is -
(functionally not possible to generate, even after using 'labels'(map), 
'label_idxs' and 'timestamps')
2280 3270 0     3470 4100
1280 1330 1370  0    3100
(note: identifying the order of events, through comparison between timestamps 
across instances is wrong)

In this manner, if the first few instances responsible for generating 'labels' 
map have missing events, it is functionally not possible to extract the perfect 
order.

Hence, the current aggregation mechanism, fails to provide an absolute method 
to align and order the missing events' timestamp in many edge cases.

Although a definite ordering is not always possible, I have come up with the 
following approach to generate the most probable orderings(using 'label_idxs' 
from all instances).           

1. Consider each index in 'label_idxs' as a node
2. While mantaining a set of visited nodes
3. Add adjacent indexes in 'label_idxs' as parent(or children), if they exist 
in visited nodes(Add to visited nodes, otherwise)
4. The result will be
  -> A degenerate binary tree (a successful definite order was found)
  -> A single multi-terminal unrooted tree(definite order, except where there 
are multiple terminals)
  -> A forest of disjoint trees(not functionally possible to find definite 
order, need to merge all trees in any order)
5. Level-order traversal of the final tree will provide a perfect order / one 
of the most probable orderings.

I believe this would result in a space and time complexity of O(n).            

In addition to the computational overhead, even after such a complex ordering 
process, still there is a possibility of not finding the perfect order(when it 
is not functionally definite, i.e. consider indexes [0, 1, 2] [3, 4, 5]).

Considering all of these factors, in such cases, I think any one of the 
following choices can be made, to preserve the simplicity and efficiency -
Either
1. Skip the small number of missing events(accept the rarely occuring small 
irregularity in timestamp's aggregate values)                                   
                                                                                
                                                                                
   
or                                                                              
                                                                                
                                                                                
                                                                                
                    
2. Ignore all the events from irregular instance(accept the loss of all 
instance's timestamp values in the aggregate)                                   
                                                                                
                                                    
or                                                                              
                                                                                
                                                                                
                                                                 
3. Declare a predifined order for event labels(accept the overhead of 
reordering timestamps, even if just a single event is missing)                  
                                                                                
                                                                                
              
(also, it may not be possible to have a predefined order for events in cases 
like AsyncCodegen)                                                              
                                                                                
                                                                                
                            
                                                                                
                                                                                
                                                                                
                                                   
After inspecting all these aspects, I have currently decided to go with the 
first option here.                                                              
                                                                                
                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
           
Please let me know, if there is a much simpler way through this or if I have 
misunderstood something.


--
To view, visit http://gerrit.cloudera.org:8080/21683
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I49e18a7a7e1288e3e674e15b6fc86aad60a08214
Gerrit-Change-Number: 21683
Gerrit-PatchSet: 8
Gerrit-Owner: Surya Hebbar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Surya Hebbar <[email protected]>
Gerrit-Comment-Date: Fri, 15 Nov 2024 21:03:04 +0000
Gerrit-HasComments: No

Reply via email to