richardstartin edited a comment on pull request #7828: URL: https://github.com/apache/pinot/pull/7828#issuecomment-978891304
> I actually considered that, and decided to go with the snapshot approach because we want to optimize the query runtime method as much as possible

I don't think this is likely to be an optimisation unless the `removeAll` call is made frequently and the number of empty segments is large. When an optimisation makes the code much harder to read, I think we need four pieces of evidence:

1. **evidence that there is a problem** - e.g. wall-time-thresholded tracing events around `EmptySegmentPruner.prune`, or a sampling profiler reporting lots of samples in this method
2. **evidence that the thing being optimised causes the problem** - assuming we have data that there is a problem, is the cause the `removeAll` call (which using a `HashSet` may optimise relative to a concurrent variant), or the copy on line 151?
3. **evidence the problem is common** - e.g. statistics about the number of empty segments from a real cluster; do we have a metric for this?
4. **evidence the optimisation is effective** - for widely observed set sizes, does it actually make a difference to use a `HashSet` instead of a concurrent set?

I raise this only because the code is *much* harder to read than if it used a thread-safe `Set`. If it were as readable, I would be inclined to assume that the optimisation is effective.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
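The readability argument above can be sketched concretely. The following is a minimal illustration of the thread-safe `Set` alternative being suggested, not Pinot's actual code; the class and method names (`EmptySegmentTracker`, `markEmpty`, `removeSegments`) are hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track empty segment names in a ConcurrentHashMap-backed
// key set instead of snapshotting into a plain HashSet. All names here are
// illustrative and are not Pinot's actual API.
public class EmptySegmentTracker {

    // Thread-safe set: concurrent adds, removes, and reads need no external locking.
    private final Set<String> emptySegments = ConcurrentHashMap.newKeySet();

    public void markEmpty(String segmentName) {
        emptySegments.add(segmentName);
    }

    // Bulk removal when segments are deleted or become non-empty; this is the
    // removeAll call whose cost relative to a HashSet is being questioned.
    public void removeSegments(Set<String> removed) {
        emptySegments.removeAll(removed);
    }

    public boolean isEmpty(String segmentName) {
        return emptySegments.contains(segmentName);
    }

    public static void main(String[] args) {
        EmptySegmentTracker tracker = new EmptySegmentTracker();
        tracker.markEmpty("seg_0");
        tracker.markEmpty("seg_1");
        tracker.removeSegments(Set.of("seg_0"));
        System.out.println(tracker.isEmpty("seg_0")); // false
        System.out.println(tracker.isEmpty("seg_1")); // true
    }
}
```

Whether this is measurably slower than the snapshot-and-`HashSet` approach for realistic set sizes is exactly the kind of question a microbenchmark (point 4 above) would answer.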
