[ https://issues.apache.org/jira/browse/HIVE-27000?focusedWorklogId=842662&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-842662 ]
ASF GitHub Bot logged work on HIVE-27000: ----------------------------------------- Author: ASF GitHub Bot Created on: 31/Jan/23 18:12 Start Date: 31/Jan/23 18:12 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3997: URL: https://github.com/apache/hive/pull/3997#discussion_r1092311907 ########## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DateColumnStatsMerger.java: ########## @@ -43,64 +46,57 @@ public void merge(ColumnStatisticsObj aggregateColStats, ColumnStatisticsObj new DateColumnStatsDataInspector aggregateData = dateInspectorFromStats(aggregateColStats); DateColumnStatsDataInspector newData = dateInspectorFromStats(newColStats); - setLowValue(aggregateData, newData); - setHighValue(aggregateData, newData); - - aggregateData.setNumNulls(aggregateData.getNumNulls() + newData.getNumNulls()); - if (aggregateData.getNdvEstimator() == null || newData.getNdvEstimator() == null) { - aggregateData.setNumDVs(Math.max(aggregateData.getNumDVs(), newData.getNumDVs())); - } else { - NumDistinctValueEstimator oldEst = aggregateData.getNdvEstimator(); - NumDistinctValueEstimator newEst = newData.getNdvEstimator(); - final long ndv; - if (oldEst.canMerge(newEst)) { - oldEst.mergeEstimators(newEst); - ndv = oldEst.estimateNumDistinctValues(); - aggregateData.setNdvEstimator(oldEst); - } else { - ndv = Math.max(aggregateData.getNumDVs(), newData.getNumDVs()); - } - LOG.debug("Use bitvector to merge column {}'s ndvs of {} and {} to be {}", aggregateColStats.getColName(), - aggregateData.getNumDVs(), newData.getNumDVs(), ndv); - aggregateData.setNumDVs(ndv); + Date lowValue = mergeLowValue(getLowValue(aggregateData), getLowValue(newData)); + if (lowValue != null) { + aggregateData.setLowValue(lowValue); + } + Date highValue = mergeHighValue(getHighValue(aggregateData), getHighValue(newData)); + if (highValue != null) { Review Comment: Thanks @akshat0395 for your comment, you are right, but unfortunately the problem lies in the `*ColumnStatsDataInspector` objects, which inherits from the corresponding `*ColumnStatsData`, which don't have a common ancestor. For this reason it's hard to write a generic method, we would need to rely on generic for the data type (say `Double`) and anyway a big switch for handling the *ColumnStatsDataInspector objects and casting, it would rapidly defeat the purpose. This lack of class hierarchy for column statistics comes directly from the thrift interface objects for the column statistics in the metastore (see [here](https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift)). HIVE-26313 could be a chance to overcome that, and define a better hierarchy for those classes, since we won't anyway be having a 1-1 mapping from a column in the metastore for column statistics and a Java class, but we will have to parse a (probably Json) blob and do the "mapping" ourselves. In summary, I'd rather leave this for when we will have a better class hierarchy for those objects (hopefully with HIVE-26313). WDYT? Issue Time Tracking ------------------- Worklog Id: (was: 842662) Time Spent: 1h (was: 50m) > Improve the modularity of the *ColumnStatsMerger classes > -------------------------------------------------------- > > Key: HIVE-27000 > URL: https://issues.apache.org/jira/browse/HIVE-27000 > Project: Hive > Issue Type: Improvement > Components: Statistics > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > *ColumnStatsMerger classes contain a lot of duplicate code which is not > specific to the data type, and that could therefore be lifted to a common > parent class. > This phenomenon is bound to become even worse if we keep enriching further > our supported set of statistics as we did in the context of HIVE-26221. > The current ticket aims at improving the modularity and code reuse of the > *ColumnStatsMerger classes, while improving unit-test coverage to cover all > classes and support more use-cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)