[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=580872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-580872 ]
ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/21 10:29
            Start Date: 12/Apr/21 10:29
    Worklog Time Spent: 10m 
      Work Description: kgyrtkirk commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r611489108


##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java
##########
@@ -422,7 +422,7 @@ private String extractTableFullName(StatsTask tsk) throws SemanticException {
       TableSpec tableSpec = new TableSpec(table, partitions);
       tableScan.getConf().getTableMetadata().setTableSpec(tableSpec);
 
-      if (BasicStatsNoJobTask.canUseFooterScan(table, inputFormat)) {
+      if (BasicStatsNoJobTask.canUseColumnStats(table, inputFormat)) {

Review comment:
       we have a `BasicStatsNoJobTask.canUseStats` and a `BasicStatsNoJobTask.canUseColumnStats` - I think "footerscan" is the basicstats stuff; could this be a typo?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
       Table table, Class<? extends InputFormat> inputFormat) {
+    return canUseColumnStats(table, inputFormat) || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class<? extends InputFormat> inputFormat) {

Review comment:
       this has nothing to do with column stats - that's a different thing

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -312,23 +384,23 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
     return ret;
   }
 
-  private int updatePartitions(Hive db, List<FooterStatCollector> scs, Table table) throws InvalidOperationException, HiveException {
+  private int updatePartitions(Hive db, List<StatCollector> scs, Table table) throws InvalidOperationException, HiveException {
     String tableFullName = table.getFullyQualifiedName();
 
     if (scs.isEmpty()) {
       return 0;
     }
 
    if (work.isStatsReliable()) {

Review comment:
       note: it might make sense to somehow communicate this `work.isStatsReliable` to the `StatCollector` so it can make that `LOG` entry if it has to...

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map<String, String> parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map<String, String> parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map<String, String> basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);
+
+        StatsSetupConst.setBasicStatsState(parameters, StatsSetupConst.TRUE);

Review comment:
       I don't understand why we make changes to the `Table` when we could be updating info of a partition as well... I guess in case of Iceberg you will not have regular partitions, so it will probably work for that correctly.
       I think here you want to change `parameters`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(

Review comment:
       do we really need 3 methods here? I think we only need 1 - and please add "Basic" to its name so that we are "more" clear with it

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map<String, String> parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }

Review comment:
       make this the default implementation

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();

Review comment:
       please either add a `get` prefix, or open up field access... these are internal classes - and also implement these methods as `final`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -265,30 +351,16 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
 
         Table table = tableSpecs.tableHandle;
 
-        Collection<Partition> partitions = null;
-        if (work.getPartitions() == null || work.getPartitions().isEmpty()) {
-          if (table.isPartitioned()) {
-            partitions = tableSpecs.partitions;
-          }
-        } else {
-          partitions = work.getPartitions();
-        }
+        List<Partish> partishes = getPartishes(table);
 
-        LinkedList<Partish> partishes = Lists.newLinkedList();
-        if (partitions == null) {
-          partishes.add(Partish.buildFor(table));
+        List<StatCollector> scs;
+        if (useBasicStatsFromStorageHandler(table)) {
+          scs = partishes.stream().map(HiveStorageHandlerStatCollector::new).collect(Collectors.toList());

Review comment:
       note: I think if you use a simple loop with a conditional it will be more readable... the current solution duplicates `partishes.stream().map` and `.collect(Collectors.toList())`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map<String, String> parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map<String, String> parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map<String, String> basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);
+
+        StatsSetupConst.setBasicStatsState(parameters, StatsSetupConst.TRUE);
+        parameters.putAll(basicStatistics);

Review comment:
       I don't understand why we need to do this; this will effectively copy all
       table props from the table object to the partition. I guess in case of Iceberg you will not have regular partitions, so it will probably work for that correctly... but in case there are some, this will be just weird...

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +119,17 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private Map<String, String> providedBasicStats;
 
     public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
       this.partish = partish;
       this.work = work;
       followedColStats1 = followedColStats2;
+      Table table = partish.getTable();
+      if (table.isNonNative()) {
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        providedBasicStats = table.getStorageHandler().getBasicStatistics(tableDesc);
+      }

Review comment:
       this `Processor` is created for a `Partish` and not for a table - this part will work incorrectly for partitions.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -312,23 +384,23 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
     return ret;
   }
 
-  private int updatePartitions(Hive db, List<FooterStatCollector> scs, Table table) throws InvalidOperationException, HiveException {
+  private int updatePartitions(Hive db, List<StatCollector> scs, Table table) throws InvalidOperationException, HiveException {
     String tableFullName = table.getFullyQualifiedName();
 
     if (scs.isEmpty()) {
       return 0;
     }
 
    if (work.isStatsReliable()) {
-      for (FooterStatCollector statsCollection : scs) {
-        if (statsCollection.result == null) {
-          LOG.debug("Stats requested to be reliable. Empty stats found: {}", statsCollection.partish.getSimpleName());
+      for (StatCollector statsCollection : scs) {
+        if (statsCollection.result() == null) {

Review comment:
       why do we need to call a method to get `result`? isn't this the `isValid` method?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -140,7 +147,7 @@ public Object process(StatsAggregator statsAggregator) throws HiveException, Met
         StatsSetupConst.clearColumnStatsState(parameters);
       }
 
-      if (partfileStatus == null) {
+      if (partfileStatus == null && providedBasicStats == null) {

Review comment:
       note: this conditional seems to represent a different path - would an early return be an option?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +119,17 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private Map<String, String> providedBasicStats;
 
     public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
       this.partish = partish;
       this.work = work;
       followedColStats1 = followedColStats2;
+      Table table = partish.getTable();
+      if (table.isNonNative()) {

Review comment:
       what about non-iceberg non-native tables?
       I think we can count the rows in a non-native table with the basic TS rowcounter.
       I think you are trying to identify your use case here by `isNonNative` - I think you would be better off saving your initial decision about how you plan to execute this task in the `work`

##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+    return true;
+  }
+
+  @Override
+  public Map<String, String> getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map<String, String> stats = new HashMap<>();
+    if (table.currentSnapshot() != null) {
+      Map<String, String> summary = table.currentSnapshot().summary();
+      if (summary != null) {
+        if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+          stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+        }
+        if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+          stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+        }
+        // TODO: add TOTAL_SIZE when iceberg 0.12 is released
+        if (summary.containsKey("total-files-size")) {
+          stats.put(StatsSetupConst.TOTAL_SIZE, summary.get("total-files-size"));
+        }
+      }
+    } else {
+      stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
       note that right now we don't take opportunities for `ROW_COUNT==0` cases (we treat it as nonexistent stats). it seems to me that it's already covered in the test - so we will notice this when we decide to improve on it later
       note: I think the current API doesn't make it possible to "remove" statistics data if we need to do that..

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 580872)
    Time Spent: 5h 40m  (was: 5.5h)

> In case of non-native tables use basic statistics from HiveStorageHandler
> -------------------------------------------------------------------------
>
>                 Key: HIVE-24928
>                 URL: https://issues.apache.org/jira/browse/HIVE-24928
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS`, all the basic statistics are collected by the BasicStatsTask class. This class tries to estimate the statistics by scanning the directory of the table.
> In the case of non-native tables (Iceberg, HBase), the table directory might contain metadata files as well, which would be counted by the BasicStatsTask when calculating basic stats.
> Instead of having this logic, the HiveStorageHandler implementation should provide basic statistics.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
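
Editor's note: the core of the approach discussed above - the storage handler republishing snapshot-summary entries under Hive's basic-stats parameter keys instead of Hive scanning the table directory - can be sketched in isolation. This is not the patch itself: the key strings below stand in for the `StatsSetupConst` and `SnapshotSummary` constants the real code uses, and the class/method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotStats {
    // Stand-in key names; the real values live in Hive's StatsSetupConst
    // and Iceberg's SnapshotSummary (assumed here, not taken from the patch).
    static final String NUM_FILES = "numFiles";
    static final String ROW_COUNT = "numRows";
    static final String TOTAL_SIZE = "totalSize";

    // Mirrors the shape of getBasicStatistics in the discussion: copy known
    // snapshot-summary entries into Hive basic-stats keys; a table with no
    // snapshot yet (summary == null here) reports zero files.
    static Map<String, String> toBasicStats(Map<String, String> summary) {
        Map<String, String> stats = new HashMap<>();
        if (summary == null) {
            stats.put(NUM_FILES, "0");
            return stats;
        }
        if (summary.containsKey("total-data-files")) {
            stats.put(NUM_FILES, summary.get("total-data-files"));
        }
        if (summary.containsKey("total-records")) {
            stats.put(ROW_COUNT, summary.get("total-records"));
        }
        if (summary.containsKey("total-files-size")) {
            stats.put(TOTAL_SIZE, summary.get("total-files-size"));
        }
        return stats;
    }
}
```

Because only recognized summary keys are copied, unrelated snapshot metadata never leaks into the table/partition parameters - which is the point of the fix: metadata files in the table directory no longer distort `numFiles`/`totalSize`.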