[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=580872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-580872 ]
ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/21 10:29
            Start Date: 12/Apr/21 10:29
    Worklog Time Spent: 10m 
      Work Description: kgyrtkirk commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r611489108


##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java
##########
@@ -422,7 +422,7 @@ private String extractTableFullName(StatsTask tsk) throws SemanticException {
       TableSpec tableSpec = new TableSpec(table, partitions);
       tableScan.getConf().getTableMetadata().setTableSpec(tableSpec);
 
-      if (BasicStatsNoJobTask.canUseFooterScan(table, inputFormat)) {
+      if (BasicStatsNoJobTask.canUseColumnStats(table, inputFormat)) {

Review comment:
       we have a `BasicStatsNoJobTask.canUseStats` and a `BasicStatsNoJobTask.canUseColumnStats` - I think "footerscan" is the basicstats stuff; could this be a typo?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
       Table table, Class<? extends InputFormat> inputFormat) {
+    return canUseColumnStats(table, inputFormat) || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class<? extends InputFormat> inputFormat) {

Review comment:
       this has nothing to do with column stats - that's a different thing

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -312,23 +384,23 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
     return ret;
   }
 
-  private int updatePartitions(Hive db, List<FooterStatCollector> scs, Table table) throws InvalidOperationException, HiveException {
+  private int updatePartitions(Hive db, List<StatCollector> scs, Table table) throws InvalidOperationException, HiveException {
     String tableFullName = table.getFullyQualifiedName();
 
     if (scs.isEmpty()) {
       return 0;
     }
 
    if (work.isStatsReliable()) {

Review comment:
       note: it might make sense to somehow communicate this `work.isStatsReliable` to the `StatCollector` so it can make that `LOG` entry if it has to...

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map<String, String> parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map<String, String> parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map<String, String> basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);
+
+        StatsSetupConst.setBasicStatsState(parameters, StatsSetupConst.TRUE);

Review comment:
       I don't understand why we make changes to the `Table` when we could be updating info of a partition as well... I guess in case of Iceberg you will not have regular partitions, so it will probably work for that correctly.
       I think here you want to change `parameters`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(

Review comment:
       do we really need 3 methods here? I think we only need 1 - and please add "Basic" to its name so that we are "more" clear with it

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map<String, String> parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }

Review comment:
       make this the default implementation

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();

Review comment:
       please either add a `get` prefix, or open up field access... these are internal classes - and also implement these methods as `final`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -265,30 +351,16 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
 
         Table table = tableSpecs.tableHandle;
 
-        Collection<Partition> partitions = null;
-        if (work.getPartitions() == null || work.getPartitions().isEmpty()) {
-          if (table.isPartitioned()) {
-            partitions = tableSpecs.partitions;
-          }
-        } else {
-          partitions = work.getPartitions();
-        }
+        List<Partish> partishes = getPartishes(table);
 
-        LinkedList<Partish> partishes = Lists.newLinkedList();
-        if (partitions == null) {
-          partishes.add(Partish.buildFor(table));
+        List<StatCollector> scs;
+        if (useBasicStatsFromStorageHandler(table)) {
+          scs = partishes.stream().map(HiveStorageHandlerStatCollector::new).collect(Collectors.toList());

Review comment:
       note: I think if you use a simple loop with a conditional it will be more readable... the current solution duplicates `partishes.stream().map` and `.collect(Collectors.toList())`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map<String, String> params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function<StatCollector, String> SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function<StatCollector, Partition> EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map<String, String> parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map<String, String> parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map<String, String> basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);
+
+        StatsSetupConst.setBasicStatsState(parameters, StatsSetupConst.TRUE);
+        parameters.putAll(basicStatistics);

Review comment:
       I don't understand why we need to do this; this will effectively copy all
       table props from the table object to the partition. I guess in case of Iceberg you will not have regular partitions, so it will probably work for that correctly... but in case there are some, this will be just weird...

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +119,17 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private Map<String, String> providedBasicStats;
 
     public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
       this.partish = partish;
       this.work = work;
       followedColStats1 = followedColStats2;
+      Table table = partish.getTable();
+      if (table.isNonNative()) {
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        providedBasicStats = table.getStorageHandler().getBasicStatistics(tableDesc);
+      }

Review comment:
       this `Processor` is created for a `Partish` and not for a table - this part will work incorrectly for partitions.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -312,23 +384,23 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
     return ret;
   }
 
-  private int updatePartitions(Hive db, List<FooterStatCollector> scs, Table table) throws InvalidOperationException, HiveException {
+  private int updatePartitions(Hive db, List<StatCollector> scs, Table table) throws InvalidOperationException, HiveException {
     String tableFullName = table.getFullyQualifiedName();
 
     if (scs.isEmpty()) {
       return 0;
     }
 
    if (work.isStatsReliable()) {
-      for (FooterStatCollector statsCollection : scs) {
-        if (statsCollection.result == null) {
-          LOG.debug("Stats requested to be reliable. Empty stats found: {}", statsCollection.partish.getSimpleName());
+      for (StatCollector statsCollection : scs) {
+        if (statsCollection.result() == null) {

Review comment:
       why do we need to call a method to get `result`? isn't this the `isValid` method?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -140,7 +147,7 @@ public Object process(StatsAggregator statsAggregator) throws HiveException, Met
         StatsSetupConst.clearColumnStatsState(parameters);
       }
 
-      if (partfileStatus == null) {
+      if (partfileStatus == null && providedBasicStats == null) {

Review comment:
       note: this conditional seems to represent a different path - would an early return be an option?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +119,17 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private Map<String, String> providedBasicStats;
 
     public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
       this.partish = partish;
       this.work = work;
       followedColStats1 = followedColStats2;
+      Table table = partish.getTable();
+      if (table.isNonNative()) {

Review comment:
       what about non-iceberg non-native tables?
       I think we can count the rows in a non-native table with the basic TS rowcounter.
       I think you are trying to identify your use case here by `isNonNative` - I think you would be better off saving your initial decision about how you plan to execute this task in the `work`

##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+    return true;
+  }
+
+  @Override
+  public Map<String, String> getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map<String, String> stats = new HashMap<>();
+    if (table.currentSnapshot() != null) {
+      Map<String, String> summary = table.currentSnapshot().summary();
+      if (summary != null) {
+        if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+          stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+        }
+        if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+          stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+        }
+        // TODO: add TOTAL_SIZE when iceberg 0.12 is released
+        if (summary.containsKey("total-files-size")) {
+          stats.put(StatsSetupConst.TOTAL_SIZE, summary.get("total-files-size"));
+        }
+      }
+    } else {
+      stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
       note that right now we don't take opportunities for `ROW_COUNT==0` cases (we treat it as nonexistent stats). it seems to me that it's already covered in the test - so we will notice this when we decide to improve on it later
       note: I think the current API doesn't make it possible to "remove" statistics data if we need to do that..

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 580872)
    Time Spent: 5h 40m  (was: 5.5h)

> In case of non-native tables use basic statistics from HiveStorageHandler
> -------------------------------------------------------------------------
>
>                 Key: HIVE-24928
>                 URL: https://issues.apache.org/jira/browse/HIVE-24928
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS`, all the basic statistics are collected by the BasicStatsTask class. This class tries to estimate the statistics by scanning the directory of the table.
> In the case of non-native tables (Iceberg, HBase), the table directory might contain metadata files as well, which would be counted by the BasicStatsTask when calculating basic stats.
> Instead of having this logic, the HiveStorageHandler implementation should provide basic statistics.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
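
Editor's note: the core of the approach discussed above - the storage handler republishing snapshot-summary entries under Hive's basic-stats parameter keys instead of Hive scanning the table directory - can be sketched in isolation. This is not the patch itself: the key strings below stand in for the `StatsSetupConst` and `SnapshotSummary` constants the real code uses, and the class/method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotStats {
    // Stand-in key names; the real values live in Hive's StatsSetupConst
    // and Iceberg's SnapshotSummary (assumed here, not taken from the patch).
    static final String NUM_FILES = "numFiles";
    static final String ROW_COUNT = "numRows";
    static final String TOTAL_SIZE = "totalSize";

    // Mirrors the shape of getBasicStatistics in the discussion: copy known
    // snapshot-summary entries into Hive basic-stats keys; a table with no
    // snapshot yet (summary == null here) reports zero files.
    static Map<String, String> toBasicStats(Map<String, String> summary) {
        Map<String, String> stats = new HashMap<>();
        if (summary == null) {
            stats.put(NUM_FILES, "0");
            return stats;
        }
        if (summary.containsKey("total-data-files")) {
            stats.put(NUM_FILES, summary.get("total-data-files"));
        }
        if (summary.containsKey("total-records")) {
            stats.put(ROW_COUNT, summary.get("total-records"));
        }
        if (summary.containsKey("total-files-size")) {
            stats.put(TOTAL_SIZE, summary.get("total-files-size"));
        }
        return stats;
    }
}
```

Because only recognized summary keys are copied, unrelated snapshot metadata never leaks into the table/partition parameters - which is the point of the fix: metadata files in the table directory no longer distort `numFiles`/`totalSize`.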