swuferhong commented on code in PR #20084:
URL: https://github.com/apache/flink/pull/20084#discussion_r930653031


##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveTableSource.java:
##########
@@ -261,6 +273,99 @@ public DynamicTableSource copy() {
         return source;
     }
 
+    @Override
+    public TableStats reportStatistics() {
+        try {
+            // Only BOUNDED sources are supported.
+            if (isStreamingSource()) {
+                return TableStats.UNKNOWN;
+            }
+            if (flinkConf.get(FileSystemConnectorOptions.SOURCE_REPORT_STATISTICS)
+                    != FileSystemConnectorOptions.FileStatisticsType.ALL) {
+                return TableStats.UNKNOWN;
+            }
+
+            HiveSourceBuilder sourceBuilder =
+                    new HiveSourceBuilder(jobConf, flinkConf, tablePath, hiveVersion, catalogTable)
+                            .setProjectedFields(projectedFields)
+                            .setLimit(limit);
+            int threadNum =
+                    flinkConf.get(HiveOptions.TABLE_EXEC_HIVE_LOAD_PARTITION_SPLITS_THREAD_NUM);
+            List<HiveTablePartition> hivePartitionsToRead =
+                    getAllPartitions(
+                            jobConf,
+                            hiveVersion,
+                            tablePath,
+                            catalogTable.getPartitionKeys(),
+                            remainingPartitions);
+            BulkFormat<RowData, HiveSourceSplit> defaultBulkFormat =
+                    sourceBuilder.createDefaultBulkFormat();
+            List<HiveSourceSplit> inputSplits =
+                    HiveSourceFileEnumerator.createInputSplits(
+                            0, hivePartitionsToRead, threadNum, jobConf);
+            if (inputSplits.size() != 0) {
+                TableStats tableStats;
+                if (defaultBulkFormat instanceof FileBasedStatisticsReportableInputFormat) {
+                    // If HiveInputFormat's useMapRedReader flag is false, Hive uses Flink's
+                    // InputFormat to read the data.
+                    tableStats =
+                            ((FileBasedStatisticsReportableInputFormat) defaultBulkFormat)
+                                    .reportStatistics(
+                                            inputSplits.stream()
+                                                    .map(FileSourceSplit::path)
+                                                    .collect(Collectors.toList()),
+                                            catalogTable.getSchema().toRowDataType());
+                } else {
+                    // If HiveInputFormat's useMapRedReader flag is true, Hive uses the MapRed
+                    // InputFormat to read the data.
+                    tableStats =
+                            getMapRedInputFormatStatistics(
+                                    inputSplits, catalogTable.getSchema().toRowDataType());
+                }
+                if (limit == null) {
+                    // If there is no limit push down, return the recomputed table stats.
+                    return tableStats;
+                } else {
+                    // If the table has a limit pushed down, return new table stats without
+                    // column stats.
+                    long newRowCount = Math.min(limit, tableStats.getRowCount());
+                    return new TableStats(newRowCount);
+                }
+            } else {
+                return TableStats.UNKNOWN;
+            }
+
+        } catch (Exception e) {
+            return TableStats.UNKNOWN;

Review Comment:
   > I think it will be better to log the exception
   
   done!
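   
   For reference, a minimal sketch of what the logged-exception fallback might look like after this change, assuming a standard SLF4J logger on `HiveTableSource` (the logger field and message wording here are assumptions, not the exact code in the PR):
   
   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   // Hypothetical logger field on HiveTableSource.
   private static final Logger LOG = LoggerFactory.getLogger(HiveTableSource.class);
   
   // ...
   } catch (Exception e) {
       // Log the failure instead of silently swallowing it, then fall back to
       // UNKNOWN so the planner can proceed without reported statistics.
       LOG.warn("Failed to report statistics for table {}.", tablePath, e);
       return TableStats.UNKNOWN;
   }
   ```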


