alexeykudinkin commented on a change in pull request #4013:
URL: https://github.com/apache/hudi/pull/4013#discussion_r754578303
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
##########
@@ -30,16 +28,28 @@
private final String columnName;
private final T minValue;
private final T maxValue;
- private final long numNulls;
- private final PrimitiveStringifier stringifier;
+ private long numNulls;
+  // For Decimal/Date types, minValue/maxValue cannot represent the original value.
+  // E.g., when Parquet collects column statistics, a decimal is collected as an int/binary value,
+  // so we cannot use minValue and maxValue directly; use minValueAsString/maxValueAsString instead.
+  private final String minValueAsString;
Review comment:
Please take a look at my comment below -- I don't think this is necessary: we can handle all type conversions at the time of reading the footer, and we don't need to propagate them further.
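To illustrate the conversion-at-footer-read-time idea, here is a minimal standalone sketch (class and method names are hypothetical, not Hudi's actual code). Parquet stores DATE as an int32 counting days since the Unix epoch, and small DECIMAL(precision, scale) values as int32/int64 unscaled integers, so both can be turned back into their logical `Comparable` values as soon as the footer statistics are read:

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Hypothetical helper sketching the suggestion: convert Parquet's physical
// statistic values to their logical representation once, while reading the
// footer, instead of propagating stringified min/max values downstream.
public class FooterStatsConverter {

  // Parquet stores DATE as an int32 counting days since the Unix epoch.
  static LocalDate dateFromDays(int daysSinceEpoch) {
    return LocalDate.ofEpochDay(daysSinceEpoch);
  }

  // Parquet stores small DECIMAL(precision, scale) values as int32/int64
  // unscaled integers; rescale them to recover the logical value.
  static BigDecimal decimalFromUnscaled(long unscaled, int scale) {
    return BigDecimal.valueOf(unscaled, scale);
  }

  public static void main(String[] args) {
    System.out.println(dateFromDays(0));               // 1970-01-01
    System.out.println(decimalFromUnscaled(12345, 2)); // 123.45
  }
}
```

With this approach the converted values stay `Comparable`, so min/max aggregation across blocks needs no string-typed side channel.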
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
##########
@@ -283,17 +286,38 @@ public Boolean apply(String recordKey) {
/**
* Parse min/max statistics stored in parquet footers for all columns.
* Note: ParquetFileReader.readFooter is not a thread-safe method.
*
* @param conf hadoop configuration.
* @param parquetFilePath the Parquet file to be read.
* @param cols columns for which to collect statistics.
* @return a collection of HoodieColumnRangeMetadata instances.
*/
-  public Collection<HoodieColumnRangeMetadata<Comparable>> readRangeFromParquetMetadata(Configuration conf, Path parquetFilePath, List<String> cols) {
+  public Collection<HoodieColumnRangeMetadata<Comparable>> readRangeFromParquetMetadata(
+      Configuration conf,
+      Path parquetFilePath,
+      List<String> cols) {
ParquetMetadata metadata = readMetadata(conf, parquetFilePath);
// collect stats from all parquet blocks
    Map<String, List<HoodieColumnRangeMetadata<Comparable>>> columnToStatsListMap = metadata.getBlocks().stream().flatMap(blockMetaData -> {
-      return blockMetaData.getColumns().stream().filter(f -> cols.contains(f.getPath().toDotString())).map(columnChunkMetaData ->
-          new HoodieColumnRangeMetadata<>(parquetFilePath.getName(), columnChunkMetaData.getPath().toDotString(),
-              columnChunkMetaData.getStatistics().genericGetMin(),
-              columnChunkMetaData.getStatistics().genericGetMax(),
-              columnChunkMetaData.getStatistics().getNumNulls(),
-              columnChunkMetaData.getPrimitiveType().stringifier()));
+      return blockMetaData.getColumns().stream().filter(f -> cols.contains(f.getPath().toDotString())).map(columnChunkMetaData -> {
+        String minAsString;
+        String maxAsString;
+        if (columnChunkMetaData.getPrimitiveType().getOriginalType() == OriginalType.DATE) {
+          synchronized (lock) {
Review comment:
This has been addressed by #4060.
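For context on the thread-safety concern noted in the javadoc above, here is a minimal illustrative sketch of the pattern (class and method names are hypothetical; the actual fix landed in #4060): funnel every call to a non-thread-safe footer read through one shared lock so concurrent statistics collection cannot interleave reads.

```java
// Illustrative-only sketch of the pattern discussed in this thread:
// serializing access to a reader that is not safe to call concurrently.
// Names are hypothetical; the real fix is in apache/hudi PR #4060.
public class GuardedFooterReader {
  private static final Object FOOTER_READ_LOCK = new Object();

  // Stand-in for a non-thread-safe footer read such as
  // ParquetFileReader.readFooter.
  private static String unsafeReadFooter(String path) {
    return "footer-of:" + path;
  }

  // All callers funnel through one lock, so at most one footer read
  // runs at a time regardless of how many threads collect statistics.
  public static String readFooterSafely(String path) {
    synchronized (FOOTER_READ_LOCK) {
      return unsafeReadFooter(path);
    }
  }

  public static void main(String[] args) {
    System.out.println(readFooterSafely("/tmp/f.parquet")); // footer-of:/tmp/f.parquet
  }
}
```

Locking at this granularity trades read parallelism for safety; a per-file or per-reader lock would be a finer-grained alternative if footer reads dominate.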
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]