alexeykudinkin commented on a change in pull request #5181:
URL: https://github.com/apache/hudi/pull/5181#discussion_r840111400



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1126,23 +1127,29 @@ public static void aggregateColumnStats(IndexedRecord record, List<Schema.Field>
     }
 
     fields.forEach(field -> {
-      Map<String, Object> columnStats = columnToStats.getOrDefault(field.name(), new HashMap<>());
-      final String fieldVal = getNestedFieldValAsString((GenericRecord) record, field.name(), true, consistentLogicalTimestampEnabled);
+      Map<String, Object> columnStats = columnToStats.get(field.name());
+      GenericRecord genericRecord = (GenericRecord) record;
+      final Object fieldVal = convertValueForSpecificDataTypes(field.schema(), genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+      final Schema fieldSchema = getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
       // update stats
-      final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
-      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
-      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()) + fieldSize);
+      // NOTE: Unlike Parquet, Avro does not give the field size.
+      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()));
 
-      if (!isNullOrEmpty(fieldVal)) {
+      if (fieldVal != null) {
         // set the min value of the field
         if (!columnStats.containsKey(MIN)) {
           columnStats.put(MIN, fieldVal);
         }
-        if (fieldVal.compareTo(String.valueOf(columnStats.get(MIN))) < 0) {
+        if (compare(fieldVal, columnStats.get(MIN), fieldSchema) < 0) {
           columnStats.put(MIN, fieldVal);
         }
         // set the max value of the field
-        if (fieldVal.compareTo(String.valueOf(columnStats.getOrDefault(MAX, ""))) > 0) {
+        if (!columnStats.containsKey(MAX)) {
+          columnStats.put(MAX, fieldVal);
+        }
+        // set the max value of the field
+        if (compare(fieldVal, columnStats.get(MAX), fieldSchema) > 0) {

Review comment:
       Column stats only work for top-level columns. For nested fields they 
are useless, since we are only collecting bounds for the whole object.
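
To illustrate the concern, here is a minimal sketch that uses plain `java.util.Map` values as a stand-in for an Avro `GenericRecord` (an assumption made only to keep the example self-contained). The class name, `sampleRecord`, and `resolvePath` are hypothetical and are not part of Hudi. It shows that a lookup by top-level name hands back the whole nested object, so a bound for a leaf such as `address.zip` would need an explicit walk along the dotted path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NestedFieldLookupSketch {

  // Hypothetical helper (not a Hudi API): walks a dot-separated path
  // through nested maps and returns the leaf value, or null if the
  // path cannot be followed.
  public static Object resolvePath(Map<String, Object> record, String dottedPath) {
    Object current = record;
    for (String part : dottedPath.split("\\.")) {
      if (!(current instanceof Map)) {
        return null; // the path goes deeper than the data does
      }
      current = ((Map<?, ?>) current).get(part);
    }
    return current;
  }

  // Builds a record with one primitive column and one nested column.
  public static Map<String, Object> sampleRecord() {
    Map<String, Object> address = new LinkedHashMap<>();
    address.put("zip", 94105);

    Map<String, Object> record = new LinkedHashMap<>();
    record.put("id", 7L);
    record.put("address", address);
    return record;
  }

  public static void main(String[] args) {
    Map<String, Object> record = sampleRecord();

    // Lookup by top-level name: the value is the entire nested object,
    // so any min/max computed from it describes the object, not its leaves.
    System.out.println(record.get("address"));

    // Walking the dotted path reaches the leaf that a per-field bound would need.
    System.out.println(resolvePath(record, "address.zip"));
  }
}
```

Whether the fix belongs in the value lookup or in restricting stats to top-level columns is a design choice for the PR; the sketch only demonstrates the difference between the two lookups.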





