Copilot commented on code in PR #10149: URL: https://github.com/apache/gravitino/pull/10149#discussion_r2877967778
########## docs/manage-statistics-in-gravitino.md: ########## @@ -307,16 +307,17 @@ For example, if you set an extra property `foo` to `bar` for Lance storage optio For Lance remote storage, you can refer to the document [here](https://lancedb.github.io/lance/usage/storage/). -| Configuration item | Description | Default value | Required | Since version | -|----------------------------------------------------------------------|--------------------------------------|--------------------------------------|----------|---------------| -| `gravitino.stats.partition.storageOption.location` | The location of Lance files | `${GRAVITINO_HOME}/data/lance` | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.maxRowsPerFile` | The maximum rows per file | `1000000` | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.maxBytesPerFile` | The maximum bytes per file | `104857600` | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.maxRowsPerGroup` | The maximum rows per group | `1000000` | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.readBatchSize` | The batch record number when reading | `10000` | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.datasetCacheSize` | size of dataset cache for Lance | `0`, It means we don't use the cache | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.metadataFileCacheSizeBytes` | The Lance's metadata file cache size | `102400` | No | 1.0.0 | -| `gravitino.stats.partition.storageOption.indexCacheSizeBytes` | The Lance's index cache size | `102400` | No | 1.0.0 | +| Configuration item | Description | Default value | Required | Since version | +|----------------------------------------------------------------------|------------------------------------------------------------|--------------------------------------|----------|---------------| +| `gravitino.stats.partition.storageOption.location` | The location of Lance files | `${GRAVITINO_HOME}/data/lance` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.maxRowsPerFile` | The maximum rows per file | `1000000` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.maxBytesPerFile` | The maximum bytes per file | `104857600` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.maxRowsPerGroup` | The maximum rows per group | `1000000` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.readBatchSize` | The batch record number when reading | `10000` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.datasetCacheSize` | size of dataset cache for Lance | `0`, It means we don't use the cache | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.metadataFileCacheSizeBytes` | The Lance's metadata file cache size | `102400` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.indexCacheSizeBytes` | The Lance's index cache size | `102400` | No | 1.0.0 | +| `gravitino.stats.partition.storageOption.maxStatisticsPerUpdate` | Maximum number of statistics allowed per update operation | `100` | No | 1.0.0 | Review Comment: The newly added `maxStatisticsPerUpdate` row lists `Since version` as `1.0.0`, but this option is introduced in this PR. Please update the `Since version` column to the version where this configuration is first available (the repo currently targets `1.2.0-SNAPSHOT`). ```suggestion | `gravitino.stats.partition.storageOption.maxStatisticsPerUpdate` | Maximum number of statistics allowed per update operation | `100` | No | 1.2.0-SNAPSHOT | ``` ########## core/src/test/java/org/apache/gravitino/stats/storage/TestLancePartitionStatisticStorage.java: ########## @@ -378,6 +378,55 @@ public void testLancePartitionStatisticStorageWithCache() throws Exception { storage.close(); } + @Test + public void testExceedMaxStatisticsPerUpdateLimit() throws Exception { + PartitionStatisticStorageFactory factory = new LancePartitionStatisticStorageFactory(); + + String metalakeName = "metalake"; + String catalogName = "catalog"; + String schemaName = "schema"; + String tableName = "table"; + + MetadataObject metadataObject = + MetadataObjects.of( + Lists.newArrayList(catalogName, schemaName, tableName), MetadataObject.Type.TABLE); + + EntityStore entityStore = mock(EntityStore.class); + TableEntity tableEntity = mock(TableEntity.class); + when(entityStore.get(any(), any(), any())).thenReturn(tableEntity); + when(tableEntity.id()).thenReturn(101L); + FieldUtils.writeField(GravitinoEnv.getInstance(), "entityStore", entityStore, true); + + String location = Files.createTempDirectory("lance_stats_exceed_test").toString(); + Map<String, String> properties = Maps.newHashMap(); + properties.put("location", location); + // Default limit is 100, generate 101 statistics to exceed the limit + + LancePartitionStatisticStorage storage = + (LancePartitionStatisticStorage) factory.create(properties); + + // Generate 101 statistics which exceeds the default limit of 100 + int count = 101; + Map<MetadataObject, Map<String, Map<String, StatisticValue<?>>>> originData = + generateData(metadataObject, count, 1); + Map<MetadataObject, List<PartitionStatisticsUpdate>> statisticsToUpdate = + convertData(originData); + + List<MetadataObjectStatisticsUpdate> objectUpdates = Lists.newArrayList(); + for (Map.Entry<MetadataObject, List<PartitionStatisticsUpdate>> entry : + statisticsToUpdate.entrySet()) { + objectUpdates.add(MetadataObjectStatisticsUpdate.of(entry.getKey(), entry.getValue())); + } + + IllegalArgumentException exception = + Assertions.assertThrows( + IllegalArgumentException.class, + () -> storage.updateStatistics(metalakeName, objectUpdates)); + Assertions.assertTrue(exception.getMessage().contains("exceeds the maximum limit")); + + storage.close(); Review Comment: This test creates a temporary directory and a `LancePartitionStatisticStorage` instance but doesn’t guarantee cleanup if an assertion fails. Please use try/finally (or try-with-resources) to always `close()` the storage and delete the created temp/Lance dataset directory (as the other tests in this class do) to avoid leaking temp files/resources across test runs. ```suggestion String location = null; LancePartitionStatisticStorage storage = null; try { location = Files.createTempDirectory("lance_stats_exceed_test").toString(); Map<String, String> properties = Maps.newHashMap(); properties.put("location", location); // Default limit is 100, generate 101 statistics to exceed the limit storage = (LancePartitionStatisticStorage) factory.create(properties); // Generate 101 statistics which exceeds the default limit of 100 int count = 101; Map<MetadataObject, Map<String, Map<String, StatisticValue<?>>>> originData = generateData(metadataObject, count, 1); Map<MetadataObject, List<PartitionStatisticsUpdate>> statisticsToUpdate = convertData(originData); List<MetadataObjectStatisticsUpdate> objectUpdates = Lists.newArrayList(); for (Map.Entry<MetadataObject, List<PartitionStatisticsUpdate>> entry : statisticsToUpdate.entrySet()) { objectUpdates.add(MetadataObjectStatisticsUpdate.of(entry.getKey(), entry.getValue())); } IllegalArgumentException exception = Assertions.assertThrows( IllegalArgumentException.class, () -> storage.updateStatistics(metalakeName, objectUpdates)); Assertions.assertTrue(exception.getMessage().contains("exceeds the maximum limit")); } finally { if (storage != null) { storage.close(); } if (location != null) { FileUtils.deleteDirectory(new File(location)); } } ``` ########## core/src/main/java/org/apache/gravitino/stats/storage/LancePartitionStatisticStorage.java: ########## @@ -177,6 +180,14 @@ public LancePartitionStatisticStorage(Map<String, String> properties) { indexCacheSize > 0, "Lance partition statistics storage indexCacheSizeBytes must be positive"); + this.maxStatisticsPerUpdate = + Integer.parseInt( + properties.getOrDefault( + MAX_STATISTICS_PER_UPDATE, String.valueOf(DEFAULT_MAX_STATISTICS_PER_UPDATE))); + Preconditions.checkArgument( Review Comment: New validation is added for `maxStatisticsPerUpdate`, but there is no test covering invalid configurations (e.g., `0`, negative values, or non-integer values). Please add a dedicated test that asserts the factory/storage creation fails with an appropriate exception/message when `maxStatisticsPerUpdate` is invalid, to match the project’s “every change must have a corresponding test” rule. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
