srinikandi opened a new issue #5047: URL: https://github.com/apache/hudi/issues/5047
Hi I have been using Apache Hudi Connector for Glue (hudi 0.90 version) and facing small file creation problem while inserting data into a hudi table. The input file is about 17 GB with 313 parquet part files. Each averaging around 70 mb. When I try to insert the data into Hudi table with overwrite option, this ends up creating some 7000 plus parquet part files, each with 6.5 MB. I did utilize the small file size and max file size parameters while writing. here is the config that I used . The intention was to create file sizes between 60 - 80 MB. common_config = { "hoodie.datasource.write.hive_style_partitioning": "true", "className": "org.apache.hudi", "hoodie.datasource.hive_sync.use_jdbc": "false", "hoodie.datasource.write.recordkey.field": "C1,C2,C3,C4", "hoodie.datasource.write.precombine.field": "", "hoodie.table.name": "TEST_TABLE", "hoodie.datasource.hive_sync.database": "TEST_DATABASE_RAW", "hoodie.datasource.hive_sync.table": "TEST_TABLE", "hoodie.datasource.hive_sync.enable": "true", "hoodie.datasource.write.partitionpath.field": "", "hoodie.datasource.hive_sync.support_timestamp": "true", "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor", "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator", "hoodie.parquet.small.file.limit": "62914560", "hoodie.parquet.max.file.limit": "83886080", "hoodie.parquet.block.size": "62914560", "hoodie.insert.shuffle.parallelism": "5", "hoodie.datasource.write.operation": "insert" } full_ref_files is list with all of the parquet part file from the input folder. full_load_df = spark.read.parquet(*full_ref_files) full_load_df.write.format("hudi").options(**conf).mode("overwrite").save(raw_table_path) Below are the spark logs, where the last line shows that it create some 1000 plus partitions while writing to Hudi table. Any insights are deeply appreciated. 2022-03-15T06:42:23.502-05:00 | [Stage 1 (javaToPython at NativeMethodAccessorImpl.java:0):> (0 + 113) / 116] -- | -- | 2022-03-15T06:42:25.396-05:00 | [Stage 1 (javaToPython at NativeMethodAccessorImpl.java:0):> (9 + 107) / 116] | 2022-03-15T06:42:26.894-05:00 | [Stage 1 (javaToPython at NativeMethodAccessorImpl.java:0):==> (100 + 16) / 116] | 2022-03-15T06:42:28.379-05:00 | [Stage 3 (collect at /tmp/stage-to-raw-etl-glue-job-poc-1.py:105):()(83 + 33) / 116] | 2022-03-15T06:43:06.306-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(0 + 116) / 117] | 2022-03-15T06:43:12.079-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(1 + 116) / 117] | 2022-03-15T06:43:19.568-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(3 + 114) / 117] | 2022-03-15T06:43:22.290-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(3 + 114) / 117] | 2022-03-15T06:43:23.907-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(5 + 112) / 117] | 2022-03-15T06:43:25.782-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(5 + 112) / 117] | 2022-03-15T06:43:27.302-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(10 + 107) / 117] | 2022-03-15T06:43:31.700-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(11 + 106) / 117] | 2022-03-15T06:43:35.656-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(12 + 105) / 117] | 2022-03-15T06:44:05.906-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(13 + 104) / 117] | 2022-03-15T06:44:15.512-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(13 + 104) / 117] | 2022-03-15T06:44:20.853-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(16 + 101) / 117] | 2022-03-15T06:44:53.649-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(17 + 100) / 117] | 2022-03-15T06:45:08.909-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(17 + 100) / 117] | 2022-03-15T06:45:14.894-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(19 + 98) / 117] | 2022-03-15T06:45:26.452-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(20 + 97) / 117] | 2022-03-15T06:45:36.321-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(21 + 96) / 117] | 2022-03-15T06:45:38.712-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(21 + 96) / 117] | 2022-03-15T06:45:41.935-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(23 + 94) / 117] | 2022-03-15T06:45:43.177-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(25 + 92) / 117] | 2022-03-15T06:45:48.221-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(26 + 91) / 117] | 2022-03-15T06:46:00.568-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(28 + 89) / 117] | 2022-03-15T06:46:06.605-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(28 + 89) / 117] | 2022-03-15T06:46:10.268-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(29 + 88) / 117] | 2022-03-15T06:46:12.260-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(31 + 86) / 117] | 2022-03-15T06:46:15.381-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(33 + 84) / 117] | 2022-03-15T06:46:18.751-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(34 + 83) / 117] | 2022-03-15T06:46:20.892-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(36 + 81) / 117] | 2022-03-15T06:46:22.348-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(36 + 81) / 117] | 2022-03-15T06:46:23.410-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(38 + 79) / 117] | 2022-03-15T06:46:25.525-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(40 + 77) / 117] | 2022-03-15T06:46:27.795-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(45 + 72) / 117] | 2022-03-15T06:46:29.397-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(46 + 71) / 117] | 2022-03-15T06:46:30.493-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(49 + 68) / 117] | 2022-03-15T06:46:31.580-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(50 + 67) / 117] | 2022-03-15T06:46:33.033-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(53 + 64) / 117] | 2022-03-15T06:46:34.046-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(59 + 58) / 117] | 2022-03-15T06:46:35.666-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(64 + 53) / 117] | 2022-03-15T06:46:36.910-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(66 + 51) / 117] | 2022-03-15T06:46:38.815-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(68 + 49) / 117] | 2022-03-15T06:46:40.080-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(69 + 48) / 117] | 2022-03-15T06:46:41.088-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(72 + 45) / 117] | 2022-03-15T06:46:42.917-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(76 + 41) / 117] | 2022-03-15T06:46:44.805-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(80 + 37) / 117] | 2022-03-15T06:46:46.980-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(84 + 33) / 117] | 2022-03-15T06:46:48.120-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(86 + 31) / 117] | 2022-03-15T06:46:49.728-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(89 + 28) / 117] | 2022-03-15T06:46:50.893-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(90 + 27) / 117] | 2022-03-15T06:46:52.211-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(93 + 24) / 117] | 2022-03-15T06:46:53.544-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(97 + 20) / 117] | 2022-03-15T06:46:54.871-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(102 + 15) / 117] | 2022-03-15T06:46:55.892-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(105 + 12) / 117] | 2022-03-15T06:46:58.431-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(107 + 10) / 117] | 2022-03-15T06:46:59.667-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(110 + 7) / 117] | 2022-03-15T06:47:02.750-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(113 + 4) / 117] | 2022-03-15T06:47:14.859-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(114 + 3) / 117] | 2022-03-15T06:47:17.427-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(116 + 1) / 117] | 2022-03-15T06:47:21.640-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175):()(117 + 0) / 117] | 2022-03-15T06:47:27.011-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (0 + 117) / 117] | 2022-03-15T06:47:28.283-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (2 + 115) / 117] | 2022-03-15T06:47:29.508-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (6 + 111) / 117] | 2022-03-15T06:47:30.576-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):()(10 + 107) / 117] | 2022-03-15T06:47:32.991-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):()(15 + 102) / 117] | 2022-03-15T06:47:34.455-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (18 + 99) / 117] | 2022-03-15T06:47:35.560-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (23 + 94) / 117] | 2022-03-15T06:47:36.575-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (29 + 88) / 117] | 2022-03-15T06:47:37.689-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (41 + 76) / 117] | 2022-03-15T06:47:38.788-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (64 + 53) / 117] | 2022-03-15T06:47:39.865-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (89 + 28) / 117] | 2022-03-15T06:47:40.927-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (110 + 7) / 117] | 2022-03-15T06:47:58.272-05:00 | [Stage 12 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (1 + 3) / 4] | 2022-03-15T06:48:04.654-05:00 | [Stage 14 (isEmpty at HoodieSparkSqlWriter.scala:609):> (1 + 19) / 20] | 2022-03-15T06:48:06.493-05:00 | [Stage 14 (isEmpty at HoodieSparkSqlWriter.scala:609):> (3 + 17) / 20] | 2022-03-15T06:48:16.497-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):> (0 + 100) / 100] | 2022-03-15T06:48:17.592-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (35 + 65) / 100] | 2022-03-15T06:48:19.544-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (37 + 63) / 100] | 2022-03-15T06:48:20.588-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (51 + 49) / 100] | 2022-03-15T06:48:21.593-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):======> (77 + 23) / 100] | 2022-03-15T06:48:23.827-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):=========> (93 + 7) / 100] | 2022-03-15T06:48:37.952-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):> (0 + 117) / 500] | 2022-03-15T06:48:38.995-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):> (68 + 117) / 500] | 2022-03-15T06:48:40.006-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):> (110 + 117) / 500] | 2022-03-15T06:48:51.385-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):> (116 + 117) / 500] | 2022-03-15T06:48:52.411-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=> (155 + 117) / 500] | 2022-03-15T06:48:53.421-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (214 + 117) / 500] | 2022-03-15T06:48:55.455-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (231 + 116) / 500] | 2022-03-15T06:49:04.055-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (232 + 117) / 500] | 2022-03-15T06:49:05.162-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (237 + 117) / 500] | 2022-03-15T06:49:06.169-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):===> (287 + 117) / 500] | 2022-03-15T06:49:07.176-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (320 + 117) / 500] | 2022-03-15T06:49:08.304-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (345 + 117) / 500] | 2022-03-15T06:49:17.407-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (348 + 116) / 500] | 2022-03-15T06:49:18.449-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (349 + 117) / 500] | 2022-03-15T06:49:19.450-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=====> (386 + 114) / 500] | 2022-03-15T06:49:20.457-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):======> (421 + 79) / 500] | 2022-03-15T06:49:21.528-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=======> (452 + 48) / 500] | 2022-03-15T06:49:24.903-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=======> (465 + 35) / 500] | 2022-03-15T06:49:26.307-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=======> (466 + 34) / 500] | 2022-03-15T06:49:27.724-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=======> (470 + 30) / 500] | 2022-03-15T06:49:31.513-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):=======> (481 + 19) / 500] | 2022-03-15T06:49:32.746-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):========> (496 + 4) / 500] | 2022-03-15T06:49:46.207-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):> (0 + 117) / 473] | 2022-03-15T06:49:47.211-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):> (61 + 117) / 473] | 2022-03-15T06:49:48.305-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):> (114 + 117) / 473] | 2022-03-15T06:49:59.447-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):> (116 + 117) / 473] | 2022-03-15T06:50:00.457-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):=> (158 + 117) / 473] | 2022-03-15T06:50:01.463-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (205 + 117) / 473] | 2022-03-15T06:50:02.652-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (231 + 117) / 473] | 2022-03-15T06:50:12.642-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (232 + 117) / 473] | 2022-03-15T06:50:13.642-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):===> (261 + 117) / 473] | 2022-03-15T06:50:14.654-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (300 + 117) / 473] | 2022-03-15T06:50:15.668-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (335 + 117) / 473] | 2022-03-15T06:50:17.172-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (347 + 117) / 473] | 2022-03-15T06:50:25.789-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):====> (348 + 116) / 473] | 2022-03-15T06:50:26.840-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):=====> (372 + 101) / 473] | 2022-03-15T06:50:27.869-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):======> (406 + 67) / 473] | 2022-03-15T06:50:28.903-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):=======> (445 + 28) / 473] | 2022-03-15T06:50:31.957-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):========> (465 + 8) / 473] | 2022-03-15T06:50:33.009-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):========> (468 + 5) / 473] | 2022-03-15T06:50:34.267-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):========> (470 + 3) / 473] | 2022-03-15T06:50:35.294-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):=========>(473 + 0) / 473] | 2022-03-15T06:50:36.433-05:00 | [Stage 22 (collect at SparkRDDWriteClient.java:123):=====> (773 + 115) / 1098] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org