srinikandi opened a new issue #5047:
URL: https://github.com/apache/hudi/issues/5047


   Hi
   I have been using Apache Hudi Connector for Glue (hudi 0.90 version) and 
facing small file creation problem while inserting data into a hudi table. The 
input file is about 17 GB with 313 parquet part files. Each averaging around 70 
mb. When I try to insert the data into Hudi table with overwrite option, this 
ends up creating some 7000 plus parquet part files, each with 6.5 MB. I did 
utilize the small file size and max file size parameters while writing. 
   here is the config that I used . The intention was to create file sizes 
between 60 - 80 MB.
    common_config = {
           "hoodie.datasource.write.hive_style_partitioning": "true",
           "className": "org.apache.hudi",
           "hoodie.datasource.hive_sync.use_jdbc": "false",
           "hoodie.datasource.write.recordkey.field": "C1,C2,C3,C4",
           "hoodie.datasource.write.precombine.field": "",
           "hoodie.table.name": "TEST_TABLE",
           "hoodie.datasource.hive_sync.database": "TEST_DATABASE_RAW",
           "hoodie.datasource.hive_sync.table": "TEST_TABLE",
           "hoodie.datasource.hive_sync.enable": "true",
           "hoodie.datasource.write.partitionpath.field": "",
           "hoodie.datasource.hive_sync.support_timestamp": "true",
           "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
           "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
           "hoodie.parquet.small.file.limit": "62914560",
           "hoodie.parquet.max.file.limit": "83886080",
           "hoodie.parquet.block.size": "62914560",
           "hoodie.insert.shuffle.parallelism": "5",
           "hoodie.datasource.write.operation": "insert"
           }
   full_ref_files is list with all of the parquet part file from the input 
folder.
    full_load_df = spark.read.parquet(*full_ref_files)
   
full_load_df.write.format("hudi").options(**conf).mode("overwrite").save(raw_table_path)
   
   Below are the spark logs, where the last line shows that it create some 1000 
plus partitions while writing to Hudi table.
   Any insights are deeply appreciated.
   
   2022-03-15T06:42:23.502-05:00 | [Stage 1 (javaToPython at 
NativeMethodAccessorImpl.java:0):> (0 + 113) / 116]
   -- | --
     | 2022-03-15T06:42:25.396-05:00 | [Stage 1 (javaToPython at 
NativeMethodAccessorImpl.java:0):> (9 + 107) / 116]
     | 2022-03-15T06:42:26.894-05:00 | [Stage 1 (javaToPython at 
NativeMethodAccessorImpl.java:0):==> (100 + 16) / 116]
     | 2022-03-15T06:42:28.379-05:00 | [Stage 3 (collect at 
/tmp/stage-to-raw-etl-glue-job-poc-1.py:105):()(83 + 33) / 116]
     | 2022-03-15T06:43:06.306-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(0 + 116) / 117]
     | 2022-03-15T06:43:12.079-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(1 + 116) / 117]
     | 2022-03-15T06:43:19.568-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(3 + 114) / 117]
     | 2022-03-15T06:43:22.290-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(3 + 114) / 117]
     | 2022-03-15T06:43:23.907-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(5 + 112) / 117]
     | 2022-03-15T06:43:25.782-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(5 + 112) / 117]
     | 2022-03-15T06:43:27.302-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(10 + 107) / 117]
     | 2022-03-15T06:43:31.700-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(11 + 106) / 117]
     | 2022-03-15T06:43:35.656-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(12 + 105) / 117]
     | 2022-03-15T06:44:05.906-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(13 + 104) / 117]
     | 2022-03-15T06:44:15.512-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(13 + 104) / 117]
     | 2022-03-15T06:44:20.853-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(16 + 101) / 117]
     | 2022-03-15T06:44:53.649-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(17 + 100) / 117]
     | 2022-03-15T06:45:08.909-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(17 + 100) / 117]
     | 2022-03-15T06:45:14.894-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(19 + 98) / 117]
     | 2022-03-15T06:45:26.452-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(20 + 97) / 117]
     | 2022-03-15T06:45:36.321-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(21 + 96) / 117]
     | 2022-03-15T06:45:38.712-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(21 + 96) / 117]
     | 2022-03-15T06:45:41.935-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(23 + 94) / 117]
     | 2022-03-15T06:45:43.177-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(25 + 92) / 117]
     | 2022-03-15T06:45:48.221-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(26 + 91) / 117]
     | 2022-03-15T06:46:00.568-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(28 + 89) / 117]
     | 2022-03-15T06:46:06.605-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(28 + 89) / 117]
     | 2022-03-15T06:46:10.268-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(29 + 88) / 117]
     | 2022-03-15T06:46:12.260-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(31 + 86) / 117]
     | 2022-03-15T06:46:15.381-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(33 + 84) / 117]
     | 2022-03-15T06:46:18.751-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(34 + 83) / 117]
     | 2022-03-15T06:46:20.892-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(36 + 81) / 117]
     | 2022-03-15T06:46:22.348-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(36 + 81) / 117]
     | 2022-03-15T06:46:23.410-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(38 + 79) / 117]
     | 2022-03-15T06:46:25.525-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(40 + 77) / 117]
     | 2022-03-15T06:46:27.795-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(45 + 72) / 117]
     | 2022-03-15T06:46:29.397-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(46 + 71) / 117]
     | 2022-03-15T06:46:30.493-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(49 + 68) / 117]
     | 2022-03-15T06:46:31.580-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(50 + 67) / 117]
     | 2022-03-15T06:46:33.033-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(53 + 64) / 117]
     | 2022-03-15T06:46:34.046-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(59 + 58) / 117]
     | 2022-03-15T06:46:35.666-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(64 + 53) / 117]
     | 2022-03-15T06:46:36.910-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(66 + 51) / 117]
     | 2022-03-15T06:46:38.815-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(68 + 49) / 117]
     | 2022-03-15T06:46:40.080-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(69 + 48) / 117]
     | 2022-03-15T06:46:41.088-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(72 + 45) / 117]
     | 2022-03-15T06:46:42.917-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(76 + 41) / 117]
     | 2022-03-15T06:46:44.805-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(80 + 37) / 117]
     | 2022-03-15T06:46:46.980-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(84 + 33) / 117]
     | 2022-03-15T06:46:48.120-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(86 + 31) / 117]
     | 2022-03-15T06:46:49.728-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(89 + 28) / 117]
     | 2022-03-15T06:46:50.893-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(90 + 27) / 117]
     | 2022-03-15T06:46:52.211-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(93 + 24) / 117]
     | 2022-03-15T06:46:53.544-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(97 + 20) / 117]
     | 2022-03-15T06:46:54.871-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(102 + 15) / 117]
     | 2022-03-15T06:46:55.892-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(105 + 12) / 117]
     | 2022-03-15T06:46:58.431-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(107 + 10) / 117]
     | 2022-03-15T06:46:59.667-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(110 + 7) / 117]
     | 2022-03-15T06:47:02.750-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(113 + 4) / 117]
     | 2022-03-15T06:47:14.859-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(114 + 3) / 117]
     | 2022-03-15T06:47:17.427-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(116 + 1) / 117]
     | 2022-03-15T06:47:21.640-05:00 | [Stage 6 (countByKey at 
BaseSparkCommitActionExecutor.java:175):()(117 + 0) / 117]
     | 2022-03-15T06:47:27.011-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (0 + 117) / 117]
     | 2022-03-15T06:47:28.283-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (2 + 115) / 117]
     | 2022-03-15T06:47:29.508-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (6 + 111) / 117]
     | 2022-03-15T06:47:30.576-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):()(10 + 107) / 117]
     | 2022-03-15T06:47:32.991-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):()(15 + 102) / 117]
     | 2022-03-15T06:47:34.455-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (18 + 99) / 117]
     | 2022-03-15T06:47:35.560-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (23 + 94) / 117]
     | 2022-03-15T06:47:36.575-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (29 + 88) / 117]
     | 2022-03-15T06:47:37.689-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (41 + 76) / 117]
     | 2022-03-15T06:47:38.788-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (64 + 53) / 117]
     | 2022-03-15T06:47:39.865-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (89 + 28) / 117]
     | 2022-03-15T06:47:40.927-05:00 | [Stage 9 (mapToPair at 
BaseSparkCommitActionExecutor.java:209):> (110 + 7) / 117]
     | 2022-03-15T06:47:58.272-05:00 | [Stage 12 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (1 + 3) / 4]
     | 2022-03-15T06:48:04.654-05:00 | [Stage 14 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (1 + 19) / 20]
     | 2022-03-15T06:48:06.493-05:00 | [Stage 14 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (3 + 17) / 20]
     | 2022-03-15T06:48:16.497-05:00 | [Stage 16 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (0 + 100) / 100]
     | 2022-03-15T06:48:17.592-05:00 | [Stage 16 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (35 + 65) / 100]
     | 2022-03-15T06:48:19.544-05:00 | [Stage 16 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (37 + 63) / 100]
     | 2022-03-15T06:48:20.588-05:00 | [Stage 16 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (51 + 49) / 100]
     | 2022-03-15T06:48:21.593-05:00 | [Stage 16 (isEmpty at 
HoodieSparkSqlWriter.scala:609):======> (77 + 23) / 100]
     | 2022-03-15T06:48:23.827-05:00 | [Stage 16 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=========> (93 + 7) / 100]
     | 2022-03-15T06:48:37.952-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (0 + 117) / 500]
     | 2022-03-15T06:48:38.995-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (68 + 117) / 500]
     | 2022-03-15T06:48:40.006-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (110 + 117) / 500]
     | 2022-03-15T06:48:51.385-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (116 + 117) / 500]
     | 2022-03-15T06:48:52.411-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=> (155 + 117) / 500]
     | 2022-03-15T06:48:53.421-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (214 + 117) / 500]
     | 2022-03-15T06:48:55.455-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (231 + 116) / 500]
     | 2022-03-15T06:49:04.055-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (232 + 117) / 500]
     | 2022-03-15T06:49:05.162-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (237 + 117) / 500]
     | 2022-03-15T06:49:06.169-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):===> (287 + 117) / 500]
     | 2022-03-15T06:49:07.176-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (320 + 117) / 500]
     | 2022-03-15T06:49:08.304-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (345 + 117) / 500]
     | 2022-03-15T06:49:17.407-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (348 + 116) / 500]
     | 2022-03-15T06:49:18.449-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (349 + 117) / 500]
     | 2022-03-15T06:49:19.450-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=====> (386 + 114) / 500]
     | 2022-03-15T06:49:20.457-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):======> (421 + 79) / 500]
     | 2022-03-15T06:49:21.528-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=======> (452 + 48) / 500]
     | 2022-03-15T06:49:24.903-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=======> (465 + 35) / 500]
     | 2022-03-15T06:49:26.307-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=======> (466 + 34) / 500]
     | 2022-03-15T06:49:27.724-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=======> (470 + 30) / 500]
     | 2022-03-15T06:49:31.513-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=======> (481 + 19) / 500]
     | 2022-03-15T06:49:32.746-05:00 | [Stage 18 (isEmpty at 
HoodieSparkSqlWriter.scala:609):========> (496 + 4) / 500]
     | 2022-03-15T06:49:46.207-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (0 + 117) / 473]
     | 2022-03-15T06:49:47.211-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (61 + 117) / 473]
     | 2022-03-15T06:49:48.305-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (114 + 117) / 473]
     | 2022-03-15T06:49:59.447-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):> (116 + 117) / 473]
     | 2022-03-15T06:50:00.457-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=> (158 + 117) / 473]
     | 2022-03-15T06:50:01.463-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (205 + 117) / 473]
     | 2022-03-15T06:50:02.652-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (231 + 117) / 473]
     | 2022-03-15T06:50:12.642-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):==> (232 + 117) / 473]
     | 2022-03-15T06:50:13.642-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):===> (261 + 117) / 473]
     | 2022-03-15T06:50:14.654-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (300 + 117) / 473]
     | 2022-03-15T06:50:15.668-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (335 + 117) / 473]
     | 2022-03-15T06:50:17.172-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (347 + 117) / 473]
     | 2022-03-15T06:50:25.789-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):====> (348 + 116) / 473]
     | 2022-03-15T06:50:26.840-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=====> (372 + 101) / 473]
     | 2022-03-15T06:50:27.869-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):======> (406 + 67) / 473]
     | 2022-03-15T06:50:28.903-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=======> (445 + 28) / 473]
     | 2022-03-15T06:50:31.957-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):========> (465 + 8) / 473]
     | 2022-03-15T06:50:33.009-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):========> (468 + 5) / 473]
     | 2022-03-15T06:50:34.267-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):========> (470 + 3) / 473]
     | 2022-03-15T06:50:35.294-05:00 | [Stage 20 (isEmpty at 
HoodieSparkSqlWriter.scala:609):=========>(473 + 0) / 473]
     | 2022-03-15T06:50:36.433-05:00 | [Stage 22 (collect at 
SparkRDDWriteClient.java:123):=====> (773 + 115) / 1098]
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to