aznwarmonkey edited a comment on issue #4541: URL: https://github.com/apache/hudi/issues/4541#issuecomment-1008283680
> Hi,
>
> After making the changes suggested above, the updated config looks like this:
>
> ```python
> hudi_options = {
>     'hoodie.table.name': table_name,
>     'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
>     'hoodie.datasource.write.recordkey.field': keys,
>     'hoodie.datasource.write.partitionpath.field': ','.join(partitions),
>     'hoodie.datasource.write.hive_style_partitioning': True,
>     'hoodie.datasource.write.table.name': table_name,
>     'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
>     'hoodie.datasource.write.precombine.field': timestamp_col,
>     'hoodie.index.type': 'BLOOM',
>     'hoodie.consistency.check.enabled': True,
>     'hoodie.parquet.small.file.limit': 134217728,
>     'hoodie.parquet.max.file.size': 1073741824,
>     'write.bulk_insert.shuffle_by_partition': True,
>     'hoodie.datasource.write.row.writer.enable': True,
>     'hoodie.bulkinsert.sort.mode': 'PARTITION_SORT',
>     'hoodie.bulkinsert.shuffle.parallelism': num_partitons,
>     'hoodie.cleaner.commits.retained': '1',
>     'hoodie.clean.async': True,
> }
>
> df.write.format('org.apache.hudi') \
>     .option('hoodie.datasource.write.operation', 'bulk_insert') \
>     .options(**hudi_options).mode('append').save(output_path)
> ```
>
> A couple of questions, and what I am hoping to accomplish:
>
> - `write.parquet.block.size`: is this in bytes or MB? I believe the default is 120: https://hudi.apache.org/docs/configurations/#writeparquetblocksize
> - When writing, I noticed I am getting a bunch of small parquet files (~20-50 MB), which I am trying to avoid; the desired size is at least 128 MB. Tuning `small.file.limit` and `max.file.size` seems to have no noticeable impact, which is why clustering is important (the hope is that the next time it writes to the same partition, the small files get combined).
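
For reference, the size values in the config above can be checked with a little arithmetic. The sketch below assumes the `hoodie.parquet.*` size options are byte counts, which is how the values 134217728 and 1073741824 in the comment read (128 MiB and 1 GiB). The helper names `mib_to_bytes` and `bytes_to_mib` are made up for illustration. The clustering keys are my understanding of the option names in the Hudi configuration reference and the value `'4'` is an arbitrary placeholder, so treat that dictionary as an assumption to verify against the Hudi version in use, not as a recommended setup.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to a byte count."""
    return mib * MIB


def bytes_to_mib(num_bytes: int) -> float:
    """Convert a byte count back to MiB, for reading config values."""
    return num_bytes / MIB


# Same two values as in the comment, written in terms of MiB.
# Assumption: these options are interpreted as bytes.
size_options = {
    'hoodie.parquet.small.file.limit': mib_to_bytes(128),
    'hoodie.parquet.max.file.size': mib_to_bytes(1024),
}

# Hypothetical inline-clustering settings (option names should be checked
# against the Hudi docs for your version; the values are placeholders).
clustering_options = {
    'hoodie.clustering.inline': 'true',
    'hoodie.clustering.inline.max.commits': '4',
    'hoodie.clustering.plan.strategy.small.file.limit': str(mib_to_bytes(128)),
    'hoodie.clustering.plan.strategy.target.file.max.bytes': str(mib_to_bytes(1024)),
}

if __name__ == '__main__':
    for key, value in size_options.items():
        print(f'{key} = {value} bytes ({bytes_to_mib(value):.0f} MiB)')
```

If these turn out to be appropriate, they could be merged into the existing dictionary with something like `{**hudi_options, **size_options, **clustering_options}` before calling `.options(...)`.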