[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101207#comment-17101207
 ] 

Yanjia Gary Li commented on HUDI-494:
-------------------------------------

Commit 1:
{code:java}

  "partitionToWriteStats" : {
    "year=2020/month=5/day=0/hour=0" : [ {
      "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
      "path" : "year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-112-1773_20200504101048.parquet",
      "prevCommit" : "null",
      "numWrites" : 21,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 21,
      "totalWriteBytes" : 14397559,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=5/day=0/hour=0",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 14397559
    }
{code}
Commit 2:
{code:java}
  "partitionToWriteStats" : {
    "year=2020/month=5/day=0/hour=0" : [ {
      "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
      "path" : "year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-248-163129_20200505023830.parquet",
      "prevCommit" : "20200504101048",
      "numWrites" : 12817,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 12796,
      "totalWriteBytes" : 16297335,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=5/day=0/hour=0",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 16297335
    }, {
      "fileId" : "9d0c9e79-00dd-41d2-a217-0944f8428e1c-0",
      "path" : "year=2020/month=5/day=0/hour=0/9d0c9e79-00dd-41d2-a217-0944f8428e1c-0_1-248-163130_20200505023830.parquet",
      "prevCommit" : "null",
      "numWrites" : 200,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 200,
      "totalWriteBytes" : 14428883,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=5/day=0/hour=0",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 14428883
    }, {
      "fileId" : "5990beb4-bd0c-40c9-84f1-a4107287971e-0",
      "path" : "year=2020/month=5/day=0/hour=0/5990beb4-bd0c-40c9-84f1-a4107287971e-0_2-248-163131_20200505023830.parquet",
      "prevCommit" : "null",
      "numWrites" : 198,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 198,
      "totalWriteBytes" : 14428338,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=5/day=0/hour=0",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 14428338
    }, {
      "fileId" : "673c5550-39c3-4611-ac68-bc0c7da065e2-0",
      "path" : "year=2020/month=5/day=0/hour=0/673c5550-39c3-4611-ac68-bc0c7da065e2-0_3-248-163132_20200505023830.parquet",
      "prevCommit" : "null",
      "numWrites" : 179,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 179,
      "totalWriteBytes" : 14425571,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=5/day=0/hour=0",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 14425571
    }
{code}
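
To compare the two commits more easily, the write stats can be summarized per file. The following is a minimal Python sketch and not part of Hudi: the helper name summarize_write_stats and the derived bytesPerRecord value are assumptions made for illustration, and the sample is abbreviated from the Commit 2 metadata above to only the fields the sketch reads.

```python
import json


def summarize_write_stats(commit_metadata):
    """Flatten partitionToWriteStats into one row per written file.

    Hypothetical helper for illustration; it only reads fields that
    appear in the commit metadata excerpts above.
    """
    rows = []
    for partition, stats in commit_metadata["partitionToWriteStats"].items():
        for stat in stats:
            writes = stat["numWrites"]
            size = stat["fileSizeInBytes"]
            rows.append({
                "partition": partition,
                "fileId": stat["fileId"],
                "prevCommit": stat["prevCommit"],
                "numWrites": writes,
                "numInserts": stat["numInserts"],
                "fileSizeInBytes": size,
                # Derived value (an assumption of this sketch, not a Hudi field).
                "bytesPerRecord": size / writes if writes else None,
            })
    return rows


# Abbreviated sample: two of the four file entries from Commit 2.
sample = json.loads("""
{
  "partitionToWriteStats": {
    "year=2020/month=5/day=0/hour=0": [
      {
        "fileId": "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
        "prevCommit": "20200504101048",
        "numWrites": 12817,
        "numInserts": 12796,
        "fileSizeInBytes": 16297335
      },
      {
        "fileId": "9d0c9e79-00dd-41d2-a217-0944f8428e1c-0",
        "prevCommit": "null",
        "numWrites": 200,
        "numInserts": 200,
        "fileSizeInBytes": 14428883
      }
    ]
  }
}
""")

for row in summarize_write_stats(sample):
    print(row["fileId"], row["numWrites"], row["fileSizeInBytes"], row["bytesPerRecord"])
```

Run against the full commit files, a summary like this makes it easy to spot files such as the three new ones in Commit 2, each of which holds about 200 records or fewer in roughly 14 MB.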
 

 

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
>                 Key: HUDI-494
>                 URL: https://issues.apache.org/jira/browse/HUDI-494
>             Project: Apache Hudi (incubating)
>          Issue Type: Test
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Major
>         Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manual build of master from after commit 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65].
> EDIT: I tried the latest master and got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. This seems to be related to the input size: with 7.7 GB of input there 
> were 3.2 million tasks, and with 9 GB there were 3.7 million, both with 
> parallelism set to 10. 
> I am also seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in HDFS. In the Spark UI, each task writes fewer than 
> 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess is that it is related to the bloom filter index. Maybe 
> something triggers a repartitioning with the bloom filter index? I am 
> not really familiar with that part of the code, though. 
> Thanks
>  



