[
https://issues.apache.org/jira/browse/IMPALA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18020493#comment-18020493
]
ASF subversion and git services commented on IMPALA-14349:
----------------------------------------------------------
Commit 68ab52f2c770c233bc1e287b6d3c40df1cdc8775 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=68ab52f2c ]
IMPALA-14437: Fix regression in FileMetadataLoader.createFd()
IMPALA-14349 caused a regression due to a change in
FileMetadataLoader.createFd(). When the default FS is S3, files in S3
should not have any FileBlocks. However, after IMPALA-14349, a CTAS
query that scans the functional.alltypes table in S3 hit the following
Precondition in HdfsScanNode.java:
if (!fsHasBlocks) {
  Preconditions.checkState(fileDesc.getNumFileBlocks() == 0);
}
This is because FileMetadataLoader.createFd() skipped checking whether
the originating FileSystem supports storage IDs via
supportsStorageIds(). S3 data loading from an HDFS snapshot consistently
failed due to this regression.
This patch fixes the issue by restoring FileMetadataLoader.createFd() to
its state before IMPALA-14349. It also makes
FileMetadataLoader.createFd() calls more consistent by disallowing null
parameters, except for 'absPath', which is non-null only for Iceberg
data files. It also generalizes the numUnknownDiskIds parameter from
Reference<Long> to AtomicLong for parallel usage.
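As an aside on the last point, here is a minimal standalone sketch (hypothetical code, not Impala's actual implementation) of why a counter shared across parallel listing tasks should be an AtomicLong rather than a single-slot boxed-Long holder, whose read-modify-write would race:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class UnknownDiskIdCounter {
    // Thread-safe counter: many loader threads can increment it without locks.
    static long countUnknownDiskIds(List<String> storageIds) {
        AtomicLong numUnknownDiskIds = new AtomicLong();
        // Simulated parallel listing: each "file" checks its storage ID
        // concurrently; incrementAndGet() is atomic, so no updates are lost.
        storageIds.parallelStream().forEach(id -> {
            if (id == null || id.isEmpty()) {
                numUnknownDiskIds.incrementAndGet();
            }
        });
        return numUnknownDiskIds.get();
    }

    public static void main(String[] args) {
        // Two of the four entries have no storage ID.
        System.out.println(countUnknownDiskIds(List.of("DS-1", "", "DS-2", "")));
        // prints 2
    }
}
```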
Testing:
Pass dataloading, FE_TEST, EE_TEST, and CLUSTER_TEST in S3.
Change-Id: Ie16c5d7b020a59b5937b52dfbf66175ac94f60cd
Reviewed-on: http://gerrit.cloudera.org:8080/23423
Reviewed-by: Zoltan Borok-Nagy <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Encode FileDescriptors in time in loading Iceberg Tables
> --------------------------------------------------------
>
> Key: IMPALA-14349
> URL: https://issues.apache.org/jira/browse/IMPALA-14349
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: iceberg
> Fix For: Impala 5.0.0
>
>
> When loading file metadata of an IcebergTable in
> IcebergFileMetadataLoader#loadInternal() -> parallelListing(), we maintain a
> map from paths to FileStatus objects:
> [https://github.com/apache/impala/blob/50926b5d8e941c5cc10fd77d0b4556e3441c41e7/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L171]
> This map consumes a lot of memory since the loaded FileStatus objects are
> of type HdfsLocatedFileStatus and each of them takes about 6KB of memory.
> E.g.
> {noformat}
> Class Name | Shallow Heap | Retained Heap
> ----------------------------------------------------------------------------------------------------------------------------------------
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1008511620 | 120 | 6,192
> |- <class> class org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1009e2a058 | 16 | 40
> |- isdir java.lang.Boolean @ 0x10056a7638 false | 16 | 16
> |- path org.apache.hadoop.fs.Path @ 0x1008511310 | 16 | 784
> |- permission org.apache.hadoop.hdfs.protocol.FsPermissionExtension @ 0x1008511698 | 32 | 32
> |- owner java.lang.String @ 0x10085116b8 id971832 | 24 | 48
> |- group java.lang.String @ 0x10085116e8 hive | 24 | 48
> |- attr java.util.RegularEnumSet @ 0x1008511718 | 32 | 32
> |- locations org.apache.hadoop.fs.BlockLocation[1] @ 0x1008511738 | 24 | 192
> |- uPath byte[62] @ 0x1008511838 00668-28396-9dd59fc9-3ed9-40ca-8f39-e68bd2724c14-00040.parquet | 80 | 80
> |- hdfsloc org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1008511888 | 40 | 5,576
> |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1009e20278 | 8 | 512
> |  |- blocks java.util.ArrayList @ 0x10085118b0 | 24 | 2,760
> |  |  |- <class> class java.util.ArrayList @ 0x100573da10 System Class | 32 | 240
> |  |  |- elementData java.lang.Object[1] @ 0x10085118c8 | 24 | 2,736
> |  |  |  |- class java.lang.Object[] @ 0x1005fc4650 | 0 | 0
> |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x10085118e0 | 48 | 2,712
> |  |  |  |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
> |  |  |  |  |- storageIDs java.lang.String[3] @ 0x10085117f8 | 32 | 32
> |  |  |  |  |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008511818 | 32 | 32
> |  |  |  |  |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x1008511910 | 24 | 64
> |  |  |  |  |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x1008511950 | 32 | 2,456
> |  |  |  |  |  |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
> |  |  |  |  |  |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511970 | 200 | 808
> |  |  |  |  |  |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511c98 | 200 | 808
> |  |  |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511fc0 | 200 | 808
> |  |  |  |  |  '- Total: 4 entries | |
> |  |  |  |  |- blockToken org.apache.hadoop.security.token.Token @ 0x10085122e8 | 32 | 144
> |  |  |  |  |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
> |  |  |  |  '- Total: 7 entries | |
> |  |  |  '- Total: 2 entries | |
> |  |  '- Total: 2 entries | |
> |  |- lastLocatedBlock org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1008512378 | 48 | 2,776
> |  |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
> |  |  |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x10085123a8 | 24 | 64
> |  |  |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x10085123e8 | 32 | 2,216
> |  |  |  |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
> |  |  |  |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512408 | 200 | 728
> |  |  |  |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512730 | 200 | 728
> |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512a58 | 200 | 728
> |  |  |  |  |- <class> class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x102005aee8 | 8 | 104
> |  |  |  |  |- ipAddr java.lang.String @ 0x1008512b20 xxx.xxx.xxx.xxx | 24 | 56
> |  |  |  |  |- ipAddrBytes com.google.protobuf.LiteralByteString @ 0x1008512b58 | 24 | 56
> |  |  |  |  |- hostName java.lang.String @ 0x1008512b90 www.abc.com | 24 | 56
> |  |  |  |  |- hostNameBytes com.google.protobuf.LiteralByteString @ 0x1008512bc8 | 24 | 56
> |  |  |  |  |- xferAddr java.lang.String @ 0x1008512c00 xxx.xxx.xxx.xxx:9866 | 24 | 64
> |  |  |  |  |- datanodeUuid java.lang.String @ 0x1008512c40 2f6e6e42-9347-4370-a318-79efdadcc3cf | 24 | 80
> |  |  |  |  |- datanodeUuidBytes com.google.protobuf.LiteralByteString @ 0x1008512c90 | 24 | 80
> |  |  |  |  |- location java.lang.String @ 0x1008512ce0 /default | 24 | 48
> |  |  |  |  |- dependentHostNames java.util.LinkedList @ 0x1008512d10 | 32 | 32
> |  |  |  |  |- storageID java.lang.String @ 0x1008512d30 DS-f190d2ef-755b-4f73-bb3d-67b6e72805e2 | 24 | 80
> |  |  |  |  |- adminState org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates @ 0x101b01ef50 NORMAL | 24 | 24
> |  |  |  |  |- storageType org.apache.hadoop.fs.StorageType @ 0x101b01f000 DISK | 24 | 24
> |  |  |  |  '- Total: 13 entries | |
> |  |  |  '- Total: 4 entries | |
> |  |  |- storageIDs java.lang.String[3] @ 0x1008512d80 | 32 | 32
> |  |  |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008512da0 | 32 | 32
> |  |  |- blockToken org.apache.hadoop.security.token.Token @ 0x1008512dc0 | 32 | 144
> |  |  |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
> |  |  '- Total: 7 entries | |
> |  '- Total: 3 entries | |
> '- Total: 10 entries
> {noformat}
> There are some duplicate strings, such as storageIDs and hostnames.
> We could invoke String.intern() on them to save some space. But it'd be
> better to convert these FileStatus objects into IcebergFileDescriptor
> objects promptly, as they are listed, to reduce the space usage.
> Promptly encoding each IcebergFileDescriptor into bytes (which usually
> takes about 200 bytes per file) can save even more space.
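The String.intern() idea mentioned above can be illustrated with a small standalone sketch (hypothetical code, not Impala or Hadoop internals): when many deserialized statuses carry copies of the same hostname or storage ID, interning collapses them onto the JVM's single canonical String instance.

```java
public class InternDemo {
    public static void main(String[] args) {
        // Simulate the same hostname arriving from two independently
        // deserialized FileStatus objects: distinct String instances.
        String a = new String("datanode-01.example.com");
        String b = new String("datanode-01.example.com");
        System.out.println(a == b);                   // false: two copies in memory
        // intern() returns the canonical pooled instance, so both
        // references now point at one shared String.
        System.out.println(a.intern() == b.intern()); // true: one shared copy
    }
}
```

With thousands of block locations referring to a handful of datanodes, sharing one instance per distinct value is where the savings come from, although, as noted above, converting to a compact encoded descriptor saves even more.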
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]