[ https://issues.apache.org/jira/browse/IMPALA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltán Borók-Nagy reassigned IMPALA-14349:
------------------------------------------
Assignee: Zoltán Borók-Nagy
> Encode FileDescriptors in time in loading Iceberg Tables
> --------------------------------------------------------
>
> Key: IMPALA-14349
> URL: https://issues.apache.org/jira/browse/IMPALA-14349
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: iceberg
>
> When loading file metadata of an IcebergTable in
> IcebergFileMetadataLoader#loadInternal() -> parallelListing(), we maintain a
> map from paths to FileStatus objects:
> [https://github.com/apache/impala/blob/50926b5d8e941c5cc10fd77d0b4556e3441c41e7/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L171]
> This map consumes a lot of memory because the loaded FileStatus objects are
> of type HdfsLocatedFileStatus, and each of them retains about 6KB of heap.
> E.g.
> {noformat}
> Class Name | Shallow Heap | Retained Heap
> ----------------------------------------------------------------------------------------------------------------------------------------
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1008511620 | 120 | 6,192
> |- <class> class org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1009e2a058 | 16 | 40
> |- isdir java.lang.Boolean @ 0x10056a7638 false | 16 | 16
> |- path org.apache.hadoop.fs.Path @ 0x1008511310 | 16 | 784
> |- permission org.apache.hadoop.hdfs.protocol.FsPermissionExtension @ 0x1008511698 | 32 | 32
> |- owner java.lang.String @ 0x10085116b8 id971832 | 24 | 48
> |- group java.lang.String @ 0x10085116e8 hive | 24 | 48
> |- attr java.util.RegularEnumSet @ 0x1008511718 | 32 | 32
> |- locations org.apache.hadoop.fs.BlockLocation[1] @ 0x1008511738 | 24 | 192
> |- uPath byte[62] @ 0x1008511838 00668-28396-9dd59fc9-3ed9-40ca-8f39-e68bd2724c14-00040.parquet | 80 | 80
> |- hdfsloc org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1008511888 | 40 | 5,576
> | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1009e20278 | 8 | 512
> | |- blocks java.util.ArrayList @ 0x10085118b0 | 24 | 2,760
> | | |- <class> class java.util.ArrayList @ 0x100573da10 System Class | 32 | 240
> | | |- elementData java.lang.Object[1] @ 0x10085118c8 | 24 | 2,736
> | | | |- class java.lang.Object[] @ 0x1005fc4650 | 0 | 0
> | | | |- [0] org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x10085118e0 | 48 | 2,712
> | | | | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
> | | | | |- storageIDs java.lang.String[3] @ 0x10085117f8 | 32 | 32
> | | | | |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008511818 | 32 | 32
> | | | | |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x1008511910 | 24 | 64
> | | | | |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x1008511950 | 32 | 2,456
> | | | | | |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
> | | | | | |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511970 | 200 | 808
> | | | | | |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511c98 | 200 | 808
> | | | | | |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511fc0 | 200 | 808
> | | | | | '- Total: 4 entries | |
> | | | | |- blockToken org.apache.hadoop.security.token.Token @ 0x10085122e8 | 32 | 144
> | | | | |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
> | | | | '- Total: 7 entries | |
> | | | '- Total: 2 entries | |
> | | '- Total: 2 entries | |
> | |- lastLocatedBlock org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1008512378 | 48 | 2,776
> | | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
> | | |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x10085123a8 | 24 | 64
> | | |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x10085123e8 | 32 | 2,216
> | | | |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
> | | | |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512408 | 200 | 728
> | | | |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512730 | 200 | 728
> | | | |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512a58 | 200 | 728
> | | | | |- <class> class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x102005aee8 | 8 | 104
> | | | | |- ipAddr java.lang.String @ 0x1008512b20 xxx.xxx.xxx.xxx | 24 | 56
> | | | | |- ipAddrBytes com.google.protobuf.LiteralByteString @ 0x1008512b58 | 24 | 56
> | | | | |- hostName java.lang.String @ 0x1008512b90 www.abc.com | 24 | 56
> | | | | |- hostNameBytes com.google.protobuf.LiteralByteString @ 0x1008512bc8 | 24 | 56
> | | | | |- xferAddr java.lang.String @ 0x1008512c00 xxx.xxx.xxx.xxx:9866 | 24 | 64
> | | | | |- datanodeUuid java.lang.String @ 0x1008512c40 2f6e6e42-9347-4370-a318-79efdadcc3cf | 24 | 80
> | | | | |- datanodeUuidBytes com.google.protobuf.LiteralByteString @ 0x1008512c90 | 24 | 80
> | | | | |- location java.lang.String @ 0x1008512ce0 /default | 24 | 48
> | | | | |- dependentHostNames java.util.LinkedList @ 0x1008512d10 | 32 | 32
> | | | | |- storageID java.lang.String @ 0x1008512d30 DS-f190d2ef-755b-4f73-bb3d-67b6e72805e2 | 24 | 80
> | | | | |- adminState org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates @ 0x101b01ef50 NORMAL | 24 | 24
> | | | | |- storageType org.apache.hadoop.fs.StorageType @ 0x101b01f000 DISK | 24 | 24
> | | | | '- Total: 13 entries | |
> | | | '- Total: 4 entries | |
> | | |- storageIDs java.lang.String[3] @ 0x1008512d80 | 32 | 32
> | | |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008512da0 | 32 | 32
> | | |- blockToken org.apache.hadoop.security.token.Token @ 0x1008512dc0 | 32 | 144
> | | |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
> | | '- Total: 7 entries | |
> | '- Total: 3 entries | |
> '- Total: 10 entries{noformat}
> There are some duplicate strings, such as storage IDs and hostnames. We could
> invoke String.intern() on them to save some space, but it would be better to
> convert these FileStatus objects into IcebergFileDescriptor objects promptly
> to reduce the space usage. Encoding each IcebergFileDescriptor into bytes
> (which usually takes about 200 bytes per file) promptly would save even more
> space.
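
The idea in the last paragraph can be illustrated with a minimal Java sketch. It is not Impala's implementation: `EagerEncodingSketch`, `ListedFile`, and the byte layout are all hypothetical names and choices made up for this example, and the real IcebergFileDescriptor encoding may look quite different. The sketch only shows the general technique of encoding each listed file into a compact byte array as soon as it is seen, and replacing repeated host strings with small integer indexes, so that the heavyweight listing object does not have to stay in a map.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EagerEncodingSketch {
  /** Hypothetical stand-in for a heavyweight listing result; not Hadoop's FileStatus. */
  static final class ListedFile {
    final String path;
    final long length;
    final long modificationTime;
    final List<String> hosts;

    ListedFile(String path, long length, long modificationTime, List<String> hosts) {
      this.path = path;
      this.length = length;
      this.modificationTime = modificationTime;
      this.hosts = hosts;
    }
  }

  // Maps each distinct host name to a small integer so the string is kept only once.
  private final Map<String, Integer> hostIndex = new HashMap<>();
  // Compact per-file encodings, keyed by path (replaces a map of heavyweight objects).
  private final Map<String, byte[]> encodedByPath = new HashMap<>();

  /** Encodes one listed file immediately; the caller can then drop the ListedFile. */
  void add(ListedFile f) {
    // Made-up layout: length (8 bytes), mtime (8 bytes), host count (4), host indexes (4 each).
    ByteBuffer buf =
        ByteBuffer.allocate(Long.BYTES * 2 + Integer.BYTES * (1 + f.hosts.size()));
    buf.putLong(f.length).putLong(f.modificationTime).putInt(f.hosts.size());
    for (String host : f.hosts) {
      Integer idx = hostIndex.get(host);
      if (idx == null) {
        idx = hostIndex.size();
        hostIndex.put(host, idx);
      }
      buf.putInt(idx);
    }
    encodedByPath.put(f.path, buf.array());
  }

  int encodedSize(String path) {
    return encodedByPath.get(path).length;
  }

  int distinctHosts() {
    return hostIndex.size();
  }

  public static void main(String[] args) {
    EagerEncodingSketch sketch = new EagerEncodingSketch();
    List<String> hosts = List.of("dn1", "dn2", "dn3");
    sketch.add(new ListedFile("/warehouse/t/a.parquet", 1024L, 1L, hosts));
    sketch.add(new ListedFile("/warehouse/t/b.parquet", 2048L, 2L, hosts));
    System.out.println(sketch.encodedSize("/warehouse/t/a.parquet") + " bytes for one file, "
        + sketch.distinctHosts() + " distinct hosts");
  }
}
```

With three replicas per file, each entry in this toy layout takes 32 bytes plus the path key, and the host names are stored once no matter how many files reference them. The same reasoning, applied to the figures in the description, suggests a reduction from roughly 6KB to roughly 200 bytes per file.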