Hello Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/22157
to look at the new patch set (#2).
Change subject: IMPALA-11265: Part3: Group the Iceberg file descriptors by
partition
......................................................................
IMPALA-11265: Part3: Group the Iceberg file descriptors by partition
Originally, IcebergContentFileStore organized the file descriptors
into a map where the keys are the file path hashes and the values are
the file descriptors. This gives a very fast lookup for a particular
file descriptor, but the structure is memory-hungry.
One example:
Test table has 16400 partitions and 110k data files.
The catalogd JVM memory usage of the table is 61.8 MB, of which the
file path hash strings take 11.44 MB, or 18.5% of the memory usage of
the whole table.
This patch reduces the catalogd JVM memory usage of Iceberg tables
by restructuring the IcebergContentFileStore to map each partition to
a list of file descriptors. Note that HdfsTable also holds the
per-partition file descriptors in a list.
This trades some file descriptor lookup speed for a smaller JVM memory
footprint for Iceberg tables.
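For illustration only, the two layouts can be sketched roughly as
below. This is not the code in this patch: the class names, the
FileDesc type, the integer partition id and the linear scan in
lookup() are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual IcebergContentFileStore code.
public class FileStoreSketch {
  // Stand-in for a real file descriptor (hypothetical type).
  public static final class FileDesc {
    public final String path;
    public final int partitionId;
    public FileDesc(String path, int partitionId) {
      this.path = path;
      this.partitionId = partitionId;
    }
  }

  // Old layout: one map entry, and one hash-string key, per file.
  public static final class HashKeyedStore {
    private final Map<String, FileDesc> byPathHash = new HashMap<>();
    public void add(String pathHash, FileDesc fd) { byPathHash.put(pathHash, fd); }
    public FileDesc lookup(String pathHash) { return byPathHash.get(pathHash); }
    public int numKeys() { return byPathHash.size(); }
  }

  // New layout: one map entry per partition, files kept in a list.
  public static final class PartitionedStore {
    private final Map<Integer, List<FileDesc>> byPartition = new HashMap<>();
    public void add(FileDesc fd) {
      byPartition.computeIfAbsent(fd.partitionId, k -> new ArrayList<>()).add(fd);
    }
    // Assumed lookup strategy: scan the list of the given partition.
    public FileDesc lookup(int partitionId, String path) {
      List<FileDesc> files = byPartition.get(partitionId);
      if (files == null) return null;
      for (FileDesc fd : files) {
        if (fd.path.equals(path)) return fd;
      }
      return null;
    }
    public int numKeys() { return byPartition.size(); }
  }
}
```

In this sketch the partitioned store keeps one key per partition
instead of one key string per file, at the cost of a scan on lookup.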
Measurements:
Test table has 16400 partitions and 110k data files.
- Memory usage
The JVM memory size of this table is reduced from 61.8 MB to 48 MB.
Compared to a Hive table with the same characteristics, the memory
size overhead is reduced from 1.55x to 1.23x.
- Table loading times
The time required to do a full metadata load of an Iceberg table is
unchanged by this patch.
- Query planning time #1
I wrote a query that filters on a partition column where all 110k
files survive the filter and have to be looked up in
IcebergContentFileStore. I used a 'WHERE id > 0' predicate where the
table is partitioned by 'id' and all the values are greater than zero.
This query can be considered a worst-case scenario for this table.
With such a query, planning takes ~40% longer: the average planning
time increased from 0.8s to 1.14s. This is still negligible compared
to the full query runtime.
I think this is a regression we can live with.
- Query planning time #2
In another query I used a predicate on a non-partition column. In this
case the Iceberg lib doesn't pre-filter the file descriptors and Impala
simply gets all the file descriptors from the ContentFileStore.
In fact, the average planning time dropped by ~47% with this patch,
from 0.44s to 0.23s.
I think the reason is that the file descriptors are already arranged in
lists and there is only a slight overhead when creating an aggregated
list to return all of them.
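As a rough illustration of why returning everything can be cheap with
per-partition lists, a minimal sketch of such an aggregation follows.
The class and method names are hypothetical and this is not the code in
this patch; it only shows concatenating already-built lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual IcebergContentFileStore code.
public class AggregateSketch {
  // Concatenates the per-partition lists into one pre-sized list.
  public static <T> List<T> allFiles(Map<Integer, List<T>> byPartition) {
    int total = 0;
    for (List<T> files : byPartition.values()) total += files.size();
    List<T> result = new ArrayList<>(total);
    // One bulk copy per partition; no per-file key lookup is needed.
    for (List<T> files : byPartition.values()) result.addAll(files);
    return result;
  }
}
```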
- Query planning time #3
Test #1 was a worst-case scenario where Impala looked up all 110k file
descriptors in the cache. In test #3 I used a predicate that filtered
out 3/4 of the file descriptors. There is still a ~35% degradation in
planning time, but planning remains so fast that I'd consider this
negligible: the average planning time changed from 0.19s to 0.26s.
Change-Id: I276d839335c0aa39fa31a06ce08588a91a313768
---
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/main/java/org/apache/impala/util/IcebergUtil.java
M fe/src/test/java/org/apache/impala/catalog/IcebergContentFileStoreTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/util/IcebergUtilTest.java
7 files changed, 215 insertions(+), 112 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/22157/2
--
To view, visit http://gerrit.cloudera.org:8080/22157
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I276d839335c0aa39fa31a06ce08588a91a313768
Gerrit-Change-Number: 22157
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>