Hi Team,

Recently we have been playing around with 100 GB TPC-DS queries on Iceberg-backed 
Hive tables.
We have found that queries accessing big partitioned tables have a very slow 
compilation time. One example is query77, where compilation took ~20 minutes:

INFO  : Completed compiling 
command(queryId=hive_20210219113124_c398b956-a507-4a82-82fc-c35d97acd3c2); Time 
taken: 1190.796 seconds 

I ran some preliminary tests, and the main bottleneck seems to be the 
Catalogs.loadTable() method.

As another example, I checked the 
TestHiveIcebergStorageHandlerWithEngine.testInsert() test and found that for 
a simple insert we load the table 9 times.
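
In case anyone wants to reproduce the numbers, something like the throwaway 
helper below is enough. The class name is made up, and it only relies on the 
generic Catalog.loadTable(TableIdentifier) API rather than the exact code path 
Hive takes:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical helper for measuring only: counts loadTable() calls and the
// total wall-clock time spent in them during a single statement.
public class LoadTableMeter {
  private static final AtomicInteger CALLS = new AtomicInteger();
  private static final AtomicLong NANOS = new AtomicLong();

  public static Table load(Catalog catalog, TableIdentifier id) {
    long start = System.nanoTime();
    try {
      return catalog.loadTable(id);
    } finally {
      CALLS.incrementAndGet();
      NANOS.addAndGet(System.nanoTime() - start);
    }
  }

  public static String report() {
    return String.format("loadTable called %d times, %.1f ms total",
        CALLS.get(), NANOS.get() / 1_000_000.0);
  }
}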

I will definitely dig into the details of how to decrease the number of times 
we load the table, but I have also started to look around in the codebase for a 
way to cache the tables. This is how I found the 
org.apache.iceberg.CachingCatalog class. After some testing I found that the 
implementation lacks some features we need:
- When issuing a CachingCatalog.loadTable() it does not refresh the table but 
returns it in the last seen state, contrary to the default behavior of the 
other Catalog implementations (see the sketch below)
- When some outside process drops the table we do not notice it - this causes 
problems, for example, when a table is dropped and recreated
- I am just guessing, but I do not think we can share the Table objects between 
threads
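
To illustrate the first point, here is a rough sketch of what I mean. 
HadoopCatalog and the property update are just stand-ins I picked to keep the 
example self-contained - we see the same behavior with HiveCatalog:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CachingCatalog;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class CachingCatalogStaleness {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Stand-in catalog so the example is self-contained
    Catalog hadoop = new HadoopCatalog(conf, "/tmp/iceberg-warehouse");
    Catalog cached = CachingCatalog.wrap(hadoop);

    TableIdentifier id = TableIdentifier.of("db", "tbl");
    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));
    hadoop.createTable(id, schema);

    // The first load populates the cache
    Table cachedTable = cached.loadTable(id);

    // An "outside" change committed through a non-cached catalog handle
    hadoop.loadTable(id).updateProperties().set("owner", "someone-else").commit();

    // The caching catalog hands back the previously loaded state: the new
    // property is not visible until we call refresh() ourselves
    System.out.println(cached.loadTable(id).properties().get("owner"));  // null
    cachedTable.refresh();
    System.out.println(cachedTable.properties().get("owner"));           // someone-else
  }
}

The same staleness is behind the second point: if the outside change is a 
drop and recreate, the cache keeps handing back the Table object of the 
dropped table.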

Are the things above bugs, or are they features meant to ensure that we use the 
same table snapshot during execution?

Shall we try to fix these issues, or would we rather add a metadata cache layer 
instead? Caching immutable metadata is much easier and less error-prone, and it 
probably also solves the main bottleneck (reading the metadata files from S3 
and parsing them).
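
Roughly what I have in mind - very much a sketch, the class and method names 
are made up - is to key the cache on the metadata file location, since a 
metadata JSON file never changes once it has been written:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.io.FileIO;

// Sketch of a process-wide cache for immutable table metadata.
// Metadata JSON files are written once and never modified, so the metadata
// file location is a safe cache key - the entries never become stale.
public class MetadataFileCache {
  private static final Map<String, TableMetadata> CACHE = new ConcurrentHashMap<>();

  public static TableMetadata read(FileIO io, String metadataLocation) {
    return CACHE.computeIfAbsent(
        metadataLocation,
        location -> TableMetadataParser.read(io, location));
  }
}

In a real implementation we would want size or time based eviction instead of 
an unbounded map, but the nice part is that the entries themselves never need 
to be invalidated.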

Thanks,
Peter
