Re: CachingCatalog question

Peter Vary Thu, 25 Feb 2021 22:19:02 -0800

Hi Nicolas,

Sadly, no profiling info yet. At the time of the run we were checking
correctness and only took note of the perf issue. There are still some
fixes and a few more important base features (CTAS, IOW) we would like to
tackle before we start to focus on performance. In the meantime I am
trying to gather info on possible improvements.


Based on my previous experience with HMS, CachingCatalog seems like a good
candidate, but for me it behaves oddly, so I was trying to understand the
rationale behind it.
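
To make the question concrete, here is a minimal sketch of the alternative
raised in the original mail: caching parsed metadata by file location instead
of caching table objects. It assumes each table version is described by a
metadata file that is never rewritten once written, so an entry can never go
stale. This is not Iceberg code; `MetadataCache`, `ParsedMetadata`, the reader
function and the S3 paths are all hypothetical stand-ins.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: cache parsed, immutable metadata by its file location. */
class MetadataCacheSketch {

  /** Stand-in for a parsed metadata file. All fields are final, so it is immutable. */
  static final class ParsedMetadata {
    final String location;
    final String contents;

    ParsedMetadata(String location, String contents) {
      this.location = location;
      this.contents = contents;
    }
  }

  /** Caches only the expensive read-and-parse step, keyed by metadata location. */
  static final class MetadataCache {
    private final Map<String, ParsedMetadata> byLocation = new ConcurrentHashMap<>();
    private final Function<String, ParsedMetadata> reader;

    MetadataCache(Function<String, ParsedMetadata> reader) {
      this.reader = reader;
    }

    ParsedMetadata get(String metadataLocation) {
      // The reader runs at most once per location; later calls reuse the result.
      return byLocation.computeIfAbsent(metadataLocation, reader);
    }
  }

  public static void main(String[] args) {
    AtomicInteger expensiveReads = new AtomicInteger();
    // Assumption: this lambda stands in for the remote read plus parsing.
    MetadataCache cache = new MetadataCache(location -> {
      expensiveReads.incrementAndGet();
      return new ParsedMetadata(location, "parsed:" + location);
    });

    // Loading the same table version 9 times costs one read.
    String current = "s3://bucket/tbl/metadata/v1.json";
    for (int i = 0; i < 9; i++) {
      cache.get(current);
    }
    System.out.println("reads after 9 loads of v1: " + expensiveReads.get()); // 1

    // A commit moves the pointer to a new file, which is a new cache key.
    current = "s3://bucket/tbl/metadata/v2.json";
    cache.get(current);
    System.out.println("reads after loading v2: " + expensiveReads.get()); // 2
  }
}
```

In this sketch the lookup of the current metadata location would still go to
the catalog on every load, so a refresh or an external drop stays visible;
only the read-and-parse step is cached. Because the cached value is immutable,
sharing it between threads is not a concern here.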

Thanks,
Peter

Grant Nicholas <gr...@spothero.com> wrote (on Thu, Feb 25, 2021,
16:42):

> Were you able to confirm why the loadTable() call was slow through
> profiling?
>
> From personal experience, I've seen similar calls behave slowly when the
> connection to the hive metastore was unstable and retries to the hive
> metastore (with exponential backoff) occurred.
>
> On Thu, Feb 25, 2021 at 4:45 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> Recently we have been running 100GB TPC-DS queries on Iceberg-backed
>> Hive tables.
>> We have found that queries accessing big partitioned tables had very
>> slow compilation times. One example is *query77*, where compilation
>> took ~20 min:
>>
>> *INFO  : Completed compiling
>> command(queryId=hive_20210219113124_c398b956-a507-4a82-82fc-c35d97acd3c2);
>> Time taken: 1190.796 seconds*
>>
>>
>> I ran some preliminary tests, and the main bottleneck seems to be the
>> *Catalogs.loadTable()* method.
>>
>> As another example I have checked the
>> *TestHiveIcebergStorageHandlerWithEngine.testInsert()* method, and found
>> that for a simple insert we load the table 9 times.
>>
>> I will definitely dig into the details on how to decrease the number of
>> times we load the table, but I also started to look around in the codebase
>> to find a way to cache the tables. This is how I have found the
>> *org.apache.iceberg.CachingCatalog* class. After some testing I have
>> found that the implementation is lacking some features we need:
>>
>>    - When issuing a *CachingCatalog.loadTable()* it does not refresh the
>>    table but returns the table in the last seen state (contrary to the
>>    default behavior for the other Catalog implementations)
>>    - When some outside process drops the table, we do not notice it;
>>    this, for example, causes problems when recreating tables
>>    - I am just guessing, but I do not think we can share the Table
>>    objects between threads
>>
>>
>> Are the things above bugs, or are they features that ensure the same
>> table snapshot is used throughout the execution?
>>
>> Shall we try to fix these bugs, or should we add a metadata cache
>> layer instead? Caching immutable metadata is much easier and less error
>> prone, and it probably also solves the main bottleneck (S3 read and parsing).
>>
>> Thanks,
>> Peter
>>
>
>
> --
>
> Grant Nicholas / Senior Data Engineer
> gr...@spothero.com
