Hi Nicolas,

Sadly, no profiling info yet. At the time of the run we were checking
correctness, and only "took note" of the perf issue. There are still some
fixes and a few more important base features we would like to tackle (CTAS,
IOW) before we start to focus on performance. In the meantime I am
trying to gather info on the possible improvements.

Based on my previous experience with HMS, CachingCatalog seems like a good
candidate, but for me it behaves oddly, so I was trying to understand the
rationale behind it.
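
To make the metadata cache idea from my original mail (quoted below) a bit
more concrete, here is a minimal sketch of what I have in mind. It is only an
illustration under assumptions: the class names (MetadataCacheSketch,
MetadataCache, ParsedMetadata) and the example paths are hypothetical and do
not exist in Iceberg, and the loader merely stands in for the S3 read and
parsing. The point is that the cache is keyed by the metadata file location,
so an entry never needs to be refreshed or invalidated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Iceberg.
public class MetadataCacheSketch {

    // Stand-in for parsed table metadata; immutable once created.
    static final class ParsedMetadata {
        final String location;
        final String contents;

        ParsedMetadata(String location, String contents) {
            this.location = location;
            this.contents = contents;
        }
    }

    // Caches parsed metadata by metadata file location. Every commit writes
    // a new metadata file, so a cached entry cannot go stale.
    static final class MetadataCache {
        private final Map<String, ParsedMetadata> cache = new ConcurrentHashMap<>();
        private final Function<String, ParsedMetadata> loader;

        MetadataCache(Function<String, ParsedMetadata> loader) {
            this.loader = loader;
        }

        ParsedMetadata get(String metadataLocation) {
            // computeIfAbsent applies the loader at most once per key.
            return cache.computeIfAbsent(metadataLocation, loader);
        }
    }

    public static void main(String[] args) {
        AtomicInteger reads = new AtomicInteger();
        MetadataCache cache = new MetadataCache(location -> {
            reads.incrementAndGet(); // stands in for the S3 read and parsing
            return new ParsedMetadata(location, "parsed:" + location);
        });

        // The catalog lookup would still run on every load and return the
        // current metadata location; only the expensive read is cached.
        cache.get("s3://bucket/tbl/metadata/v1.json");
        cache.get("s3://bucket/tbl/metadata/v1.json");
        cache.get("s3://bucket/tbl/metadata/v2.json"); // after a commit
        System.out.println("expensive reads: " + reads.get());
    }
}
```

With this shape a table dropped or changed by an outside process is still
noticed, because the catalog is asked for the current metadata location each
time, and the cached values are immutable, so sharing them between threads
should be safe.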

Thanks,
Peter

Grant Nicholas <gr...@spothero.com> wrote (on Thu, Feb 25, 2021,
16:42):

> Were you able to confirm why the loadTable() call was slow through
> profiling?
>
> From personal experience, I've seen similar calls behave slowly when the
> connection to the hive metastore was unstable and retries to the hive
> metastore (with exponential backoff) occurred.
>
> On Thu, Feb 25, 2021 at 4:45 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> Recently we have been running the 100GB TPC-DS queries on Iceberg
>> backed Hive tables.
>> We have found that queries accessing big partitioned tables had
>> very slow compilation times. One example is *query77*, where the
>> compilation took ~20min:
>>
>> *INFO  : Completed compiling
>> command(queryId=hive_20210219113124_c398b956-a507-4a82-82fc-c35d97acd3c2);
>> Time taken: 1190.796 seconds*
>>
>>
>> I ran some preliminary tests, and the main bottleneck seems to be the
>> *Catalogs.loadTable()* method.
>>
>> As another example I have checked the
>> *TestHiveIcebergStorageHandlerWithEngine.testInsert()* method, and found
>> that for a simple insert we load the table 9 times.
>>
>> I will definitely dig into the details on how to decrease the number of
>> times we load the table, but I also started to look around in the codebase
>> to find a way to cache the tables. This is how I have found the
>> *org.apache.iceberg.CachingCatalog* class. After some testing I have
>> found that the implementation is lacking some features we need:
>>
>>    - When issuing a *CachingCatalog.loadTable()* it does not refresh the
>>    table but returns the table in the last seen state (contrary to the
>>    default behavior for the other Catalog implementations)
>>    - When some outside process drops the table, we do not notice it;
>>    this, for example, causes problems when the table is recreated
>>    - I am just guessing, but I do not think we can share the Table
>>    objects between threads
>>
>>
>> Are the things above bugs, or are they features that ensure the same table
>> snapshot is used during the execution?
>>
>> Shall we try to fix these bugs, or should we add a metadata cache
>> layer instead? Caching immutable metadata is much easier and less error
>> prone, and it probably also solves the main bottleneck (S3 read and parsing).
>>
>> Thanks,
>> Peter
>>
>
>
> --
>
> Grant Nicholas / Senior Data Engineer
> gr...@spothero.com
> spothero.com <http://www.spothero.com/> | LinkedIn
> <https://www.linkedin.com/in/grantanicholas>
> *Your perfect spot is waiting for you! Learn more at spothero.com/careers
> <http://spothero.com/careers>*
>
