Were you able to confirm why the loadTable() call was slow through profiling?
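If not, a quick way to check is to time the call directly. A minimal
sketch, assuming the org.apache.iceberg.mr.Catalogs entry point from the
Hive integration (the "name" property and the table name below are made
up for illustration):

  import java.util.Properties;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.mr.Catalogs;

  public class LoadTableTiming {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      Properties props = new Properties();
      // Hypothetical table - point this at one of the big partitioned tables
      props.setProperty("name", "default.tpcds_store_sales");

      long start = System.nanoTime();
      Table table = Catalogs.loadTable(conf, props);  // the call identified as the bottleneck
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;

      System.out.printf("loadTable(%s) took %d ms%n", table.name(), elapsedMs);
    }
  }

Timing it once with a healthy metastore connection and once right after a
connection drop should show whether retries are where the time goes.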
From personal experience, I've seen similar calls behave slowly when the
connection to the Hive metastore was unstable and retries to the
metastore (with exponential backoff) occurred. (A quick sketch of how
those backoff delays add up is at the bottom of this mail.)

On Thu, Feb 25, 2021 at 4:45 AM Peter Vary <pv...@cloudera.com.invalid> wrote:

> Hi Team,
>
> Recently we have been playing around with 100GB TPC-DS queries on top of
> Iceberg-backed Hive tables.
> We have found that queries accessing big partitioned tables had extremely
> slow compilation times. One example is *query77*, where compilation took
> ~20min:
>
> *INFO : Completed compiling
> command(queryId=hive_20210219113124_c398b956-a507-4a82-82fc-c35d97acd3c2);
> Time taken: 1190.796 seconds*
>
> I ran some preliminary tests, and the main bottleneck seems to be the
> *Catalogs.loadTable()* method.
>
> As another example, I checked the
> *TestHiveIcebergStorageHandlerWithEngine.testInsert()* method and found
> that for a simple insert we load the table 9 times.
>
> I will definitely dig into how to decrease the number of times we load
> the table, but I also started to look around in the codebase for a way
> to cache the tables. This is how I found the
> *org.apache.iceberg.CachingCatalog* class. After some testing I found
> that the implementation is lacking some features we need:
>
>    - When issuing a *CachingCatalog.loadTable()*, it does not refresh the
>    table but returns the table in its last seen state (contrary to the
>    default behavior of the other Catalog implementations)
>    - When some outside process drops the table, we do not notice it -
>    this causes problems, for example, when tables are dropped and
>    recreated
>    - I am only guessing, but I do not think we can share the Table
>    objects between threads
>
> Are the things above bugs, or features that ensure the same table
> snapshot is used throughout the execution?
>
> Shall we try to fix these bugs, or should we add a metadata cache layer
> instead? Caching immutable metadata is much easier and less error-prone,
> and it probably also solves the main bottleneck (S3 reads and parsing).
>
> Thanks,
> Peter

--
Grant Nicholas / Senior Data Engineer, SpotHero
gr...@spothero.com
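To make the backoff point at the top of this mail concrete, here is a
self-contained illustration (not the actual Hive or Iceberg retry code;
the initial delay and retry count are assumed values) of how exponential
backoff delays accumulate:

  public class BackoffIllustration {
    public static void main(String[] args) {
      long delayMs = 1_000;  // assumed initial retry delay
      int maxRetries = 10;   // assumed retry limit
      long totalMs = 0;

      for (int attempt = 1; attempt <= maxRetries; attempt++) {
        totalMs += delayMs;
        System.out.printf("attempt %d: wait %d ms (cumulative %d ms)%n",
            attempt, delayMs, totalMs);
        delayMs *= 2;  // double the delay on each failed attempt
      }
      // With these assumed parameters the cumulative wait alone is
      // ~17 minutes - the same order of magnitude as the ~20 min
      // compilation time reported above.
    }
  }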
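And to illustrate the *CachingCatalog.loadTable()* behavior described in
the quoted mail - a minimal sketch, assuming an already-configured
Catalog (e.g. a HiveCatalog) and a made-up table identifier; an explicit
*table.refresh()* is one possible workaround until the caching semantics
are settled:

  import org.apache.iceberg.CachingCatalog;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.Catalog;
  import org.apache.iceberg.catalog.TableIdentifier;

  public class CachingCatalogDemo {
    static void demo(Catalog hiveCatalog) {
      Catalog cached = CachingCatalog.wrap(hiveCatalog);
      TableIdentifier id = TableIdentifier.of("default", "tpcds_store_sales");  // hypothetical

      Table first = cached.loadTable(id);  // first load: metastore + metadata file reads
      Table again = cached.loadTable(id);  // cache hit: the cached Table in its last seen
                                           // state, no refresh - the behavior described above

      // Refreshing explicitly picks up commits made by other processes,
      // at the cost of re-reading the metadata:
      again.refresh();
      System.out.println(again.currentSnapshot());
    }
  }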