Hi Nicholas,

Sadly, no profiling info yet. At the time of the run we were checking correctness, and only took note of the perf issue. There are still some fixes and a few more important base features we would like to tackle (CTAS, IOW) before we start to focus on performance. In the meantime I am trying to gather info on possible improvements.
Based on my previous experience with HMS, CachingCatalog seems like a good candidate, but it behaves oddly for me, so I was trying to understand the rationale behind it.

Thanks,
Peter

Grant Nicholas <gr...@spothero.com> wrote (on Thu, 25 Feb 2021, 16:42):

> Were you able to confirm through profiling why the loadTable() call was
> slow?
>
> From personal experience, I've seen similar calls behave slowly when the
> connection to the Hive Metastore was unstable and retries to the Hive
> Metastore (with exponential backoff) occurred.
>
> On Thu, Feb 25, 2021 at 4:45 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> Recently we have been playing around with 100GB TPC-DS queries on
>> Iceberg-backed Hive tables.
>> We found that queries accessing big partitioned tables had very, very
>> slow compilation times. One example is *query77*, where compilation took
>> ~20min:
>>
>> *INFO : Completed compiling
>> command(queryId=hive_20210219113124_c398b956-a507-4a82-82fc-c35d97acd3c2);
>> Time taken: 1190.796 seconds*
>>
>> I ran some preliminary tests, and the main bottleneck seems to be the
>> *Catalogs.loadTable()* method.
>>
>> As another example, I checked the
>> *TestHiveIcebergStorageHandlerWithEngine.testInsert()* method, and found
>> that for a simple insert we load the table 9 times.
>>
>> I will definitely dig into the details of how to decrease the number of
>> times we load the table, but I also started to look around in the
>> codebase for a way to cache the tables. This is how I found the
>> *org.apache.iceberg.CachingCatalog* class.
>> After some testing I have found that the implementation is lacking some
>> features we need:
>>
>> - When issuing a *CachingCatalog.loadTable()*, it does not refresh the
>> table but returns the table in the last seen state (contrary to the
>> default behavior of the other Catalog implementations)
>> - When some outside process drops the table, we do not notice it -
>> this causes problems, for example when recreating tables
>> - I am just guessing, but I do not think we can share the Table
>> objects between threads
>>
>> Are the things above bugs, or features ensuring that the same table
>> snapshot is used during execution?
>>
>> Shall we try to fix these bugs, or might we want to add a metadata cache
>> layer instead? Caching immutable metadata is much easier and less error
>> prone - and it probably also solves the main bottleneck (S3 read and
>> parsing).
>>
>> Thanks,
>> Peter
>
> --
> Grant Nicholas / Senior Data Engineer
> gr...@spothero.com
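The stale-read behavior in the first bullet above can be sketched with a toy example. This is not Iceberg's actual CachingCatalog code; the classes and names below are invented stand-ins, and the point is only that a cache populated via computeIfAbsent keeps returning the first-loaded state, so a commit made through the backing catalog is never seen by readers going through the wrapper:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of a caching catalog that never refreshes.
// NOT Iceberg's CachingCatalog; all names here are hypothetical.
public class StaleCacheDemo {

    // Stand-in for an Iceberg Table: just records a version number.
    static class ToyTable {
        final String name;
        final int version;
        ToyTable(String name, int version) { this.name = name; this.version = version; }
    }

    // Stand-in for the backing catalog (e.g. a Hive-based one):
    // always returns the latest committed state.
    static class BackingCatalog {
        private final Map<String, Integer> versions = new ConcurrentHashMap<>();
        void commit(String name) { versions.merge(name, 1, Integer::sum); }
        ToyTable loadTable(String name) {
            return new ToyTable(name, versions.getOrDefault(name, 0));
        }
    }

    // Caching wrapper: computeIfAbsent means loadTable() returns whatever
    // state was cached first and never refreshes it.
    static class CachingWrapper {
        private final BackingCatalog delegate;
        private final Map<String, ToyTable> cache = new ConcurrentHashMap<>();
        CachingWrapper(BackingCatalog delegate) { this.delegate = delegate; }
        ToyTable loadTable(String name) {
            return cache.computeIfAbsent(name, delegate::loadTable);
        }
    }

    public static void main(String[] args) {
        BackingCatalog backing = new BackingCatalog();
        CachingWrapper caching = new CachingWrapper(backing);

        ToyTable first = caching.loadTable("t");  // cached at version 0
        backing.commit("t");                      // an outside process commits
        ToyTable second = caching.loadTable("t"); // still version 0: stale

        System.out.println(backing.loadTable("t").version); // 1
        System.out.println(second.version);                 // 0 (stale)
    }
}
```

Whether this is a bug or intentional snapshot pinning for the duration of a query is exactly the question raised in the email; the toy only makes the observed behavior concrete.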
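The "metadata cache layer" alternative proposed at the end of the quoted email can also be sketched. The idea rests on the fact that Iceberg metadata files are immutable once written, so parsed metadata can be cached by file location indefinitely: a new commit produces a new file at a new location, which simply misses the cache. Everything below is a hypothetical sketch, not an existing Iceberg API; the parse counter exists only to show how many real reads happen:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a cache for immutable metadata files,
// keyed by their (immutable) storage location.
public class MetadataCacheDemo {

    // Stand-in for parsed table metadata.
    static class ParsedMetadata {
        final String location;
        ParsedMetadata(String location) { this.location = location; }
    }

    static class MetadataCache {
        private final Map<String, ParsedMetadata> cache = new ConcurrentHashMap<>();
        // parser stands in for the expensive step: read from S3 + parse JSON
        private final Function<String, ParsedMetadata> parser;
        int parseCount = 0; // demo-only: counts real parses

        MetadataCache(Function<String, ParsedMetadata> parser) { this.parser = parser; }

        // Because metadata files never change in place, a cache hit is
        // always valid; no invalidation logic is needed.
        ParsedMetadata get(String location) {
            return cache.computeIfAbsent(location, loc -> {
                parseCount++;
                return parser.apply(loc);
            });
        }
    }

    public static void main(String[] args) {
        MetadataCache cache = new MetadataCache(ParsedMetadata::new);
        cache.get("s3://bucket/table/metadata/v1.json");
        cache.get("s3://bucket/table/metadata/v1.json"); // hit: no re-parse
        cache.get("s3://bucket/table/metadata/v2.json"); // new commit, new file
        System.out.println(cache.parseCount); // 2
    }
}
```

The appeal, as the email argues, is that caching only immutable data sidesteps the staleness and drop-detection problems entirely while still removing the repeated S3 read and parse from the hot path.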