[jira] [Commented] (FLINK-20416) Need a cached catalog for batch SQL job

Sebastian Liu (Jira) Tue, 08 Dec 2020 04:37:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245865#comment-17245865
 ]


Sebastian Liu commented on FLINK-20416:
---------------------------------------

Hi [~jark], thanks for your reply and suggestions. I totally agree that we 
should reach a consensus before actual coding. The pull request in this ticket 
is an attempt we made while running the tpc_ds benchmark, and we hope to share 
it with the community now that it has improved the performance of some queries.

First, let me briefly answer your questions:

1. Is this a framework cache?

We hope that it is not a framework cache, but a special common catalog, just 
like the "GenericInMemoryCatalog". The difference is that this 
"GenericCachedCatalog" delegates requests to other kinds of catalogs and 
caches/updates the results gracefully. It is up to the user to decide whether 
their catalog implementation should be wrapped by "GenericCachedCatalog". 
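
To make the delegation idea concrete, here is a minimal sketch. It is not the 
code from the PR and does not use Flink's real Catalog interface: the 
"MetaCatalog" interface, the "CachedMetaCatalog" wrapper and the 
"CountingCatalog" backend are hypothetical names made up for illustration, and 
the schema is reduced to a plain string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-in for a catalog; NOT Flink's Catalog interface. */
interface MetaCatalog {
    String getTableSchema(String tablePath);

    void alterTableSchema(String tablePath, String newSchema);
}

/** Wraps any MetaCatalog, caches schema lookups, and drops entries on writes. */
final class CachedMetaCatalog implements MetaCatalog {
    private final MetaCatalog delegate;
    private final Map<String, String> schemaCache = new ConcurrentHashMap<>();

    CachedMetaCatalog(MetaCatalog delegate) {
        this.delegate = delegate;
    }

    @Override
    public String getTableSchema(String tablePath) {
        // Only the first lookup for a path reaches the underlying catalog.
        return schemaCache.computeIfAbsent(tablePath, delegate::getTableSchema);
    }

    @Override
    public void alterTableSchema(String tablePath, String newSchema) {
        // Write through to the delegate, then invalidate the stale cache entry.
        delegate.alterTableSchema(tablePath, newSchema);
        schemaCache.remove(tablePath);
    }
}

/** Hypothetical backend that counts how often it is actually queried. */
final class CountingCatalog implements MetaCatalog {
    final AtomicInteger lookups = new AtomicInteger();
    private final Map<String, String> tables = new ConcurrentHashMap<>();

    CountingCatalog() {
        tables.put("db.orders", "id BIGINT, amount DOUBLE");
    }

    @Override
    public String getTableSchema(String tablePath) {
        lookups.incrementAndGet();
        return tables.get(tablePath);
    }

    @Override
    public void alterTableSchema(String tablePath, String newSchema) {
        tables.put(tablePath, newSchema);
    }
}
```

Note that this sketch only invalidates entries on writes that go through the 
wrapper. Changes made directly in the underlying metastore would need an 
expiry or refresh policy, which is left out here.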

2. How to enable it? By job configuration?

To answer this question, I think we should first confirm the usage scenarios 
for the cache. A long-running streaming job in a per-job mode cluster is not a 
suitable case. Flink SQL Gateway + a session mode cluster for batch SQL jobs is 
suitable. So we can enable it in the related catalog properties, and the 
CatalogFactory can check the properties to decide whether to create the 
original catalog directly or to create a "GenericCachedCatalog" and put the 
original implementation in it as the delegate. Strictly speaking, this should 
be a cluster configuration.
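
A rough sketch of the property check described above, under stated 
assumptions: the property key "cache.enabled" and the "CatalogWrapping" helper 
are hypothetical and are not existing Flink options or classes. The catalog 
type is left generic so the snippet stands on its own.

```java
import java.util.Map;
import java.util.function.Function;

final class CatalogWrapping {
    /** Hypothetical property key; not an existing Flink option. */
    static final String CACHE_ENABLED = "cache.enabled";

    private CatalogWrapping() {}

    /**
     * Returns the original catalog unless the properties ask for caching,
     * in which case the original is handed to the wrapper as its delegate.
     */
    static <C> C createCatalog(
            Map<String, String> properties, C original, Function<C, C> cachingWrapper) {
        boolean enabled =
                Boolean.parseBoolean(properties.getOrDefault(CACHE_ENABLED, "false"));
        return enabled ? cachingWrapper.apply(original) : original;
    }
}
```

With caching off by default, catalogs that do not set the property behave 
exactly as before.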

3. Caching in a specific catalog?

Yes, I agree too. We have added this "GenericCachedCatalog" for the Hive 
catalog in the PR, and it won't affect other catalogs if they do not use it. 

 

In general, this cache implementation is similar to the corresponding 
implementation in Presto, and our goal is likewise to improve the OLAP 
performance of Flink batch SQL.

> Need a cached catalog for batch SQL job
> ---------------------------------------
>
>                 Key: FLINK-20416
>                 URL: https://issues.apache.org/jira/browse/FLINK-20416
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Common, Connectors / Hive, Table SQL / API, 
> Table SQL / Planner
>            Reporter: Sebastian Liu
>            Priority: Major
>              Labels: pull-request-available
>
> For OLAP scenarios, there are usually some analytical queries whose running 
> time is relatively short. These queries are also sensitive to latency. In the 
> current Blink SQL processing, the parse/validate/optimize stages all need 
> metadata from the catalog API, but each request to the catalog re-runs the 
> underlying metadata query. 
>  
> We may need a cached catalog which can cache the table schema and statistics 
> to avoid unnecessary repeated metadata requests. 
> I have submitted a related PR adding a generic cached catalog, which can 
> delegate to other implementations of {{AbstractCatalog}}.
> {{[https://github.com/apache/flink/pull/14260]}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
