Materialized views are not a hack, as Vladimir claims. Materialized views are a fundamental concept in relational algebra, and they are an elegant way - in my opinion the correct way - to model indexes (and many other structures).
In Calcite materialized views are a feature in the planner that allows you to declare a table as equivalent to a query. (You do not need to issue a CREATE MATERIALIZED VIEW statement. They are internal.) Then you can use algebra to substitute one for the other, and apply cost-based optimization to choose between plans with & without the materialized view. Other people have used this approach in the past. Search this list and you should find discussions. Also I have a talk with Maryann Xue about planning for indexes in Apache Phoenix[1]. Julian [1] https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite <https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite> > On May 29, 2020, at 5:28 AM, Roman Kondakov <[email protected]> > wrote: > > Hi Tim, > > In Apache Ignite we've already faced this challenge. We solved it using > materialized views and FilterableTable. Let's consider your example: > >> select * from users where country='UK' and some_other_column='foo'; > > with a primary index and a sorted secondary index (B+Tree?) over the > 'country' field. > > As a first step Ignite registers all indexes that might be helpful for > the query plan in VolcanoPlanner as materialized views. For now we > register all indexes in all tables that participate in the query. > Registering all indexes might be excessive, but maybe we will apply some > pruning later. So it's ok for now to register all indexes. > > Index materialized view is very simple: it's just a sorted table scan: > > 'SELECT * from tbl ORDER BY idx_fields' > > After registering indexes as materialized views, the optimizer's search > space will look like this: > > Project([*]) > Filter[country='UK' and some_other_column='foo'] > Set0 > > where Set0 consists of table and index scans: > > TableScan('tbl', collation=[]) > IndexScan('PK', collation=[PK_field]) > IndexScan('country_idx', collation=[country]) > > At this step we have our index scans registered in the optimizer within > the same equivalence set as a TableScan. > > The next step is a pushing filters down to the scans. We do it with a > rule which is similar to 'FilterTableScanRule'. After applying this rule > we have a search space that is looking like this: > > Project([*]) > TableScan('tbl', collation=[], filter=[country='UK' and > some_other_column='foo']) > IndexScan('PK', collation=[PK_field], filter=[country='UK' and > some_other_column='foo']) > IndexScan('country_idx', collation=[country], filter=[country='UK' and > some_other_column='foo']) > > And the final step is adjusting the cost model to make it select the > scan with the lower cost which depends on the filter conditions within > the Scan. For example, full table scan with filters > > TableScan('tbl', collation=[], filter=[country='UK' and > some_other_column='foo']) > > will cost, say, 100. Because it have to scan all rows and then filter > out some set of them. On the other hand the index scan that can do index > lookup instead of full scan will have a less cost. For example > > IndexScan('country_idx', collation=[country], filter=[country='UK' and > some_other_column='foo']) > > will have cost about ~10, or so because it has a good index condition > `country='UK'` which can be used for index lookup that that returns only > 10% of rows. And therefore this IndexScan should be chosen as the best > plan by the optimizer. > > We've recently implemented this approach in Apache Ignite and it works > well for us. You can find it in [1]. This PR has many changes that are > unrelated to the main topic. So particularly you can look at > `IgnitePlanner.materializations()' method which registers indexes as > materialized views and `IgniteTableScan` which performs filter > conditions assessment and index scan cost estimation. > > > [1] https://github.com/apache/ignite/pull/7813 > > -- > Kind Regards > Roman Kondakov > > > On 29.05.2020 11:44, Tim Fox wrote: >> Hi, >> >> I'm building a query engine with Calcite - really enjoying working with >> Calcite so far! >> >> When creating a plan, it seems Calcite always creates a plan where the >> sources are table scans, however in my implementation the tables can have >> indexes on them so a table scan is not always the right choice. >> >> I was wondering if there was any way of making Calcite "index aware" - e.g. >> perhaps providing hints to the table scan instance that, actually, an index >> scan or a primary key lookup should be used instead of actually scanning >> the table. E.g. On the table meta-data if we provided information about any >> indexes on the table, then Calcite could figure out what parts of the query >> to push to the table scan and which to keep in the rest of the plan. >> >> There are two specific cases I really care about: >> >> 1. Queries that contain a primary key lookup: >> >> select * from some_table where key_column=23 AND some_other_column='foo'; >> >> In the above case the 'select * from some_table where key_column=23' can be >> implemented as a simple PK lookup in the source table, not requiring a >> scan, thus leaving just the filter corresponding to >> 'some_other_column='foo'' in the rest of the plan >> >> 2. Queries with expressions on a column which has a secondary index >> >> select * from users where country='UK' and some_other_column='foo'; >> >> We have many users, and let's say 10% of them are from UK (still a lot). We >> have a secondary index in the country column in the source table so we can >> do an efficient index scan to retrieve the matching records. >> >> I found this document >> https://calcite.apache.org/docs/materialized_views.html which seems like it >> might help me in some way. >> >> The idea being if I can think of my indexes as materialized views then the >> query can be written against those materialized views as sources instead of >> the original table sources. There appears to be a rule >> 'MaterializedViewRule' that does this already (?). >> >> This seems to get me a bit further, however, for this approach to work, it >> seems I would have to create materialized views _dynamically_ during >> evaluation of the query, register them, rewrite the query, execute it, then >> deregister the materialized view. >> >> E.g. for the primary key lookup example above, for the following query: >> >> select * from some_table where key_column=23 AND some_other_column='foo'; >> >> I would need to dynamically create a materialized view corresponding to: >> >> select * from some_table where key_column=23 >> >> Then rewrite the query using MaterializedViewRule. >> >> In the general case, in order to figure out what materialized views I need >> to dynamically create I would need to examine the query, figure out which >> columns in expressions have indexes on them and from them work out the best >> materialized view to create based on that information. This seems non >> trivial. >> >> Does anyone have any suggestions or pointers for how to implement this kind >> of thing? I suspect I'm not the first person to have tried to do this, as >> using indexes on tables seems a pretty common thing in many systems (?) >>
