For the record - I agree that the semantic layer is probably a better place for this. But I would've said something similar about measures before I read your paper haha. So I'm pushing on this mostly out of curiosity.
Agreed - this is essentially a virtual cube. And MDX (or SPARQL) might be a better language for this type of query - but alas, SQL is king. > There are so many columns it’s impossible to give them meaningful names. Is that right? Maybe I'm thinking of it slightly differently. Yes, something has to flatten the graph. I think that's the job of the semantic layer. And it's got to choose meaningful names - hard, but not impossible. At least if thinking from the perspective of a particular business domain. The semantic layer first has to flatten each cube and conform the dimensions: ``` crew view orders_cube as select SUM(o.revenue) as measure sum_revenue, l.country as origin_country from orders o left join customers c on o.customerId = c.id left join locations l on c.locationId = l.id crew view clicks_cube as select COUNT(DISTINCT c.cookieId) as measure unique_clickers, l.country as origin_country from clicks c left join sessions s on c.sessionId = s.id left join locations l on s.locationId = l.id create view __all as select coalesce(o.origin_country, c.origin_country) as origin_country, o.sum_revenue, c.unique_clickers, from orders_cube o full outer join clicks_cube c on o.origin_country = c.origin_country ``` Yes, this is done at DDL time. And yes, this certainly can't answer all types of queries. But I think it works for simple slicing and dicing on a virtual cube? On Fri, Oct 4, 2024 at 6:52 PM Julian Hyde <jhyde.apa...@gmail.com> wrote: > Let’s review where we are. > * We have several fact tables (each representing a business process - > taking orders, shipping orders, canceling orders, returning products, > storing products in a warehouse) and we have surrounded each fact table > with dimension tables and defined some measures in a ’star join’ view or > ‘cube’. > * We can join several cubes (or regular tables) in a single query, > joining on their common dimensions. If the query has no GROUP BY the number > of rows will be astronomical, but if there is a GROUP BY the number of rows > will be reasonable, and because measures manage their own grain their > values will be correct. > > This is, in essence, what you can achieve with virtual cubes in MDX [ > https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms709837(v=vs.85) > ], and furthermore you don’t have to do it at DDL time, you can do it at > query time. > > This looks like a graph: cubes connected, via its dimensions, to entities > (date, customer, product, warehouse, campaign). If two cubes have at least > one entity in common you can join them. A query would seem to be just a > list of attributes (of entities) and a list of measures, provided that > there is a path joining them. > > But we have a problem. The problem is that as the graph gets larger, it > begins to have (undirected) cycles: there is more than one path between > some pairs of entities. For example, you are interested in customers and > related products. It is ambiguous whether you mean customers that have > purchased a product, or customers who have been targeted in a marketing > campaign for a product. > > That’s the problem with Barak’s ‘all’ table which represents a > flattened-graph. It’s not the large number of columns (the union of columns > from all the cubes and entities). Nor is it the colossal number of rows > (effectively the cartesian product of all entities). It’s that in order to > flatten the graph you need to repeat each entity several times, once for > each path. For a given product there are not just customers; there are > order-customers, and campaign-customers, and product-return-customers. > There are so many columns it’s impossible to give them meaningful names. > > To disambiguate, use measures. “Show me all customers and products for > which campaignClicks was greater than 3 in 2023 and orderCount was greater > than zero in 2023 or 2024.” The measures belong to cubes, and therefore the > join path between the entities (dimension tables) must pass through their > fact tables. > > I don’t think that preceding query can be expressed in SQL. Even if the > cubes, entities, measures etc. can all be represented as tables, SQL (quite > rightly) requires joins to be explicitly stated. I think there is a > higher-level “graph” query language which can resolve join paths and spit > out a SQL query with explicit joins. (Let’s start talking about adding > measures to SPARQL.) > > As for query-planning. Predicates on measures are a generalization of semi > joins. In fact “orderCount > 0” is exactly the same as “where exists > (select * from orders where pid = products.pid and cid = customers.cid)”. > So a query that says “cartesian product of all products and all customers, > with the side condition that orderCount > 0” becomes a join graph. > > The join graph may not have a tight tree structure - so techniques for > planning and executing “bushy trees” [1] may be helpful. > > Julian > > > [1] https://www.csd.uoc.gr/~hy460/pdf/Left-deep_vs_bushy_trees.pdf > > > > On Oct 2, 2024, at 6:52 AM, Barak Alon <barak.s.a...@gmail.com> wrote: > > > > Got it, that makes a lot of sense. > > > > Common dimensions is the use case I was thinking of. > > > > Our semantic layer exposes a SQL interface that allows for queries > similar > > to this: > > > > ``` > > select > > productId, > > sum_revenue, > > total_on_hand > > from __all > > where color = 'Red' > > group by productId > > ``` > > > > It knows that sum_revenue comes from orders, total_on_hand comes from > > inventory, and that productId and color are common dimensions. So it can > > rewrite this query as: > > > > ``` > > select > > coalesce(o.productId, v.productId) as productId, > > o.sum_revenue, > > v.total_on_hand > > from ( > > select > > productId, > > sum_revenue > > from orders > > where color = 'Red' > > group by productId > > ) as o > > full outer join ( > > select > > productId, > > total_on_hand > > from inventory > > where color = 'Red' > > group by productId > > ) as v > > on o.productId = v.productId > > ``` > > > > So you could define measures on this __all abomination - say, > > revenue_per_total = sum_revenue / total_on_hand. > > > > This has a lot of similar benefits proposed in the paper. For example, > this > > is an even easier target for generative AI - the model doesn't need to > know > > how to join cubes. > > > > re: your statement "I don’t think it would be a good idea to build a view > > on this many-to-many join (too many columns, too much complexity)." > > > > I want to dig into what such a view would look like. I'm imagining: > > > > ``` > > create view __all as > > select > > coalesce(o.productId, v.productId) as productId, > > coalesce(o.color, v.color) as color, > > -- ... many more > > > > o.sum_revenue, > > v.total_on_hand, > > o.sum_revenue / v.total_on_hand as measure revenue_per_total, -- a > > measure that spans multiple tables > > -- ... many more > > > > from o > > full outer join v > > on o.productId = v.productId > > and o.color = v.color > > -- ... many more > > > > -- .. many more > > ``` > > > > A lot of challenges here: > > Efficiently dealing with so many columns? > > Is it possible for an optimizer rule to prune the unused join conditions? > > Is there sleeker syntax that would make defining such a view more > tenable? > > > > Does such complexity belong as a fundamental concept like Measures or in > > some semantic layer with custom rewrite rules? I don't know. But I do > think > > there is value in such an abstraction, and it would be cool if it was > > standardized. > > > > On Tue, Oct 1, 2024 at 6:36 PM Julian Hyde <jhyde.apa...@gmail.com > <mailto:jhyde.apa...@gmail.com>> wrote: > > > >> Good question - thanks for asking. > >> > >> The short answer is that measures that span tables work, but they aren’t > >> quite as convenient as measures in a single table. > >> > >> When designing Measures in SQL I knew that measures sometimes span > tables > >> - star schemas are a case in point - but decided to make measures > belong to > >> one table for simplicity. This allowed me to put them into the same > >> namespace as columns, so I can write ’select e.avg_sal from emp as e’. > >> (Measures are not columns, but SQL semantic analyzers already know how > to > >> look up a column within the scope of a table.) > >> > >> My goal was to allow dimensional expressions, such as revenue-this-year > >> minus revenue-last-year, so we need to know what are the dimensions that > >> affect the value of a measure. By associating a measure with a table we > can > >> say that the (non-measure) columns of that table are its dimensions. > >> > >> One case of 'measures spanning tables’ is joining a table with a measure > >> to a regular table. For example, suppose the orders table has an > >> avg_shipping_cost measure, and we write this query: > >> > >> select o.zipcode, o.avg_shipping_cost, count(distinct i.color) > >> from orders as o > >> join orderItems as i on o.id <http://o.id/> <http://o.id/> = > i.orderId > >> group by o.zipcode > >> > >> If I’d written avg(o.shipping_cost), orders with many items would be > >> weighted more heavily in the total than orders with few items. But > measures > >> are ‘locked’ to the grain of their table. > >> > >> This is a crucial property of measures. It allows me to create a wide > join > >> view over a star schema - say of orders, order-items, customers and > >> products - and the measures in that view will be well-behaved. Users get > >> the benefits of the join without having to remember the right join > >> conditions. > >> > >> By the way, if I really want a weighted average shipping, I can unpack > >> convert the shipping-cost measure into a value, join it to the > order-items, > >> and then re-package it as a measure again: > >> > >> select o.zipcode, avg(o.avg_shipping_cost) as measure > >> weighted_avg_shipping_cost, > >> count(distinct i.color) > >> from orders as o > >> join orderItems as i on o.id <http://o.id/> <http://o.id/> = > i.orderId > >> group by o.zipcode > >> > >> Now, another case of ‘measures spanning tables’ is common dimensions > (also > >> known as conformed dimensions). Let’s suppose I have an orders table > and an > >> inventory table, both with measures, and both with a date column > (orderDate > >> and inventoryDate) and a reference to a product. I can write a query > that > >> joins these tables: > >> > >> select o.sum_revenue, v.total_on_hand > >> from orders as o > >> join inventory as v > >> on o.orderDate = v.inventoryDate > >> and o.productId = v.productId > >> where o.color = ‘Red' > >> > >> This is a many-to-many join — like joining two cubes in an OLAP system — > >> but the measures’ grain-locking ensures prevents double-counting. > >> > >> Notice that I have had to manually put the common dimensions into a join > >> condition. There is nothing in the definition of the orders.sum_revenue > and > >> inventory.total_on_hand measures that connects their date dimensions. I > >> have to remember to make that connection. > >> > >> Where to store those definitions? I don’t know. I don’t think it would > be > >> a good idea to build a view on this many-to-many join (too many columns, > >> too much complexity). > >> > >> One idea is to store these common dimensions in the semantic layer. > >> Perhaps a 'semantic layer' should have entities like ‘product’, ‘date’, > >> ‘customer’ and record which columns on the ‘cube’ tables connect to each > >> entity. And then the semantic layer can help people to write SQL. In my > >> opinion, measures are not really the semantic layer — they are already > >> there, as part of the tables, but their existence means that the > semantic > >> layer doesn’t have to worry about grain and double-counting, and so can > >> generate simpler SQL. > >> > >> Julian > >> > >> > >> > >>> On Sep 30, 2024, at 3:00 PM, Barak Alon <barak.s.a...@gmail.com> > wrote: > >>> > >>> Hey there - > >>> > >>> I've found my way here via Measures in SQL > >>> <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long time > >>> admirer of Calcite, but I'm not very familiar with internals - > apologies > >> if > >>> I stumble. > >>> > >>> I work on Airbnb's Minerva > >>> < > >> > https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70 > >>> > >>> project, > >>> an internal semantic layer for star schemas. Measures in SQL is a > >>> wonderfully elegant idea, and I've started exploring if we can > introduce > >>> them to our system. > >>> > >>> However, our users often write queries that span multiple tables/stars, > >>> requiring a drill across operation. Our system handles this by > exposing a > >>> virtual table that pretends as if our entire semantic layer is a > single, > >>> wide fact table. Incompatible measure/dimension combinations result in > >>> runtime errors. > >>> > >>> This need feels pretty common and a logical extension to some of > >> Measures' > >>> strengths - making SQL more powerful and expressive for "top down" > >> queries. > >>> > >>> I'm curious - are there plans to support measures that span multiple > >> tables > >>> in Calcite? > >>> > >>> - Barak > >