Re: [Discussion] Measures that span multiple tables

Barak Alon Sat, 05 Oct 2024 05:23:33 -0700

For the record - I agree that the semantic layer is probably a better place
for this. But I would've said something similar about measures before I
read your paper haha. So I'm pushing on this mostly out of curiosity.


Agreed - this is essentially a virtual cube. And MDX (or SPARQL) might be a
better language for this type of query - but alas, SQL is king.

> There are so many columns it’s impossible to give them meaningful names.

Is that right? Maybe I'm thinking of it slightly differently.

Yes, something has to flatten the graph. I think that's the job of the
semantic layer. And it's got to choose meaningful names - hard, but not
impossible. At least if thinking from the perspective of a particular
business domain.

The semantic layer first has to flatten each cube and conform the
dimensions:

```
crew view orders_cube as
select
  SUM(o.revenue) as measure sum_revenue,
  l.country as origin_country
from orders o
left join customers c
  on o.customerId = c.id
left join locations l
  on c.locationId = l.id


crew view clicks_cube as
select
  COUNT(DISTINCT c.cookieId) as measure unique_clickers,
  l.country as origin_country
from clicks c
left join sessions s
  on c.sessionId = s.id
left join locations l
  on s.locationId = l.id


create view __all as
select
  coalesce(o.origin_country, c.origin_country) as origin_country,
  o.sum_revenue,
  c.unique_clickers,
from orders_cube o
full outer join clicks_cube c
  on o.origin_country = c.origin_country
```

Yes, this is done at DDL time. And yes, this certainly can't answer all
types of queries. But I think it works for simple slicing and dicing on a
virtual cube?

On Fri, Oct 4, 2024 at 6:52 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:

> Let’s review where we are.
>  * We have several fact tables (each representing a business process -
> taking orders, shipping orders, canceling orders, returning products,
> storing products in a warehouse) and we have surrounded each fact table
> with dimension tables and defined some measures in a ’star join’ view or
> ‘cube’.
>  * We can join several cubes (or regular tables) in a single query,
> joining on their common dimensions. If the query has no GROUP BY the number
> of rows will be astronomical, but if there is a GROUP BY the number of rows
> will be reasonable, and because measures manage their own grain their
> values will be correct.
>
> This is, in essence, what you can achieve with virtual cubes in MDX [
> https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms709837(v=vs.85)
> ], and furthermore you don’t have to do it at DDL time, you can do it at
> query time.
>
> This looks like a graph: cubes connected, via its dimensions, to entities
> (date, customer, product, warehouse, campaign). If two cubes have at least
> one entity in common you can join them. A query would seem to be just a
> list of attributes (of entities) and a list of measures, provided that
> there is a path joining them.
>
> But we have a problem. The problem is that as the graph gets larger, it
> begins to have (undirected) cycles: there is more than one path between
> some pairs of entities. For example, you are interested in customers and
> related products. It is ambiguous whether you mean customers that have
> purchased a product, or customers who have been targeted in a marketing
> campaign for a product.
>
> That’s the problem with Barak’s ‘all’ table which represents a
> flattened-graph. It’s not the large number of columns (the union of columns
> from all the cubes and entities). Nor is it the colossal number of rows
> (effectively the cartesian product of all entities). It’s that in order to
> flatten the graph you need to repeat each entity several times, once for
> each path. For a given product there are not just customers; there are
> order-customers, and campaign-customers, and product-return-customers.
> There are so many columns it’s impossible to give them meaningful names.
>
> To disambiguate, use measures. “Show me all customers and products for
> which campaignClicks was greater than 3 in 2023 and orderCount was greater
> than zero in 2023 or 2024.” The measures belong to cubes, and therefore the
> join path between the entities (dimension tables) must pass through their
> fact tables.
>
> I don’t think that preceding query can be expressed in SQL. Even if the
> cubes, entities, measures etc. can all be represented as tables, SQL (quite
> rightly) requires joins to be explicitly stated. I think there is a
> higher-level “graph” query language which can resolve join paths and spit
> out a SQL query with explicit joins. (Let’s start talking about adding
> measures to SPARQL.)
>
> As for query-planning. Predicates on measures are a generalization of semi
> joins. In fact “orderCount > 0” is exactly the same as “where exists
> (select * from orders where pid = products.pid and cid = customers.cid)”.
> So a query that says “cartesian product of all products and all customers,
> with the side condition that orderCount > 0” becomes a join graph.
>
> The join graph may not have a tight tree structure - so techniques for
> planning and executing “bushy trees” [1] may be helpful.
>
> Julian
>
>
> [1] https://www.csd.uoc.gr/~hy460/pdf/Left-deep_vs_bushy_trees.pdf
>
>
> > On Oct 2, 2024, at 6:52 AM, Barak Alon <barak.s.a...@gmail.com> wrote:
> >
> > Got it, that makes a lot of sense.
> >
> > Common dimensions is the use case I was thinking of.
> >
> > Our semantic layer exposes a SQL interface that allows for queries
> similar
> > to this:
> >
> > ```
> > select
> >  productId,
> >  sum_revenue,
> >  total_on_hand
> > from __all
> > where color = 'Red'
> > group by productId
> > ```
> >
> > It knows that sum_revenue comes from orders, total_on_hand comes from
> > inventory, and that productId and color are common dimensions. So it can
> > rewrite this query as:
> >
> > ```
> > select
> >  coalesce(o.productId, v.productId) as productId,
> >  o.sum_revenue,
> >  v.total_on_hand
> > from (
> >  select
> >    productId,
> >    sum_revenue
> >  from orders
> >  where color = 'Red'
> >  group by productId
> > ) as o
> > full outer join (
> >  select
> >    productId,
> >    total_on_hand
> >  from inventory
> >  where color = 'Red'
> >  group by productId
> > ) as v
> >  on o.productId = v.productId
> > ```
> >
> > So you could define measures on this __all abomination - say,
> > revenue_per_total = sum_revenue / total_on_hand.
> >
> > This has a lot of similar benefits proposed in the paper. For example,
> this
> > is an even easier target for generative AI - the model doesn't need to
> know
> > how to join cubes.
> >
> > re: your statement "I don’t think it would be a good idea to build a view
> > on this many-to-many join (too many columns, too much complexity)."
> >
> > I want to dig into what such a view would look like. I'm imagining:
> >
> > ```
> > create view __all as
> > select
> >  coalesce(o.productId, v.productId) as productId,
> >  coalesce(o.color, v.color) as color,
> >  -- ... many more
> >
> >  o.sum_revenue,
> >  v.total_on_hand,
> >  o.sum_revenue / v.total_on_hand as measure revenue_per_total,  -- a
> > measure that spans multiple tables
> >  -- ... many more
> >
> > from o
> > full outer join v
> >  on o.productId = v.productId
> >  and o.color = v.color
> >  -- ... many more
> >
> > -- .. many more
> > ```
> >
> > A lot of challenges here:
> > Efficiently dealing with so many columns?
> > Is it possible for an optimizer rule to prune the unused join conditions?
> > Is there sleeker syntax that would make defining such a view more
> tenable?
> >
> > Does such complexity belong as a fundamental concept like Measures or in
> > some semantic layer with custom rewrite rules? I don't know. But I do
> think
> > there is value in such an abstraction, and it would be cool if it was
> > standardized.
> >
> > On Tue, Oct 1, 2024 at 6:36 PM Julian Hyde <jhyde.apa...@gmail.com
> <mailto:jhyde.apa...@gmail.com>> wrote:
> >
> >> Good question - thanks for asking.
> >>
> >> The short answer is that measures that span tables work, but they aren’t
> >> quite as convenient as measures in a single table.
> >>
> >> When designing Measures in SQL I knew that measures sometimes span
> tables
> >> - star schemas are a case in point - but decided to make measures
> belong to
> >> one table for simplicity. This allowed me to put them into the same
> >> namespace as columns, so I can write ’select e.avg_sal from emp as e’.
> >> (Measures are not columns, but SQL semantic analyzers already know how
> to
> >> look up a column within the scope of a table.)
> >>
> >> My goal was to allow dimensional expressions, such as revenue-this-year
> >> minus revenue-last-year, so we need to know what are the dimensions that
> >> affect the value of a measure. By associating a measure with a table we
> can
> >> say that the (non-measure) columns of that table are its dimensions.
> >>
> >> One case of 'measures spanning tables’ is joining a table with a measure
> >> to a regular table. For example, suppose the orders table has an
> >> avg_shipping_cost measure, and we write this query:
> >>
> >>  select o.zipcode, o.avg_shipping_cost, count(distinct i.color)
> >>  from orders as o
> >>    join orderItems as i on o.id <http://o.id/> <http://o.id/> =
> i.orderId
> >>  group by o.zipcode
> >>
> >> If I’d written avg(o.shipping_cost), orders with many items would be
> >> weighted more heavily in the total than orders with few items. But
> measures
> >> are ‘locked’ to the grain of their table.
> >>
> >> This is a crucial property of measures. It allows me to create a wide
> join
> >> view over a star schema - say of orders, order-items, customers and
> >> products - and the measures in that view will be well-behaved. Users get
> >> the benefits of the join without having to remember the right join
> >> conditions.
> >>
> >> By the way, if I really want a weighted average shipping, I can unpack
> >> convert the shipping-cost measure into a value, join it to the
> order-items,
> >> and then re-package it as a measure again:
> >>
> >>  select o.zipcode, avg(o.avg_shipping_cost) as measure
> >> weighted_avg_shipping_cost,
> >>      count(distinct i.color)
> >>  from orders as o
> >>    join orderItems as i on o.id <http://o.id/> <http://o.id/> =
> i.orderId
> >>  group by o.zipcode
> >>
> >> Now, another case of ‘measures spanning tables’ is common dimensions
> (also
> >> known as conformed dimensions). Let’s suppose I have an orders table
> and an
> >> inventory table, both with measures, and both with a date column
> (orderDate
> >> and inventoryDate) and a reference to a product. I can write a query
> that
> >> joins these tables:
> >>
> >>  select o.sum_revenue, v.total_on_hand
> >>  from orders as o
> >>    join inventory as v
> >>    on o.orderDate = v.inventoryDate
> >>    and  o.productId = v.productId
> >>  where o.color = ‘Red'
> >>
> >> This is a many-to-many join — like joining two cubes in an OLAP system —
> >> but the measures’ grain-locking ensures prevents double-counting.
> >>
> >> Notice that I have had to manually put the common dimensions into a join
> >> condition. There is nothing in the definition of the orders.sum_revenue
> and
> >> inventory.total_on_hand measures that connects their date dimensions. I
> >> have to remember to make that connection.
> >>
> >> Where to store those definitions? I don’t know. I don’t think it would
> be
> >> a good idea to build a view on this many-to-many join (too many columns,
> >> too much complexity).
> >>
> >> One idea is to store these common dimensions in the semantic layer.
> >> Perhaps a 'semantic layer' should have entities like ‘product’, ‘date’,
> >> ‘customer’ and record which columns on the ‘cube’ tables connect to each
> >> entity. And then the semantic layer can help people to write SQL. In my
> >> opinion, measures are not really the semantic layer — they are already
> >> there, as part of the tables, but their existence means that the
> semantic
> >> layer doesn’t have to worry about grain and double-counting, and so can
> >> generate simpler SQL.
> >>
> >> Julian
> >>
> >>
> >>
> >>> On Sep 30, 2024, at 3:00 PM, Barak Alon <barak.s.a...@gmail.com>
> wrote:
> >>>
> >>> Hey there -
> >>>
> >>> I've found my way here via Measures in SQL
> >>> <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long time
> >>> admirer of Calcite, but I'm not very familiar with internals -
> apologies
> >> if
> >>> I stumble.
> >>>
> >>> I work on Airbnb's Minerva
> >>> <
> >>
> https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70
> >>>
> >>> project,
> >>> an internal semantic layer for star schemas. Measures in SQL is a
> >>> wonderfully elegant idea, and I've started exploring if we can
> introduce
> >>> them to our system.
> >>>
> >>> However, our users often write queries that span multiple tables/stars,
> >>> requiring a drill across operation. Our system handles this by
> exposing a
> >>> virtual table that pretends as if our entire semantic layer is a
> single,
> >>> wide fact table. Incompatible measure/dimension combinations result in
> >>> runtime errors.
> >>>
> >>> This need feels pretty common and a logical extension to some of
> >> Measures'
> >>> strengths - making SQL more powerful and expressive for "top down"
> >> queries.
> >>>
> >>> I'm curious - are there plans to support measures that span multiple
> >> tables
> >>> in Calcite?
> >>>
> >>> - Barak
>
>

Re: [Discussion] Measures that span multiple tables

Reply via email to