Re: [Discussion] Measures that span multiple tables

Julian Hyde Tue, 01 Oct 2024 15:34:44 -0700

Good question - thanks for asking.

The short answer is that measures that span tables work, but they aren’t quite 
as convenient as measures in a single table.

When designing Measures in SQL I knew that measures sometimes span tables - 
star schemas are a case in point - but decided to make measures belong to one 
table for simplicity. This allowed me to put them into the same namespace as 
columns, so I can write ’select e.avg_sal from emp as e’. (Measures are not 
columns, but SQL semantic analyzers already know how to look up a column within 
the scope of a table.)

My goal was to allow dimensional expressions, such as revenue-this-year minus 
revenue-last-year, so we need to know what are the dimensions that affect the 
value of a measure. By associating a measure with a table we can say that the 
(non-measure) columns of that table are its dimensions.

One case of 'measures spanning tables’ is joining a table with a measure to a 
regular table. For example, suppose the orders table has an avg_shipping_cost 
measure, and we write this query:

  select o.zipcode, o.avg_shipping_cost, count(distinct i.color)
  from orders as o
    join orderItems as i on o.id <http://o.id/> = i.orderId
  group by o.zipcode

If I’d written avg(o.shipping_cost), orders with many items would be weighted 
more heavily in the total than orders with few items. But measures are ‘locked’ 
to the grain of their table.

This is a crucial property of measures. It allows me to create a wide join view 
over a star schema - say of orders, order-items, customers and products - and 
the measures in that view will be well-behaved. Users get the benefits of the 
join without having to remember the right join conditions.

By the way, if I really want a weighted average shipping, I can unpack convert 
the shipping-cost measure into a value, join it to the order-items, and then 
re-package it as a measure again:

  select o.zipcode, avg(o.avg_shipping_cost) as measure 
weighted_avg_shipping_cost,
      count(distinct i.color)
  from orders as o
    join orderItems as i on o.id <http://o.id/> = i.orderId
  group by o.zipcode

Now, another case of ‘measures spanning tables’ is common dimensions (also 
known as conformed dimensions). Let’s suppose I have an orders table and an 
inventory table, both with measures, and both with a date column (orderDate and 
inventoryDate) and a reference to a product. I can write a query that joins 
these tables:

  select o.sum_revenue, v.total_on_hand
  from orders as o
    join inventory as v
    on o.orderDate = v.inventoryDate
    and  o.productId = v.productId
  where o.color = ‘Red'

This is a many-to-many join — like joining two cubes in an OLAP system — but 
the measures’ grain-locking ensures prevents double-counting.

Notice that I have had to manually put the common dimensions into a join 
condition. There is nothing in the definition of the orders.sum_revenue and 
inventory.total_on_hand measures that connects their date dimensions. I have to 
remember to make that connection.

Where to store those definitions? I don’t know. I don’t think it would be a 
good idea to build a view on this many-to-many join (too many columns, too much 
complexity).

One idea is to store these common dimensions in the semantic layer. Perhaps a 
'semantic layer' should have entities like ‘product’, ‘date’, ‘customer’ and 
record which columns on the ‘cube’ tables connect to each entity. And then the 
semantic layer can help people to write SQL. In my opinion, measures are not 
really the semantic layer — they are already there, as part of the tables, but 
their existence means that the semantic layer doesn’t have to worry about grain 
and double-counting, and so can generate simpler SQL.

Julian

> On Sep 30, 2024, at 3:00 PM, Barak Alon <barak.s.a...@gmail.com> wrote:
> 
> Hey there -
> 
> I've found my way here via Measures in SQL
> <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long time
> admirer of Calcite, but I'm not very familiar with internals - apologies if
> I stumble.
> 
> I work on Airbnb's Minerva
> <https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70>
> project,
> an internal semantic layer for star schemas. Measures in SQL is a
> wonderfully elegant idea, and I've started exploring if we can introduce
> them to our system.
> 
> However, our users often write queries that span multiple tables/stars,
> requiring a drill across operation. Our system handles this by exposing a
> virtual table that pretends as if our entire semantic layer is a single,
> wide fact table. Incompatible measure/dimension combinations result in
> runtime errors.
> 
> This need feels pretty common and a logical extension to some of Measures'
> strengths - making SQL more powerful and expressive for "top down" queries.
> 
> I'm curious - are there plans to support measures that span multiple tables
> in Calcite?
> 
> - Barak

Re: [Discussion] Measures that span multiple tables

Reply via email to