Standard SQL would not implicitly multiply rows. When performing GROUP BY x, and if x is a multiset, it would combine rows whose values have the same multiset: thus MULTISET [1, 3] = MULTISET [1, 3] but not MULTISET [1, 3, 3].
You can use the “UNNEST” operator to unpack rows. A common thing to do is to join a row to one of its multiset columns. If we have table t (x int, y int multiset) then we could write select x, sum(y) from t cross join unnest t.y group by x Drill and Postgres allow unnest in the SELECT clause (rather than a table operator in the FROM clause) but that feels wrong to me: what’s in the SELECT clause should not affect the number of rows. Julian > On Feb 15, 2017, at 10:52 PM, Gian Merlino <[email protected]> wrote: > > ARRAY doesn't sound right: > >> Arrays can be compared for exact match with = or <>: >> No other inequality operator is supported, and usage of arrays in just > about any other context (including GROUP BY) is illegal. > > That doesn't line up well with how multi-value columns behave in Druid. > > Multiset sounds more promising. "'x' MEMBER OF a_multiset" looks like it > matches how {"type":"selector","dimension":"a_mv_column","value":"x"} would > behave in Druid -- our standard column-is-value filter. I don't see > something on that page about what happens if you GROUP BY a multiset, > though, or if any operation on multisets in SQL behaves like grouping on > multi-value columns does in Druid. Do you happen to know off the top of > your head how that works? If not I'll try to find some docs somewhere. > > Gian > > On Wed, Feb 15, 2017 at 12:10 PM, Julian Hyde <[email protected]> wrote: > >> How about modeling them as a collection (ARRAY or MULTISET)? This document >> describes them pretty well: http://farrago.sourceforge. >> net/design/CollectionTypes.html <http://farrago.sourceforge. >> net/design/CollectionTypes.html> >> >> >>> On Feb 14, 2017, at 5:08 PM, Gian Merlino <[email protected]> wrote: >>> >>> Hey Calcites, >>> >>> I'm hoping for some feedback on how to best handle Druid multi-value >>> columns in SQL. >>> >>> Background: Druid has a "multi-value column" feature that lets you have >>> string columns with more than one value per row. Details and examples of >>> how this works are here: >>> http://druid.io/docs/latest/querying/multi-value-dimensions.html. The >> short >>> version is that filtering on values works under the rule that "rows >> match a >>> filter if any value in a multi-value dimension matches your predicate". >>> Grouping works by sort of causing an explosion into multiple result rows, >>> similar to what Pig does when you flatten a bag. Selecting without >> grouping >>> doesn't do the exploding thing; instead it gives you the array of values. >>> >>> These behaviors are intended to make multi-value columns work well to >> hold >>> data like "tags" or "keywords" where you might want to ask questions >> like: >>> "how many rows have the tag 't1'" or "count the number of distinct users >>> for each tag". >>> >>> The current Calcite-based Druid SQL stuff doesn't handle this in any way >>> that really makes sense. The biggest issue is the expression simplifier, >>> which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false". >>> But, it's possible for a row to match that if "tags" is multi-value. >>> Another issue is that the type is reported as a simple "varchar" and >> there >>> is no indication that multiple values are possible. >>> >>> I'm wondering what _would_ make the most sense in the SQL framework. The >>> simplest thing is to keep reporting it as "varchar", adjust the >> expression >>> simplifying rules to be aware of the fact that some optimizations >> shouldn't >>> be applied to multi-value columns, and leave it at that. The behavior >>> wouldn't be quite what you would expect for a varchar type but it should >>> "work" in a sense. >>> >>> Or we could report a different type than "varchar" and maybe do some >> other >>> things differently too? >>> >>> Gian >> >>
