I meant unnest, not unwrap.
> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote: > > > >> I do not think we should have functions that aggregate across rows and >> functions that operate within a row use the same name. > > I’m sympathetic to that view for sure. I wouldn’t be too disappointed by that > outcome, and SQL engines seem to take a similar approach, however they mostly > rely on sub-queries to get around this problem, and the SQL standard > introduces UNWRAP for operating on arrays (by translating them into a table), > permitting subqueries to aggregate them. It seems to me we have four options: > > 1) introduce functionality similar to UNWRAP and subqueries > 2) introduce new syntax to permit operating on collections with the same > functions > 3) permit the same functions to operate on both, with a precedence order, and > introduce syntax to permit breaking the precedence order > 4) introduce new functions > > (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) AS max_item > FROM table > > (2) and (3) might look something like: > > SELECT MAX(list AS COLLECTION) or > SELECT MAX(list AS ROWS) > > (4) might look something like we have already, but perhaps with different > names > >> The comparator for collections is the lexicographical compare on the >> collection items > > This is a fair point, I mistakenly thought it sorted first on size. Even this > definition is a little funkier for Map types, where the values of a key may > cause something to sort earlier than a map whose next key sorts first. There > are multiple potential lexicographical sorts for Maps (i.e., by keys first, > then values, or by (key, value) pairs), so this is particularly poorly > defined IMO. > > The maximum of a blob type is pretty well defined I think, as are boolean, > inetaddress etc. However, even for List or Set collections there’s multiple > reasonable functions one could define for maximum, so it would make more > sense to me to permit the user to define the comparison as part of the MAX > function if we are to offer it. However, with the lexicographical definition > we have I am somewhat less concerned for Set and List. Map seems like a real > problem though, if we support these operators (which perhaps we do not). > > >>> On 7 Dec 2022, at 12:13, Andrés de la Peña <adelap...@apache.org> wrote: >>> >> >> The comparator for collections is the lexicographical compare on the >> collection items. That might nor be the more useful thing but it's not >> impossible to imagine cases where that ordering can be useful. To make a >> random example, you can use a list column to store the name and surnames of >> a person, considering that some persons can have multiple surnames. You can >> then sort rows based on that list column, to get the names in alphabetical >> order, or to get the first or last person according to that order. I'm sure >> we can think on more cases were the lexicographical order of a list can be >> useful, although I agree that's is not the greatest feature ever. >> >> It's worth mentioning that collections are not the only data types where the >> MIN/MAX functions are of dubious utility. For example, blob columns can also >> be used with MIN/MAX. Same as with collections, the min/max blobs are >> selected according to the comparator for the data type. That comparator is >> the lexicographic compare on the unsigned values of the byte contents. >> >> The utility of MIN/MAX on inet and boolean columns isn't very clear either, >> although one can always imagine use cases. Fox example, MAX of a boolean >> column can be used as a logical disjunction. >> >> If we were to special-case MIN/MAX functions to reject collections, we >> should also reject other data types such as, at least, blobs. That would >> require a deprecation plan. >> >> Also, it's not that the comparator used by MIN/MAX is an internal obscure >> thing. The action of that comparator is very visible when any of those data >> types is used in a clustering column, and it's used as the basis for "ORDER >> BY" clauses. Should we also reject blobs, collections, tuples and UDTs on >> "ORDER BY"? I don't think so. >> >> I rather think that basing MIN/MAX on the regular order of the column data >> type is consistent, easy to do and easy to understand. >> >> I don't see the need to add rules explicitly forbidding some data types on >> MIN/MAX functions just because we can't easily figure out a use case for >> their ordering. Especially when we are exposing that same ordering on >> clusterings and "ORDER BY". >> >>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan <jeremiah.jor...@gmail.com> wrote: >>> If the functionality truly has never actually worked, then throwing an >>> error that MAX is not supported for collections seems reasonable. >>> >>> But we should throw an error, I do not think we should have functions that >>> aggregate across rows and functions that operate within a row use the same >>> name. >>> >>> My expectation as a user would be that MAX either always aggregates across >>> rows, so results in a single row of output or always operates within a row, >>> so returns the full set of rows matching the query. >>> >>> So if we want a max that aggregates across rows that works for collections >>> we could change it to return the aggregated max across all rows. Or we just >>> leave it as an error and if someone wants the max across all rows they >>> would ask for MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX >>> may be a bad name. >>> >>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> wrote: >>> > >>> > As far as I am aware it has never worked in a release, and so >>> > deprecating it is probably not as challenging as you think. Only folk >>> > that have been able to parse the raw bytes of the collection in storage >>> > format would be affected - which we can probably treat as zero. >>> > >>> > >>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan <jeremiah.jor...@gmail.com> >>> >> wrote: >>> >> >>> >> >>> >>> >>> >>> 1. I think it is a mistake to offer a function MAX that operates over >>> >>> rows containing collections, returning the collection with the most >>> >>> elements. This is just a nonsensical operation to support IMO. We >>> >>> should decide as a community whether we “fix” this aggregation, or >>> >>> remove it. >>> >> >>> >> The current MAX function does not work this way afaik? It returns the >>> >> row with the column that has the highest value in clustering order >>> >> sense, like if the collection was used as a clustering key. While that >>> >> also may have limited use, I don’t think it worth while to deprecate >>> >> such use and all the headache that comes with doing so. >>> >> >>> >>> 2. I think “collection_" prefixed methods are non-intuitive for >>> >>> discovery, and all-else equal it would be better to use MAX,MIN, etc, >>> >>> same as for aggregations. >>> >> >>> >> If we actually wanted to move towards using the existing names with new >>> >> meanings, then I think that would take us multiple major releases. >>> >> First deprecate existing use in current releases. Then make it an error >>> >> in the next major release X. Then change the behavior in major release >>> >> X+1. Just switching the behavior without having a major where such >>> >> queries error out would make a bunch of user queries start returning >>> >> “wrong” data. >>> >> Also I don’t think those functions being cross row aggregations for some >>> >> column types, but within row collection operations for other types, is >>> >> any more intuitive, and actually would be more confusing. So I am -1 on >>> >> using the same names. >>> >> >>> >>> 3. I think it is peculiar to permit methods named collection_ to >>> >>> operate over non-collection types when they are explicitly collection >>> >>> variants. >>> >> >>> >> While I could see some point to this, I do not think it would be >>> >> confusing for something named collection_XXX to treat a non-collection >>> >> as a collection of 1. But maybe there is a better name for these >>> >> function. Rather than seeing them as collection variants, we should see >>> >> them as variants that operate on the data in a single row, rather than >>> >> aggregating across multiple rows. But even with that perspective I >>> >> don’t know what the best name would be. >>> >> >>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict <bened...@apache.org> wrote: >>> >>> >>> >>> Thanks Andres, I think community input on direction here will be >>> >>> invaluable. There’s a bunch of interrelated tickets, and my opinions >>> >>> are as follows: >>> >>> >>> >>> 1. I think it is a mistake to offer a function MAX that operates over >>> >>> rows containing collections, returning the collection with the most >>> >>> elements. This is just a nonsensical operation to support IMO. We >>> >>> should decide as a community whether we “fix” this aggregation, or >>> >>> remove it. >>> >>> 2. I think “collection_" prefixed methods are non-intuitive for >>> >>> discovery, and all-else equal it would be better to use MAX,MIN, etc, >>> >>> same as for aggregations. >>> >>> 3. I think it is peculiar to permit methods named collection_ to >>> >>> operate over non-collection types when they are explicitly collection >>> >>> variants. >>> >>> >>> >>> Given (1), (2) becomes simple except for COUNT which remains ambiguous, >>> >>> but this could be solved by either providing a separate method for >>> >>> collections (e.g. SIZE) which seems fine to me, or by offering a >>> >>> precedence order for matching and a keyword for overriding the >>> >>> precedence order (e.g. COUNT(collection AS COLLECTION)). >>> >>> >>> >>> Given (2), (3) is a little more difficult. However, I think this can be >>> >>> solved several ways. >>> >>> - We could permit explicit casts to collection types, that for a >>> >>> collection type would be a no-op, and for a single value would create a >>> >>> collection >>> >>> - With precedence orders, by always selecting the scalar function last >>> >>> - By permitting WRITETIME to accept a binary operator reduce function >>> >>> to resolve multiple values >>> >>> >>> >>> These decisions all imply trade-offs on each other, and affect the >>> >>> evolution of CQL, so I think community input would be helpful. >>> >>> >>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña <adelap...@apache.org> >>> >>>>> wrote: >>> >>>> >>> >>>> >>> >>>> This will require some long introduction for context: >>> >>>> >>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max >>> >>>> column value according to their comparator. For collections, the >>> >>>> comparison is on the lexicographical order of the collection elements. >>> >>>> That's the very same comparator that is used when collections are used >>> >>>> as clustering keys and for ORDER BY. >>> >>>> >>> >>>> However, a bug in the MIN/MAX aggregate functions used to make that >>> >>>> the results were presented in their unserialized form, although the >>> >>>> row selection was correct. That bug was recently solved by >>> >>>> CASSANDRA-17811. During that ticket it was also considered the option >>> >>>> of simply disabling MIN/MAX on collection since applying those >>> >>>> functions to collections, since they don't seem super useful. However, >>> >>>> that option was quickly discarded and the operation was fixed so the >>> >>>> MIN/MAX functions correctly work for every data type. >>> >>>> >>> >>>> As a byproduct of the internal improvements of that fix, >>> >>>> CASSANDRA-8877 introduced a new set of functions that can perform >>> >>>> aggregations of the elements of a collection. Those where named >>> >>>> "map_keys", "map_values", "collection_min", "collection_max", >>> >>>> "collection_sum", and "collection_count". Those are the names >>> >>>> mentioned on the mail list thread about function naming conventions. >>> >>>> Despite doing a kind of within-collection aggregation, these functions >>> >>>> are not what we usually call aggregate functions, since they don't >>> >>>> aggregate multiple rows together. >>> >>>> >>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a >>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell column. >>> >>>> However, the new collection functions can be used in combination with >>> >>>> the WRITETIME and TTL functions to retrieve the min/max/sum/avg >>> >>>> timestamp or ttl of a multi-cell column. Since the new functions give >>> >>>> a generic way of aggreagting timestamps ant TTLs of multi-cell >>> >>>> columns, CASSANDRA-18078 proposed to remove that MAXWRITETIME function. >>> >>>> >>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that >>> >>>> function in favour of the new generic collection functions. However, >>> >>>> the MAXWRITETIME function can work on both single-cell and multi-cell >>> >>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work >>> >>>> on multi-cell columns, That's because MAXWRITETIME of a not-multicell >>> >>>> column doesn't return a collection, and one should simply use >>> >>>> "WRITETIME(column)" instead. So it was proposed in CASSANDRA-18037 >>> >>>> that collections functions applied to a not-collection value consider >>> >>>> that value as the only element of a singleton collection. So, for >>> >>>> example, COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That ticket has >>> >>>> already been reviewed and it's mostly ready to commit. >>> >>>> >>> >>>> Now we can go straight to the point: >>> >>>> >>> >>>> Recently Benedict brought back the idea of deprecating aggregate >>> >>>> functions applied to collections, the very same idea that was >>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. That >>> >>>> way we could rename the new collection functions MIN/MAX/SUM/AVG, same >>> >>>> as the classic aggregate functions. That way MIN/MAX/SUM/AVG would be >>> >>>> an aggregate function when applied to not-collection columns, and a >>> >>>> scalar function when applied to collection. We can't do that with >>> >>>> COUNT because there would be an ambiguity, so the proposal for that >>> >>>> case is renaming COLLECTION_COUNT to SIZE. Benedict, please correct me >>> >>>> if I'm not correctly exposing the proposal. >>> >>>> >>> >>>> I however would prefer to keep aggregate functions working on >>> >>>> collections, and keep the names of the new collection functions as >>> >>>> "COLLECTION_*". Reasons are: >>> >>>> >>> >>>> 1 - Making aggregate functions not work on collections might be >>> >>>> cosidered as breaking backward compatibility and require a deprecation >>> >>>> plan. >>> >>>> 2 - Keeping aggregate functions working on collections might not look >>> >>>> superuseful, but they make the set of aggregate functions consistent >>> >>>> and applicable to every column type. >>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes >>> >>>> a clear distinction between row aggregations and collection >>> >>>> aggregations, while at the same time exposing the analogy between each >>> >>>> pair of functions. >>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for >>> >>>> workarounds such as using the column type when possible, or trying to >>> >>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works >>> >>>> for this case, future functions can find more trouble when trying to >>> >>>> figure out workarounds to avoid clashing with existing function names. >>> >>>> For example, we might want to add a SIZE function that gets the size >>> >>>> in bytes of any column, or we might want to add a MAX function that >>> >>>> gets the maximum of a set of columns, etc. And example of the >>> >>>> synonym-based approach that comes to mind is MySQL's MAX and GREATEST >>> >>>> functions, where MAX is for row aggregation and GREATEST is for column >>> >>>> aggregation. >>> >>>> 5 - If MIN/MAX function selection is based on the column type, we >>> >>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) = >>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining >>> >>>> collection functions with time functions. >>> >>>> >>> >>>> What do others think? What should we do with aggregate functions on >>> >>>> collections, collection functions and MAXWRITETIME? >>> >>