Thanks Andres, I think community input on direction here will be invaluable. There’s a bunch of interrelated tickets, and my opinions are as follows:
1. I think it is a mistake to offer a function MAX that operates over rows containing collections, returning the collection with the most elements. This is just a nonsensical operation to support IMO. We should decide as a community whether we “fix” this aggregation, or remove it. 2. I think “collection_" prefixed methods are non-intuitive for discovery, and all-else equal it would be better to use MAX,MIN, etc, same as for aggregations. 3. I think it is peculiar to permit methods named collection_ to operate over non-collection types when they are explicitly collection variants. Given (1), (2) becomes simple except for COUNT which remains ambiguous, but this could be solved by either providing a separate method for collections (e.g. SIZE) which seems fine to me, or by offering a precedence order for matching and a keyword for overriding the precedence order (e.g. COUNT(collection AS COLLECTION)). Given (2), (3) is a little more difficult. However, I think this can be solved several ways. - We could permit explicit casts to collection types, that for a collection type would be a no-op, and for a single value would create a collection - With precedence orders, by always selecting the scalar function last - By permitting WRITETIME to accept a binary operator reduce function to resolve multiple values These decisions all imply trade-offs on each other, and affect the evolution of CQL, so I think community input would be helpful. > On 6 Dec 2022, at 12:44, Andrés de la Peña <adelap...@apache.org> wrote: > > > This will require some long introduction for context: > > The MAX/MIN functions aggregate rows to get the row with min/max column value > according to their comparator. For collections, the comparison is on the > lexicographical order of the collection elements. That's the very same > comparator that is used when collections are used as clustering keys and for > ORDER BY. > > However, a bug in the MIN/MAX aggregate functions used to make that the > results were presented in their unserialized form, although the row selection > was correct. That bug was recently solved by CASSANDRA-17811. During that > ticket it was also considered the option of simply disabling MIN/MAX on > collection since applying those functions to collections, since they don't > seem super useful. However, that option was quickly discarded and the > operation was fixed so the MIN/MAX functions correctly work for every data > type. > > As a byproduct of the internal improvements of that fix, CASSANDRA-8877 > introduced a new set of functions that can perform aggregations of the > elements of a collection. Those where named "map_keys", "map_values", > "collection_min", "collection_max", "collection_sum", and "collection_count". > Those are the names mentioned on the mail list thread about function naming > conventions. Despite doing a kind of within-collection aggregation, these > functions are not what we usually call aggregate functions, since they don't > aggregate multiple rows together. > > On a different line of work, CASSANDRA-17425 added to trunk a MAXWRITETIME > function to get the max timestamp of a multi-cell column. However, the new > collection functions can be used in combination with the WRITETIME and TTL > functions to retrieve the min/max/sum/avg timestamp or ttl of a multi-cell > column. Since the new functions give a generic way of aggreagting timestamps > ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to remove that > MAXWRITETIME function. > > Yifan Cai, author of the MAXWRITETIME function, agreed to remove that > function in favour of the new generic collection functions. However, the > MAXWRITETIME function can work on both single-cell and multi-cell columns, > whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell > columns, That's because MAXWRITETIME of a not-multicell column doesn't return > a collection, and one should simply use "WRITETIME(column)" instead. So it > was proposed in CASSANDRA-18037 that collections functions applied to a > not-collection value consider that value as the only element of a singleton > collection. So, for example, COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. > That ticket has already been reviewed and it's mostly ready to commit. > > Now we can go straight to the point: > > Recently Benedict brought back the idea of deprecating aggregate functions > applied to collections, the very same idea that was mentioned on > CASSANDRA-17811 description almost four months ago. That way we could rename > the new collection functions MIN/MAX/SUM/AVG, same as the classic aggregate > functions. That way MIN/MAX/SUM/AVG would be an aggregate function when > applied to not-collection columns, and a scalar function when applied to > collection. We can't do that with COUNT because there would be an ambiguity, > so the proposal for that case is renaming COLLECTION_COUNT to SIZE. Benedict, > please correct me if I'm not correctly exposing the proposal. > > I however would prefer to keep aggregate functions working on collections, > and keep the names of the new collection functions as "COLLECTION_*". Reasons > are: > > 1 - Making aggregate functions not work on collections might be cosidered as > breaking backward compatibility and require a deprecation plan. > 2 - Keeping aggregate functions working on collections might not look > superuseful, but they make the set of aggregate functions consistent and > applicable to every column type. > 3 - Using the "COLLECTION_" prefix on collection functions establishes a > clear distinction between row aggregations and collection aggregations, while > at the same time exposing the analogy between each pair of functions. > 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds > such as using the column type when possible, or trying to figure out synonyms > like in the case of COUNT/SIZE. Even if that works for this case, future > functions can find more trouble when trying to figure out workarounds to > avoid clashing with existing function names. For example, we might want to > add a SIZE function that gets the size in bytes of any column, or we might > want to add a MAX function that gets the maximum of a set of columns, etc. > And example of the synonym-based approach that comes to mind is MySQL's MAX > and GREATEST functions, where MAX is for row aggregation and GREATEST is for > column aggregation. > 5 - If MIN/MAX function selection is based on the column type, we can't > implement Yifan's proposal of making COLLECTION_MAX(7) = COLLECTION_MAX([7]) > = 7, which would be very useful for combining collection functions with time > functions. > > What do others think? What should we do with aggregate functions on > collections, collection functions and MAXWRITETIME?