Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Benedict Tue, 06 Dec 2022 05:30:57 -0800

Thanks Andres, I think community input on direction here will be invaluable. 
There’s a bunch of interrelated tickets, and my opinions are as follows:


1. I think it is a mistake to offer a function MAX that operates over rows 
containing collections, returning the collection with the most elements. This 
is just a nonsensical operation to support IMO. We should decide as a community 
whether we “fix” this aggregation, or remove it.
2. I think “collection_" prefixed methods are non-intuitive for discovery, and 
all-else equal it would be better to use MAX,MIN, etc, same as for aggregations.
3. I think it is peculiar to permit methods named collection_ to operate over 
non-collection types when they are explicitly collection variants.

Given (1), (2) becomes simple except for COUNT which remains ambiguous, but 
this could be solved by either providing a separate method for collections 
(e.g. SIZE) which seems fine to me, or by offering a precedence order for 
matching and a keyword for overriding the precedence order (e.g. 
COUNT(collection AS COLLECTION)).

Given (2), (3) is a little more difficult. However, I think this can be solved 
several ways. 
 - We could permit explicit casts to collection types, that for a collection 
type would be a no-op, and for a single value would create a collection
 - With precedence orders, by always selecting the scalar function last
 - By permitting WRITETIME to accept a binary operator reduce function to 
resolve multiple values

These decisions all imply trade-offs on each other, and affect the evolution of 
CQL, so I think community input would be helpful.

> On 6 Dec 2022, at 12:44, Andrés de la Peña <adelap...@apache.org> wrote:
> 
> 
> This will require some long introduction for context:
> 
> The MAX/MIN functions aggregate rows to get the row with min/max column value 
> according to their comparator. For collections, the comparison is on the 
> lexicographical order of the collection elements. That's the very same 
> comparator that is used when collections are used as clustering keys and for 
> ORDER BY.
> 
> However, a bug in the MIN/MAX aggregate functions used to make that the 
> results were presented in their unserialized form, although the row selection 
> was correct. That bug was recently solved by CASSANDRA-17811. During that 
> ticket it was also considered the option of simply disabling MIN/MAX on 
> collection since applying those functions to collections, since they don't 
> seem super useful. However, that option was quickly discarded and the 
> operation was fixed so the MIN/MAX functions correctly work for every data 
> type.
> 
> As a byproduct of the internal improvements of that fix, CASSANDRA-8877 
> introduced a new set of functions that can perform aggregations of the 
> elements of a collection. Those where named "map_keys", "map_values", 
> "collection_min", "collection_max", "collection_sum", and "collection_count". 
> Those are the names mentioned on the mail list thread about function naming 
> conventions. Despite doing a kind of within-collection aggregation, these 
> functions are not what we usually call aggregate functions, since they don't 
> aggregate multiple rows together.
> 
> On a different line of work, CASSANDRA-17425 added to trunk a MAXWRITETIME 
> function to get the max timestamp of a multi-cell column. However, the new 
> collection functions can be used in combination with the WRITETIME and TTL 
> functions to retrieve the min/max/sum/avg timestamp or ttl of a multi-cell 
> column. Since the new functions give a generic way of aggreagting timestamps 
> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to remove that 
> MAXWRITETIME function.
> 
> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that 
> function in favour of the new generic collection functions. However, the 
> MAXWRITETIME function can work on both single-cell and multi-cell columns, 
> whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell 
> columns, That's because MAXWRITETIME of a not-multicell column doesn't return 
> a collection, and one should simply use "WRITETIME(column)" instead. So it 
> was proposed in CASSANDRA-18037 that collections functions applied to a 
> not-collection value consider that value as the only element of a singleton 
> collection. So, for example, COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. 
> That ticket has already been reviewed and it's mostly ready to commit.
> 
> Now we can go straight to the point:
> 
> Recently Benedict brought back the idea of deprecating aggregate functions 
> applied to collections, the very same idea that was mentioned on 
> CASSANDRA-17811 description almost four months ago. That way we could rename 
> the new collection functions MIN/MAX/SUM/AVG, same as the classic aggregate 
> functions. That way MIN/MAX/SUM/AVG would be an aggregate function when 
> applied to not-collection columns, and a scalar function when applied to 
> collection. We can't do that with COUNT because there would be an ambiguity, 
> so the proposal for that case is renaming COLLECTION_COUNT to SIZE. Benedict, 
> please correct me if I'm not correctly exposing the proposal.
> 
> I however would prefer to keep aggregate functions working on collections, 
> and keep the names of the new collection functions as "COLLECTION_*". Reasons 
> are:
> 
> 1 - Making aggregate functions not work on collections might be cosidered as 
> breaking backward compatibility and require a deprecation plan. 
> 2 - Keeping aggregate functions working on collections might not look 
> superuseful, but they make the set of aggregate functions consistent and 
> applicable to every column type.
> 3 - Using the "COLLECTION_" prefix on collection functions establishes a 
> clear distinction between row aggregations and collection aggregations, while 
> at the same time exposing the analogy between each pair of functions.
> 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds 
> such as using the column type when possible, or trying to figure out synonyms 
> like in the case of COUNT/SIZE. Even if that works for this case, future 
> functions can find more trouble when trying to figure out workarounds to 
> avoid clashing with existing function names. For example, we might want to 
> add a SIZE function that gets the size in bytes of any column, or we might 
> want to add a MAX function that gets the maximum of a set of columns, etc. 
> And example of the synonym-based approach that comes to mind is MySQL's MAX 
> and GREATEST functions, where MAX is for row aggregation and GREATEST is for 
> column aggregation.
> 5 - If MIN/MAX function selection is based on the column type, we can't 
> implement Yifan's proposal of making COLLECTION_MAX(7) = COLLECTION_MAX([7]) 
> = 7, which would be very useful for combining collection functions with time 
> functions.
> 
> What do others think? What should we do with aggregate functions on 
> collections, collection functions and MAXWRITETIME?

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Reply via email to