Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Benedict Thu, 08 Dec 2022 02:57:58 -0800

I meant unnest, not unwrap.


> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote:
> 
> 
> 
>> I do not think we should have functions that aggregate across rows and 
>> functions that operate within a row use the same name.
> 
> I’m sympathetic to that view for sure. I wouldn’t be too disappointed by that 
> outcome, and SQL engines seem to take a similar approach, however they mostly 
> rely on sub-queries to get around this problem, and the SQL standard 
> introduces UNWRAP for operating on arrays (by translating them into a table), 
> permitting subqueries to aggregate them. It seems to me we have four options:
> 
> 1) introduce functionality similar to UNWRAP and subqueries 
> 2) introduce new syntax to permit operating on collections with the same 
> functions
> 3) permit the same functions to operate on both, with a precedence order, and 
> introduce syntax to permit breaking the precedence order
> 4) introduce new functions
> 
> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) AS max_item 
> FROM table
> 
> (2) and (3) might look something like:
> 
> SELECT MAX(list AS COLLECTION) or 
> SELECT MAX(list AS ROWS)
> 
> (4) might look something like we have already, but perhaps with different 
> names
> 
>> The comparator for collections is the lexicographical compare on the 
>> collection items
> 
> This is a fair point, I mistakenly thought it sorted first on size. Even this 
> definition is a little funkier for Map types, where the values of a key may 
> cause something to sort earlier than a map whose next key sorts first. There 
> are multiple potential lexicographical sorts for Maps (i.e., by keys first, 
> then values, or by (key, value) pairs), so this is particularly poorly 
> defined IMO.
> 
> The maximum of a blob type is pretty well defined I think, as are boolean, 
> inetaddress etc. However, even for List or Set collections there’s multiple 
> reasonable functions one could define for maximum, so it would make more 
> sense to me to permit the user to define the comparison as part of the MAX 
> function if we are to offer it. However, with the lexicographical definition 
> we have I am somewhat less concerned for Set and List. Map seems like a real 
> problem though, if we support these operators (which perhaps we do not).
> 
> 
>>> On 7 Dec 2022, at 12:13, Andrés de la Peña <adelap...@apache.org> wrote:
>>> 
>> 
>> The comparator for collections is the lexicographical compare on the 
>> collection items. That might nor be the more useful thing but it's not 
>> impossible to imagine cases where that ordering can be useful. To make a 
>> random example, you can use a list column to store the name and surnames of 
>> a person, considering that some persons can have multiple surnames. You can 
>> then sort rows based on that list column, to get the names in alphabetical 
>> order, or to get the first or last person according to that order. I'm sure 
>> we can think on more cases were the lexicographical order of a list can be 
>> useful, although I agree that's is not the greatest feature ever.
>> 
>> It's worth mentioning that collections are not the only data types where the 
>> MIN/MAX functions are of dubious utility. For example, blob columns can also 
>> be used with MIN/MAX. Same as with collections, the min/max blobs are 
>> selected according to the comparator for the data type. That comparator is 
>> the lexicographic compare on the unsigned values of the byte contents. 
>> 
>> The utility of MIN/MAX on inet and boolean columns isn't very clear either, 
>> although one can always imagine use cases. Fox example, MAX of a boolean 
>> column can be used as a logical disjunction.
>> 
>> If we were to special-case MIN/MAX functions to reject collections, we 
>> should also reject other data types such as, at least, blobs. That would 
>> require a deprecation plan.
>> 
>> Also, it's not that the comparator used by MIN/MAX is an internal obscure 
>> thing. The action of that comparator is very visible when any of those data 
>> types is used in a clustering column, and it's used as the basis for "ORDER 
>> BY" clauses. Should we also reject blobs, collections, tuples and UDTs on 
>> "ORDER BY"? I don't think so.
>> 
>> I rather think that basing MIN/MAX on the regular order of the column data 
>> type is consistent, easy to do and easy to understand. 
>> 
>> I don't see the need to add rules explicitly forbidding some data types on 
>> MIN/MAX functions just because we can't easily figure out a use case for 
>> their ordering. Especially when we are exposing that same ordering on 
>> clusterings and "ORDER BY".
>> 
>>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>> If the functionality truly has never actually worked, then throwing an 
>>> error that MAX is not supported for collections seems reasonable.
>>> 
>>> But we should throw an error, I do not think we should have functions that 
>>> aggregate across rows and functions that operate within a row use the same 
>>> name.
>>> 
>>> My expectation as a user would be that MAX either always aggregates across 
>>> rows, so results in a single row of output or always operates within a row, 
>>> so returns the full set of rows matching the query.
>>> 
>>> So if we want a max that aggregates across rows that works for collections 
>>> we could change it to return the aggregated max across all rows. Or we just 
>>> leave it as an error and if someone wants the max across all rows they 
>>> would ask for MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX 
>>> may be a bad name.
>>> 
>>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> wrote:
>>> > 
>>> > As far as I am aware it has never worked in a release, and so 
>>> > deprecating it is probably not as challenging as you think. Only folk 
>>> > that have been able to parse the raw bytes of the collection in storage 
>>> > format would be affected - which we can probably treat as zero.
>>> > 
>>> > 
>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan <jeremiah.jor...@gmail.com> 
>>> >> wrote:
>>> >> 
>>> >> 
>>> >>> 
>>> >>> 1. I think it is a mistake to offer a function MAX that operates over 
>>> >>> rows containing collections, returning the collection with the most 
>>> >>> elements. This is just a nonsensical operation to support IMO. We 
>>> >>> should decide as a community whether we “fix” this aggregation, or 
>>> >>> remove it.
>>> >> 
>>> >> The current MAX function does not work this way afaik?  It returns the 
>>> >> row with the column that has the highest value in clustering order 
>>> >> sense, like if the collection was used as a clustering key.  While that 
>>> >> also may have limited use, I don’t think it worth while to deprecate 
>>> >> such use and all the headache that comes with doing so.
>>> >> 
>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for 
>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, etc, 
>>> >>> same as for aggregations.
>>> >> 
>>> >> If we actually wanted to move towards using the existing names with new 
>>> >> meanings, then I think that would take us multiple major releases.  
>>> >> First deprecate existing use in current releases.  Then make it an error 
>>> >> in the next major release X.  Then change the behavior in major release 
>>> >> X+1.  Just switching the behavior without having a major where such 
>>> >> queries error out would make a bunch of user queries start returning 
>>> >> “wrong” data.
>>> >> Also I don’t think those functions being cross row aggregations for some 
>>> >> column types, but within row collection operations for other types, is 
>>> >> any more intuitive, and actually would be more confusing.  So I am -1 on 
>>> >> using the same names.
>>> >> 
>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>> >>> operate over non-collection types when they are explicitly collection 
>>> >>> variants.
>>> >> 
>>> >> While I could see some point to this, I do not think it would be 
>>> >> confusing for something named collection_XXX to treat a non-collection 
>>> >> as a collection of 1.  But maybe there is a better name for these 
>>> >> function.  Rather than seeing them as collection variants, we should see 
>>> >> them as variants that operate on the data in a single row, rather than 
>>> >> aggregating across multiple rows.  But even with that perspective I 
>>> >> don’t know what the best name would be.
>>> >> 
>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict <bened...@apache.org> wrote:
>>> >>> 
>>> >>> Thanks Andres, I think community input on direction here will be 
>>> >>> invaluable. There’s a bunch of interrelated tickets, and my opinions 
>>> >>> are as follows:
>>> >>> 
>>> >>> 1. I think it is a mistake to offer a function MAX that operates over 
>>> >>> rows containing collections, returning the collection with the most 
>>> >>> elements. This is just a nonsensical operation to support IMO. We 
>>> >>> should decide as a community whether we “fix” this aggregation, or 
>>> >>> remove it.
>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for 
>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, etc, 
>>> >>> same as for aggregations.
>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>> >>> operate over non-collection types when they are explicitly collection 
>>> >>> variants.
>>> >>> 
>>> >>> Given (1), (2) becomes simple except for COUNT which remains ambiguous, 
>>> >>> but this could be solved by either providing a separate method for 
>>> >>> collections (e.g. SIZE) which seems fine to me, or by offering a 
>>> >>> precedence order for matching and a keyword for overriding the 
>>> >>> precedence order (e.g. COUNT(collection AS COLLECTION)).
>>> >>> 
>>> >>> Given (2), (3) is a little more difficult. However, I think this can be 
>>> >>> solved several ways. 
>>> >>> - We could permit explicit casts to collection types, that for a 
>>> >>> collection type would be a no-op, and for a single value would create a 
>>> >>> collection
>>> >>> - With precedence orders, by always selecting the scalar function last
>>> >>> - By permitting WRITETIME to accept a binary operator reduce function 
>>> >>> to resolve multiple values
>>> >>> 
>>> >>> These decisions all imply trade-offs on each other, and affect the 
>>> >>> evolution of CQL, so I think community input would be helpful.
>>> >>> 
>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña <adelap...@apache.org> 
>>> >>>>> wrote:
>>> >>>> 
>>> >>>> 
>>> >>>> This will require some long introduction for context:
>>> >>>> 
>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max 
>>> >>>> column value according to their comparator. For collections, the 
>>> >>>> comparison is on the lexicographical order of the collection elements. 
>>> >>>> That's the very same comparator that is used when collections are used 
>>> >>>> as clustering keys and for ORDER BY.
>>> >>>> 
>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make that 
>>> >>>> the results were presented in their unserialized form, although the 
>>> >>>> row selection was correct. That bug was recently solved by 
>>> >>>> CASSANDRA-17811. During that ticket it was also considered the option 
>>> >>>> of simply disabling MIN/MAX on collection since applying those 
>>> >>>> functions to collections, since they don't seem super useful. However, 
>>> >>>> that option was quickly discarded and the operation was fixed so the 
>>> >>>> MIN/MAX functions correctly work for every data type.
>>> >>>> 
>>> >>>> As a byproduct of the internal improvements of that fix, 
>>> >>>> CASSANDRA-8877 introduced a new set of functions that can perform 
>>> >>>> aggregations of the elements of a collection. Those where named 
>>> >>>> "map_keys", "map_values", "collection_min", "collection_max", 
>>> >>>> "collection_sum", and "collection_count". Those are the names 
>>> >>>> mentioned on the mail list thread about function naming conventions. 
>>> >>>> Despite doing a kind of within-collection aggregation, these functions 
>>> >>>> are not what we usually call aggregate functions, since they don't 
>>> >>>> aggregate multiple rows together.
>>> >>>> 
>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a 
>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell column. 
>>> >>>> However, the new collection functions can be used in combination with 
>>> >>>> the WRITETIME and TTL functions to retrieve the min/max/sum/avg 
>>> >>>> timestamp or ttl of a multi-cell column. Since the new functions give 
>>> >>>> a generic way of aggreagting timestamps ant TTLs of multi-cell 
>>> >>>> columns, CASSANDRA-18078 proposed to remove that MAXWRITETIME function.
>>> >>>> 
>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that 
>>> >>>> function in favour of the new generic collection functions. However, 
>>> >>>> the MAXWRITETIME function can work on both single-cell and multi-cell 
>>> >>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work 
>>> >>>> on multi-cell columns, That's because MAXWRITETIME of a not-multicell 
>>> >>>> column doesn't return a collection, and one should simply use 
>>> >>>> "WRITETIME(column)" instead. So it was proposed in CASSANDRA-18037 
>>> >>>> that collections functions applied to a not-collection value consider 
>>> >>>> that value as the only element of a singleton collection. So, for 
>>> >>>> example, COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That ticket has 
>>> >>>> already been reviewed and it's mostly ready to commit.
>>> >>>> 
>>> >>>> Now we can go straight to the point:
>>> >>>> 
>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>> >>>> functions applied to collections, the very same idea that was 
>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. That 
>>> >>>> way we could rename the new collection functions MIN/MAX/SUM/AVG, same 
>>> >>>> as the classic aggregate functions. That way MIN/MAX/SUM/AVG would be 
>>> >>>> an aggregate function when applied to not-collection columns, and a 
>>> >>>> scalar function when applied to collection. We can't do that with 
>>> >>>> COUNT because there would be an ambiguity, so the proposal for that 
>>> >>>> case is renaming COLLECTION_COUNT to SIZE. Benedict, please correct me 
>>> >>>> if I'm not correctly exposing the proposal.
>>> >>>> 
>>> >>>> I however would prefer to keep aggregate functions working on 
>>> >>>> collections, and keep the names of the new collection functions as 
>>> >>>> "COLLECTION_*". Reasons are:
>>> >>>> 
>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>> >>>> cosidered as breaking backward compatibility and require a deprecation 
>>> >>>> plan. 
>>> >>>> 2 - Keeping aggregate functions working on collections might not look 
>>> >>>> superuseful, but they make the set of aggregate functions consistent 
>>> >>>> and applicable to every column type.
>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes 
>>> >>>> a clear distinction between row aggregations and collection 
>>> >>>> aggregations, while at the same time exposing the analogy between each 
>>> >>>> pair of functions.
>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>> >>>> workarounds such as using the column type when possible, or trying to 
>>> >>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works 
>>> >>>> for this case, future functions can find more trouble when trying to 
>>> >>>> figure out workarounds to avoid clashing with existing function names. 
>>> >>>> For example, we might want to add a SIZE function that gets the size 
>>> >>>> in bytes of any column, or we might want to add a MAX function that 
>>> >>>> gets the maximum of a set of columns, etc. And example of the 
>>> >>>> synonym-based approach that comes to mind is MySQL's MAX and GREATEST 
>>> >>>> functions, where MAX is for row aggregation and GREATEST is for column 
>>> >>>> aggregation.
>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we 
>>> >>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>> >>>> collection functions with time functions.
>>> >>>> 
>>> >>>> What do others think? What should we do with aggregate functions on 
>>> >>>> collections, collection functions and MAXWRITETIME?
>>> >>

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Reply via email to