Re: Using User Defined Functions in UPDATE queries

Kim Liu Fri, 11 Mar 2016 08:10:49 -0800

Just for the sake of clarification, then, what is the use-case for having UDFs 
in an UPDATE?


If they cannot read data from the data store, then all of the parameters to the 
UDF must be supplied by the client, correct?

If the client has all the parameters, the client could perform the equivalent 
of the UDF on the client side, first, then send the results to the server, 
instead of pushing the computation work onto the server.  So I am curious as to 
what one is supposed to use a UDF in an UPDATE for.



Long-winded explanation of the use-case I was poking at using UPDATE UDFs for 
below for the morbidly curious.




That morbidly curious, huh?

The scenario is, roughly, that the application receives a set of data which is 
broken up over, say, four messages (A,B,C,D).  However, the messages can arrive 
in any order, possibly with duplicates, and the data set is not complete until 
all four messages are received.  There are multiple message receivers in 
order to scale to the volume of messages coming in, so each of the four 
messages per data set could arrive at any receiver (in any chronological 
pattern), and each receiving station would then insert the partial data into 
Cassandra.

I looked at the Cassandra SET implementation, thinking that I could just add 
‘A’, ‘B’, ‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index.  Then 
periodically search for where the set had all elements to spot which rows had a 
complete data set ready for processing.  However, there does not appear to be 
an equality check for SETs.  (Adding elements to a set is another place where 
UPDATE appears to allow for the “x = x <operator> <data>” pattern which added 
to my confusion about using a UDF in the UPDATE.)

So instead of using sets, the idea was to have a UDF perform a bit-wise OR 
operation.  Roughly:
  CREATE FUNCTION bitwise_or( a int, b int )
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'return Integer.valueOf((a == null ? 0 : a) | (b == null ? 0 : b));';

Then as each message segment came in, I had intended, roughly:
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2), data2=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1), data1=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8), data4=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4), data3=… ;

Then, with a secondary index on ‘messageComplete’, periodically scrape out all 
rows where messageComplete was equal to 15.  (At most, sixteen unique values in 
the secondary index.)  (And use a TTL to expire messages that did not 
eventually complete, etc.  Boilerplate infrastructure, etc.)
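
A minimal sketch, in plain Java rather than CQL, of the bit arithmetic this 
scheme relies on. The class name SegmentMask and its method names are 
hypothetical; the bit values 1, 2, 4, 8 and the "complete" value 15 are the 
ones described above. It only illustrates that OR-ing segment bits gives the 
same result for any arrival order and is unaffected by duplicates; it says 
nothing about whether Cassandra will evaluate such an UPDATE.

```java
// Hypothetical illustration of the messageComplete bitmask; not Cassandra code.
final class SegmentMask {
    // All four segment bits set: 1 | 2 | 4 | 8 == 15.
    static final int COMPLETE = 0b1111;

    // Same null handling as the bitwise_or UDF: a null operand counts as 0.
    static int bitwiseOr(Integer a, Integer b) {
        return (a == null ? 0 : a) | (b == null ? 0 : b);
    }

    // A data set is ready for processing once every segment bit is present.
    static boolean isComplete(Integer mask) {
        return mask != null && mask == COMPLETE;
    }

    public static void main(String[] args) {
        Integer mask = null;                 // column not yet written
        int[] arrivals = {2, 1, 2, 8, 4};    // out of order, with one duplicate
        for (int bit : arrivals) {
            mask = bitwiseOr(mask, bit);
            System.out.println("after " + bit + ": mask=" + mask
                    + " complete=" + isComplete(mask));
        }
    }
}
```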

This was based upon my incorrect assumption about UPDATE UDFs, since this 
looked like an optimal way to avoid having all the clients perform read-update 
patterns and worrying about the clients stepping on each other's data, as well 
as handling cases where duplicate messages were received by different 
receivers.  So it’s starting to look like I might need to use something else to 
perform the correlation between messages.

—Kim

From: Sylvain Lebresne <sylv...@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, March 11, 2016 at 00:35
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Using User Defined Functions in UPDATE queries

UDFs are usable in UPDATE statements, as actually trying them shows; it's just 
the documented grammar that needs fixing.

But as far as doing something like:
  UPDATE test_table SET data=max_int(data,5) WHERE idx='abc';
this is indeed *not* supported and likely never will be. One big pillar of C* 
design is that normal writes like this don't do a read-before-write, both for 
performance and because of consistency constraints, so we can't have an update 
depend on the previous value in any way.
I'll note that maybe that makes UDFs useless for you and if so, I'm sorry, but 
you just can't use UDFs in C* for that and you'd have to do a manual 
read-before-write client side to achieve this.
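
As a rough sketch of what a manual read-before-write on the client side looks 
like, the Java below reads the current value, computes the new one locally, 
and writes it back. Everything in it is an assumption made for illustration: 
the class ReadBeforeWriteSketch, the method orInto, and the in-memory map 
standing in for the table are hypothetical, and none of it is the Cassandra 
driver API. The write is made conditional on the value that was read, so that 
a concurrent writer causes a retry instead of a lost update; that retry loop is 
an addition of this sketch, not something prescribed in this thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a table; this is not the Cassandra driver API.
final class ReadBeforeWriteSketch {
    private final ConcurrentMap<String, Integer> rows = new ConcurrentHashMap<>();

    // Read the current value, compute the new one on the client, write it back.
    // The write only succeeds if the row still holds the value that was read;
    // otherwise the loop re-reads and tries again.
    int orInto(String key, int bit) {
        while (true) {
            Integer current = rows.get(key);                   // the read
            int updated = (current == null ? 0 : current) | bit;
            boolean written = (current == null)
                    ? rows.putIfAbsent(key, updated) == null   // first write for this key
                    : rows.replace(key, current, updated);     // conditional write
            if (written) {
                return updated;
            }
        }
    }

    public static void main(String[] args) {
        ReadBeforeWriteSketch store = new ReadBeforeWriteSketch();
        store.orInto("abc", 2);
        store.orInto("abc", 1);
        // A duplicate segment leaves the stored value unchanged.
        System.out.println(store.orInto("abc", 2));
    }
}
```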

For the sake of avoiding confusion, I will note that we do allow:
  UPDATE test_table SET c = c + 1 WHERE idx='abc';
if c is a counter, but that's a very special case. Counters have a completely 
separate path and implementation and do have a read-before-write (and are 
slower than normal updates as a result).
