Hi, I have the following use case: a product (P) has 3 or more devices associated with it. Each device (Di) emits a set of names (of size at most 250) every minute.

The ask is to compute the function *foo*(*product*, *hour*), defined as the number of names seen by *all* of the devices associated with the given *product* in the given *hour*.

Example: let's say product p1 has devices d1, d2, d3 associated with it, and let S(d,h) be the *set* of names seen by device d in hour h. Then foo(p1,h) = length(S(d1,h) intersect S(d2,h) intersect S(d3,h)).
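For concreteness, here is the definition sketched in plain Python with made-up data (the device-to-product mapping and the S values are hypothetical):

```python
# S[(device, hour)] -> set of names seen by that device in that hour
# (hypothetical sample data)
S = {
    ("d1", 0): {"a", "b", "c"},
    ("d2", 0): {"b", "c", "d"},
    ("d3", 0): {"c", "b", "e"},
}

# product -> its devices (hypothetical mapping)
devices_of = {"p1": ["d1", "d2", "d3"]}

def foo(product, hour):
    # Intersect the name sets of all devices belonging to the product.
    sets = [S[(d, hour)] for d in devices_of[product]]
    return len(set.intersection(*sets))

print(foo("p1", 0))  # "b" and "c" are seen by all three devices -> 2
```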
I came up with the following approaches, but I am not convinced by them:

*Approach A.* Create a column family with this schema: hour_data(hour, product, name, device_id_set), where device_id_set is a Set&lt;String&gt;. Primary key: (hour, product, name).

*Issue*: I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE hour=&lt;h&gt; AND product=p AND length(device_id_set)=3, because filtering on a collection's contents or size is not possible in CQL.

*Approach B.* Create a column family with this schema: hour_data(hour, product, name, num_devices_counter), where num_devices_counter is a counter. Primary key: (hour, product, name).

*Issue*: I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE hour=&lt;h&gt; AND product=p AND num_devices_counter=3, because a counter column cannot appear in a WHERE clause.

*Approach C.* Column family schema: hour_data(hour, device, name). Primary key: (hour, device, name). To compute foo(p1,h), read the data for every device from *hour_data* and perform the intersection in Spark.

*Issue*: this is a heavy operation and demands multiple big machines.

Could you please help me refine these schemas, or define a new schema that solves my problem?
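One client-side workaround for Approach B, sketched below with in-memory stand-in data rather than a live Cassandra session: select all (name, num_devices_counter) rows for the (hour, product) partition, which *is* a valid CQL query, and do the counter comparison in the application. The `rows` list and the function name are illustrative only; a real version would fetch the rows with the cassandra-driver.

```python
# rows stands in for the result of:
#   SELECT name, num_devices_counter FROM hour_data
#   WHERE hour = ? AND product = ?
# (hypothetical data)
rows = [
    ("alpha", 3),  # seen by all 3 devices
    ("beta", 2),
    ("gamma", 3),  # seen by all 3 devices
    ("delta", 1),
]

def foo_from_rows(rows, num_devices):
    # Count names whose counter reached the product's device count,
    # i.e. names seen by every device in the hour.
    return sum(1 for _name, count in rows if count == num_devices)

print(foo_from_rows(rows, 3))  # -> 2
```

This keeps a single partition read per (hour, product) and moves only the counter filter client-side; with at most 250 names per device per minute, the partition stays small enough for this to be cheap.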