Hi,
I have the following use case:
A product (P) has 3 or more devices associated with it. Each device (Di)
emits a set of names (at most 250 names) every minute.
Now the ask is: compute the function foo(product, hour), which is defined as
follows:
*foo*(*product*, *hour*) = the number of names seen by all the devices
associated with the given *product* in the given *hour*.
Example:
Let's say product p1 has devices d1, d2, d3 associated with it.
Let's say S(d,h) is the *set* of names seen by device d in hour h.
So foo(p1,h) = length(S(d1,h) intersect S(d2,h) intersect S(d3,h)).

I came up with the following approaches, but I am not convinced by them:
Approach A.
Create a column family with the following schema:
column family name : hour_data
hour_data(hour,product,name,device_id_set)
device_id_set is Set<String>
Primary Key: (hour,product,name)
*Issue*:
I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE hour=<h>
AND product=p AND length(device_id_set)=3, because CQL cannot filter on
collection columns (and has no length() function for them); see the sketch
below.
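For concreteness, this is roughly what I mean in CQL (the column types are my
assumption; the query is shown as a comment because it is exactly the one CQL
rejects):

CREATE TABLE hour_data (
    hour timestamp,
    product text,
    name text,
    device_id_set set<text>,
    PRIMARY KEY (hour, product, name)
);

-- The query I would like to run, which is invalid: collection columns
-- cannot appear in a WHERE clause, and there is no length() function:
-- SELECT COUNT(*) FROM hour_data
--   WHERE hour = '2016-01-01 10:00:00'
--     AND product = 'p1'
--     AND length(device_id_set) = 3;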

Approach B.
Create a column family with the following schema:
column family name : hour_data
hour_data(hour,product,name,num_devices_counter)
num_devices_counter is counter
Primary Key: (hour,product,name)
*Issue*:
I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE hour=<h>
AND product=p AND num_devices_counter=3, because counter columns cannot be
used in a WHERE clause; see the sketch below.
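In CQL this would look roughly like the following (types are my assumption;
each device bumps the counter once per name it saw in the hour):

CREATE TABLE hour_data (
    hour timestamp,
    product text,
    name text,
    num_devices_counter counter,
    PRIMARY KEY (hour, product, name)
);

-- Each device increments the counter for every name it emitted:
UPDATE hour_data
   SET num_devices_counter = num_devices_counter + 1
 WHERE hour = '2016-01-01 10:00:00' AND product = 'p1' AND name = 'n1';

-- The query I want is invalid, because counter columns cannot be
-- used in a WHERE clause:
-- SELECT COUNT(*) FROM hour_data
--   WHERE hour = '2016-01-01 10:00:00' AND product = 'p1'
--     AND num_devices_counter = 3;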

Approach C.
Column family schema:
hour_data(hour,device,name)
Primary Key: (hour,device,name)
If we have to compute foo(p1,h), then we read the data for every device
from *hour_data* and perform the intersection in Spark (see the sketch
below).
*Issue*:
This is a heavy operation and demands multiple large machines.
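A sketch of this approach (types are my assumption; the intersection itself
happens outside Cassandra, e.g. in Spark):

CREATE TABLE hour_data (
    hour timestamp,
    device text,
    name text,
    PRIMARY KEY (hour, device, name)
);

-- One read per device of the product, for the hour in question:
SELECT name FROM hour_data
 WHERE hour = '2016-01-01 10:00:00' AND device = 'd1';
-- ...repeat for d2 and d3, then intersect the three result sets and
-- take the size. Pulling every device's full name set out of Cassandra
-- for each (product, hour) is the heavy part.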

Could you please help me refine these schemas or define a new schema that
solves my problem?
