Step 1 in Cassandra data modeling is to define all of your queries. Are
these in fact the ONLY queries that you need?

If you are doing significant analytics, Spark is indeed the way to go.

Cassandra works best for point queries and narrow slice queries (sequence
of consecutive rows within a single partition).
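
For example, against the hour_data table from your Approach C below (the
values here are only illustrative), a point query and a narrow slice query
look like:

  -- point query: one specific row in one partition
  SELECT name FROM hour_data WHERE hour = '2016012610' AND device = 'd1' AND name = 'abc';

  -- narrow slice query: consecutive clustering rows within the same partition
  SELECT name FROM hour_data WHERE hour = '2016012610' AND device = 'd1';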

-- Jack Krupansky

On Tue, Jan 26, 2016 at 4:46 AM, srungarapu vamsi <srungarapu1...@gmail.com>
wrote:

> Hi,
> I have the following use case:
> A product (P) has 3 or more devices associated with it. Each device (Di)
> emits a set of names (the set has at most 250 names) every minute.
> Now the ask is: compute the function foo(product, hour), which is defined as
> follows:
> *foo*(*product*, *hour*) = the number of names which are seen by all the
> devices associated with the given *product* in the given *hour*.
> Example:
> Let's say product p1 has devices d1, d2, d3 associated with it.
> Let's say S(d,h) is the *set* of names seen by device d in hour h.
> So, now foo(p1,h) = length(S(d1,h) intersect S(d2,h) intersect S(d3,h))
>
> I came up with the following approaches, but I am not convinced by any of them:
> Approach A.
> Create a column family with the following schema:
> column family name : hour_data
> hour_data(hour,product,name,device_id_set)
> device_id_set is Set<String>
> Primary Key: (hour,product,name)
> *Issue*:
> I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE
> hour=<h> AND product=p AND length(device_id_set)=3, as filtering on
> collection columns (or taking their size) is not possible in CQL.
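> Assuming hour, product and name are all text, the CQL would look roughly
> like this sketch:
>
>   CREATE TABLE hour_data (
>       hour text,
>       product text,
>       name text,
>       device_id_set set<text>,
>       PRIMARY KEY (hour, product, name)
>   );
>
>   -- not expressible: CQL has no length()/size() function over collections,
>   -- so a predicate like length(device_id_set)=3 cannot be written
>   -- SELECT COUNT(*) FROM hour_data
>   --   WHERE hour = '2016012610' AND product = 'p1' AND length(device_id_set) = 3;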
>
> Approach B.
> Create a column family with the following schema:
> column family name : hour_data
> hour_data(hour,product,name,num_devices_counter)
> num_devices_counter is counter
> Primary Key: (hour,product,name)
> *Issue*:
> I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE
> hour=<h> AND product=p AND num_devices_counter=3, as counter columns
> cannot be used in a WHERE clause.
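> Assuming the same text types, a CQL sketch would be:
>
>   CREATE TABLE hour_data (
>       hour text,
>       product text,
>       name text,
>       num_devices_counter counter,
>       PRIMARY KEY (hour, product, name)
>   );
>
>   -- each device would bump the counter for every name it sees in that hour:
>   UPDATE hour_data SET num_devices_counter = num_devices_counter + 1
>     WHERE hour = '2016012610' AND product = 'p1' AND name = 'abc';
>
>   -- not allowed: a counter column cannot appear in a WHERE clause, so
>   -- ... AND num_devices_counter = 3 is rejected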
>
> Approach C.
> Column family schema:
> hour_data(hour,device,name)
> Primary Key: (hour,device,name)
> If we have to compute foo(p1,h), then read the data for every device from
> *hour_data* and perform the intersection in Spark.
> *Issue*:
> This is a heavy operation, and it demands large machines, possibly several of them.
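> For reference, the table and the per-device reads would look roughly like
> this (the intersection itself happens in Spark, not in CQL):
>
>   CREATE TABLE hour_data (
>       hour text,
>       device text,
>       name text,
>       PRIMARY KEY (hour, device, name)
>   );
>
>   -- one slice query per device; the resulting name sets are intersected in Spark
>   SELECT name FROM hour_data WHERE hour = '2016012610' AND device = 'd1';
>   SELECT name FROM hour_data WHERE hour = '2016012610' AND device = 'd2';
>   SELECT name FROM hour_data WHERE hour = '2016012610' AND device = 'd3';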
>
> Could you please help me refine these schemas, or define a new schema that
> solves my problem?
>
>
