Jack, this is one of the analytics jobs I have to run. For the given problem, I want to optimize the schema so that instead of loading the data as an RDD onto the Spark machines, I can get the number directly from Cassandra queries. The rationale behind this is that I want to save on Spark machine types :)
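To make the "direct number from Cassandra" idea concrete, here is a minimal sketch of what I have in mind: precompute foo(product, hour) at write time, so reading it is a single point query instead of a Spark job. The table and column names (hour_product_foo, foo) are hypothetical, and the foo() function below is just the reference definition of the count I would precompute:

```python
# Sketch only -- table/column names are assumptions, not an existing schema.
# Idea: maintain the count at ingest time so the analytics job becomes a
# single Cassandra point query.

PRECOMPUTED_SCHEMA = """
CREATE TABLE hour_product_foo (
    hour    timestamp,
    product text,
    foo     int,          -- number of names seen by ALL devices of product in hour
    PRIMARY KEY ((product, hour))
);
"""

POINT_QUERY = "SELECT foo FROM hour_product_foo WHERE product = ? AND hour = ?"

def foo(sets_by_device):
    """Reference definition of foo: size of the intersection of the
    per-device name sets for one (product, hour)."""
    sets = list(sets_by_device.values())
    if not sets:
        return 0
    common = set(sets[0]).intersection(*sets[1:])
    return len(common)

# Example from the thread: product p1 with devices d1, d2, d3.
print(foo({
    "d1": {"a", "b", "c"},
    "d2": {"b", "c", "d"},
    "d3": {"b", "c"},
}))  # prints 2 -- "b" and "c" are seen by all three devices
```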
On Wed, 27 Jan 2016 at 02:07 Jack Krupansky <[email protected]> wrote:

> Step 1 in data modeling in Cassandra is to define all of your queries. Are
> these in fact the ONLY queries that you need?
>
> If you are doing significant analytics, Spark is indeed the way to go.
>
> Cassandra works best for point queries and narrow slice queries (a sequence
> of consecutive rows within a single partition).
>
> -- Jack Krupansky
>
> On Tue, Jan 26, 2016 at 4:46 AM, srungarapu vamsi <
> [email protected]> wrote:
>
>> Hi,
>> I have the following use case:
>> A product (P) has 3 or more devices associated with it. Each device (Di)
>> emits a set of names (the size of the set is less than or equal to 250)
>> every minute.
>>
>> Now the ask is: compute the function foo(product, hour), which is defined
>> as follows:
>> *foo*(*product*, *hour*) = the number of names which are seen by all the
>> devices associated with the given *product* in the given *hour*.
>>
>> Example:
>> Let's say product p1 has devices d1, d2, d3 associated with it.
>> Let's say S(d, h) is the *set* of names seen by device d in hour h.
>> So, foo(p1, h) = length(S(d1, h) intersect S(d2, h) intersect S(d3, h))
>>
>> I came up with the following approaches, but I am not convinced by them:
>>
>> Approach A.
>> Create a column family with the following schema:
>> Column family name: hour_data
>> hour_data(hour, product, name, device_id_set)
>> device_id_set is a Set<String>
>> Primary key: (hour, product, name)
>> *Issue*:
>> I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE
>> hour=<h> AND product=p AND length(device_id_set)=3, as querying on
>> collections is not possible.
>>
>> Approach B.
>> Create a column family with the following schema:
>> Column family name: hour_data
>> hour_data(hour, product, name, num_devices_counter)
>> num_devices_counter is a counter
>> Primary key: (hour, product, name)
>> *Issue*:
>> I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE
>> hour=<h> AND product=p AND num_devices_counter=3, as filtering on counter
>> columns is not possible.
>>
>> Approach C.
>> Column family schema:
>> hour_data(hour, device, name)
>> Primary key: (hour, device, name)
>> If we have to compute foo(p1, h), then read the data for every device
>> from *hour_data* and perform the intersection in Spark.
>> *Issue*:
>> This is a heavy operation and demands big and multiple machines.
>>
>> Could you please help me refine the schemas, or define a new schema,
>> to solve my problem?
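P.S. One variant of Approach B I considered: since Cassandra can't filter on the counter server-side, the client could read the whole (hour, product) partition (a narrow slice query, which suits Cassandra well) and count the matching rows itself. A rough sketch, where the rows list stands in for the result of SELECT name, num_devices_counter FROM hour_data WHERE hour=<h> AND product='p1' (names and values are made up for illustration):

```python
# Sketch: client-side filtering over one (hour, product) partition,
# since WHERE num_devices_counter = 3 is not allowed on a counter column.

def foo_from_counters(rows, num_devices):
    """Count names whose counter reached the product's device count,
    i.e. names seen by every device of the product in that hour."""
    return sum(1 for _name, seen_by in rows if seen_by == num_devices)

rows = [
    ("alpha", 3),  # seen by all of d1, d2, d3
    ("beta",  2),  # seen by only two devices
    ("gamma", 3),
]
print(foo_from_counters(rows, num_devices=3))  # prints 2
```

This still moves up to ~250 rows per (hour, product) to the client, but avoids both collections-in-WHERE and a full Spark job.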
