If the goal is to maximize performance/throughput, you need to ensure that
the data is as contiguous as possible. In other words, model it so you can
ask Cassandra for a slice of consecutive rows rather than requiring slow and
expensive scanning. Typically this means careful attention to partition keys
so that the data will be clustered together, and careful attention to
clustering keys for the same reason. Sometimes you might need a two-level
clustering key (or even partition key) so that you can query by larger or
smaller slices using different key levels.
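
For example, here is a minimal sketch (the table and column names are
hypothetical) of a two-level clustering key that lets you slice either a
whole day or a single hour within it:

    CREATE TABLE readings_by_product (
        product text,
        day date,
        hour int,
        name text,
        PRIMARY KEY ((product), day, hour, name)
    );

    -- wide slice: everything for one day of a product
    SELECT * FROM readings_by_product
      WHERE product = 'p1' AND day = '2016-01-26';

    -- narrow slice: a single hour within that day
    SELECT * FROM readings_by_product
      WHERE product = 'p1' AND day = '2016-01-26' AND hour = 13;

Both queries hit a single partition and read consecutive rows, so no
scanning is involved.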

The new Materialized View feature should help, since you can store (and
automatically update) multiple copies of the raw data in different forms
(orderings) for different queries.
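
A rough sketch of what that could look like for the hypothetical table above
(materialized views require Cassandra 3.0+):

    CREATE MATERIALIZED VIEW readings_by_name AS
        SELECT product, day, hour, name
        FROM readings_by_product
        WHERE product IS NOT NULL AND day IS NOT NULL
          AND hour IS NOT NULL AND name IS NOT NULL
        PRIMARY KEY ((name), product, day, hour);

Cassandra then maintains the second ordering for you as the base table is
written.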

DataStax Enterprise Search also gives you some analytic capabilities in
terms of ad hoc queries on arbitrary columns, facet counting, grouping, etc.
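
For example (assuming DSE Search is enabled and a search index exists on the
table; the column names here are from the sketch above), you can push ad hoc
predicates on arbitrary columns through CQL:

    SELECT * FROM readings_by_product
      WHERE solr_query = 'name:abc* AND hour:13';

Facet counts and grouping are available through the Solr HTTP API on the
same index.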


-- Jack Krupansky

On Wed, Jan 27, 2016 at 1:33 AM, srungarapu vamsi <srungarapu1...@gmail.com>
wrote:

> Jack,
> This is one of the analytics jobs I have to run. For the given problem, I
> want to optimize the schema so that instead of loading the data as an RDD
> into Spark machines, I can get the number directly from Cassandra queries.
> The rationale behind this is that I want to save on Spark machine types :)
>
> On Wed, 27 Jan 2016 at 02:07 Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Step 1 in data modeling in Cassandra is to define all of your queries.
>> Are these in fact the ONLY queries that you need?
>>
>> If you are doing significant analytics, Spark is indeed the way to go.
>>
>> Cassandra works best for point queries and narrow slice queries (sequence
>> of consecutive rows within a single partition).
>>
>> -- Jack Krupansky
>>
>> On Tue, Jan 26, 2016 at 4:46 AM, srungarapu vamsi <
>> srungarapu1...@gmail.com> wrote:
>>
>>> Hi,
>>> I have the following use case:
>>> A product (P) has 3 or more devices associated with it. Each device (Di)
>>> emits a set of names (the size of the set is at most 250) every minute.
>>> Now the ask is: compute the function foo(product, hour), which is defined
>>> as follows:
>>> *foo*(*product*, *hour*) = the number of names which are seen by all the
>>> devices associated with the given *product* in the given *hour*.
>>> Example:
>>> Let's say product p1 has devices d1, d2, d3 associated with it.
>>> Let's say S(d,h) is the *set* of names seen by device d in hour h.
>>> So foo(p1,h) = length(S(d1,h) intersect S(d2,h) intersect S(d3,h))
>>>
>>> I came up with the following approaches, but I am not convinced by them:
>>> Approach A.
>>> Create a column family with the following schema:
>>> column family name : hour_data
>>> hour_data(hour,product,name,device_id_set)
>>> device_id_set is Set<String>
>>> Primary Key: (hour,product,name)
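>>> In CQL that would look roughly like this (column types are assumed here
>>> just for illustration):
>>> CREATE TABLE hour_data (
>>>     hour timestamp,
>>>     product text,
>>>     name text,
>>>     device_id_set set<text>,
>>>     PRIMARY KEY (hour, product, name)
>>> );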
>>> *Issue*:
>>> I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE
>>> hour=<h> AND product=p AND length(device_id_set)=3, since CQL cannot
>>> filter on the size of a collection (there is no length() function).
>>>
>>> Approach B.
>>> Create a column family with the following schema:
>>> column family name : hour_data
>>> hour_data(hour,product,name,num_devices_counter)
>>> num_devices_counter is counter
>>> Primary Key: (hour,product,name)
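>>> In CQL, roughly (types assumed for illustration):
>>> CREATE TABLE hour_data (
>>>     hour timestamp,
>>>     product text,
>>>     name text,
>>>     num_devices_counter counter,
>>>     PRIMARY KEY (hour, product, name)
>>> );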
>>> *Issue*:
>>> I can't just run a query like SELECT COUNT(*) FROM hour_data WHERE
>>> hour=<h> AND product=p AND num_devices_counter=3, since filtering on a
>>> counter column is not possible.
>>>
>>> Approach C.
>>> Column family schema:
>>> hour_data(hour,device,name)
>>> Primary Key: (hour,device,name)
>>> If we have to compute foo(p1,h), then read the data for every device from
>>> *hour_data* and perform the intersection in Spark.
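>>> Each per-device read would be a simple slice query, e.g. (the hour and
>>> device values are placeholders):
>>> SELECT name FROM hour_data WHERE hour = <h> AND device = 'd1';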
>>> *Issue*:
>>> This is a heavy operation and demands multiple large machines.
>>>
>>> Could you please help me refine the schemas or define a new schema to
>>> solve my problem?
>>>
>>>
>>
