I am guessing OP => Original Poster. I had not asked about deletion, but Jack, thanks for your suggestion on the overall handling of deletion. For the type of sensor data I am collecting we really don't want to delete the data, as we want to identify data trends over the really long term (1 to 2 years).
BTW - I found a good link that describes the behavior of the WHERE clause in
Cassandra; posting it here in case someone stumbles upon this thread in the
future: http://mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

-Subodh

On Mon, Sep 1, 2014 at 8:56 AM, Jack Krupansky <j...@basetechnology.com> wrote:
> I did see a reference to deletions: “overall approaches considering
> volumes, deletion, compaction etc.” Did I merely misunderstand the
> reference? That’s all I was responding to... sorry if my misunderstanding
> added any confusion!
>
> -- Jack Krupansky
>
> From: Laing, Michael
> Sent: Monday, September 1, 2014 11:34 AM
> To: user@cassandra.apache.org
> Subject: Re: Help with select IN query in cassandra
>
> Did the OP propose that?
>
>
> On Mon, Sep 1, 2014 at 10:53 AM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>>
>> One comment on deletions – aren’t deletions kind of an anti-pattern for
>> modern data processing, such as sensor data, time series data, and social
>> media? I mean, isn’t it usually better to return a full history of the
>> data, with some aging scheme, and manage the tracking of which values are
>> “current” (or “recent”)? Shouldn’t we be looking for and promoting
>> “write once” approaches as a much stronger preference/pattern? Or maybe I
>> should say “write once and bulk delete on aging” rather than the exercise
>> in futility of doing a massive number of deletes and updates in place?
>>
>> -- Jack Krupansky
>>
>> From: Laing, Michael
>> Sent: Monday, September 1, 2014 9:33 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Help with select IN query in cassandra
>>
>> This should work for your query requirements - 2 tables with the same
>> info, because disk is cheap and writes are fast, so optimize for reads:
>>
>> CREATE TABLE sensor_asset (
>>     asset_id text,
>>     event_time timestamp,
>>     tuuid timeuuid,
>>     sensor_reading map<text, text>,
>>     sensor_serial_number text,
>>     sensor_type int,
>>     PRIMARY KEY ((asset_id), event_time)
>> );
>>
>> CREATE TABLE sensor_tuuid (
>>     asset_id text,
>>     event_time timestamp,
>>     tuuid timeuuid,
>>     sensor_reading map<text, text>,
>>     sensor_serial_number text,
>>     sensor_type int,
>>     PRIMARY KEY (tuuid)
>> );
>>
>> 1. Give me all sensor data for an asset:
>>
>> select * from sensor_asset where asset_id = <asset>;
>>
>> 2. Give me sensor data that matches a set of timeuuids:
>>
>> select * from sensor_tuuid where tuuid in (<tuuid1>, <tuuid2>, ...);
>>
>> 3. Give me all sensor data for an asset collected after | before | between
>> event_time(s):
>>
>> select * from sensor_asset where asset_id = <asset> and event_time > <ts1>;
>> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>;
>> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>
>> and event_time > <ts2>;
>>
>> ***
>>
>> Many people (not me) handle sensor data, so there may be better overall
>> approaches considering volumes, deletion, compaction etc.
>>
>> But the above is simple and should make your current approach workable as
>> you iterate toward a complete solution.
>>
>> Cheers,
>> ml
>>
>>
>>
>> On Sun, Aug 31, 2014 at 11:08 PM, Subodh Nijsure
>> <subodh.nijs...@gmail.com> wrote:
>>>
>>> Thanks for your help, Michael.
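Michael's write-twice pattern above (one logical insert fanning out to two tables, each keyed for a different query) can be modeled in plain Python to make the trade-off concrete. This is an in-memory sketch, not Cassandra driver code; the table and column names follow his schema, while `insert_reading`, `by_asset`, and `by_tuuids` are illustrative names, not anything from the thread:

```python
from uuid import uuid1

# In-memory model of the write-twice pattern: each "table" is a dict
# keyed the way its Cassandra counterpart is partitioned.
sensor_asset = {}   # asset_id -> list of rows (read path: all data for an asset)
sensor_tuuid = {}   # tuuid -> row           (read path: lookup by timeuuid)

def insert_reading(asset_id, event_time, reading):
    row = {
        "asset_id": asset_id,
        "event_time": event_time,
        "tuuid": uuid1(),           # server-side unique, time-based id
        "sensor_reading": reading,
    }
    # One logical write fans out to both tables (disk is cheap, writes are fast).
    sensor_asset.setdefault(asset_id, []).append(row)
    sensor_tuuid[row["tuuid"]] = row
    return row["tuuid"]

# Query 1: all sensor data for an asset, ordered by event_time.
def by_asset(asset_id):
    return sorted(sensor_asset.get(asset_id, []), key=lambda r: r["event_time"])

# Query 2: sensor data matching a set of timeuuids.
def by_tuuids(tuuids):
    return [sensor_tuuid[t] for t in tuuids if t in sensor_tuuid]

t1 = insert_reading("asset-3", 1409500000, {"temp": "21.5"})
t2 = insert_reading("asset-3", 1409500060, {"temp": "21.7"})
```

The point of the sketch is that the duplication lives entirely on the write path; each read is a single direct lookup in the table shaped for it.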
>>>
>>> If specifying asset_id would help, I can construct queries that
>>> include asset_id.
>>>
>>> So I have been "playing" around with the PRIMARY KEY definition, and with
>>> the following table definition:
>>>
>>> CREATE TABLE sensor_info_table (
>>>     asset_id text,
>>>     event_time timestamp,
>>>     "timestamp" timeuuid,
>>>     sensor_reading map<text, text>,
>>>     sensor_serial_number text,
>>>     sensor_type int,
>>>     PRIMARY KEY ((asset_id, "timestamp"), event_time)
>>> );
>>>
>>> it does what I want. I removed the index on "timestamp" since it is now
>>> part of the primary key, and thus a query like this works:
>>>
>>> SELECT * from sigsense.sensor_info_table where asset_id='3' AND
>>> timestamp IN (
>>> 17830bb0-316a-11e4-800f-b888e30f5d17, 16ddbdfe-316a-11e4-9f50-b888e30f5d17
>>> );
>>>
>>> But now this doesn't work; it gives an error:
>>>
>>> SELECT * from sensor_info_table where asset_id='3' ;
>>>
>>> Bad Request: Partition key part timestamp must be restricted since
>>> preceding part is
>>>
>>> I am keeping the index on event_time as I sometimes need to query
>>> "give me all data since time x", i.e. something like this works:
>>>
>>> SELECT * from sensor_info_table where event_time > '2014-08-31
>>> 16:54:02-0700' ALLOW FILTERING;
>>>
>>> However, this doesn't work:
>>>
>>> SELECT * from sensor_info_table where asset_id='3' AND event_time >
>>> '2014-08-31 16:54:02-0700';
>>>
>>> Bad Request: Partition key part timestamp must be restricted since
>>> preceding part is
>>>
>>> Also, I am not comfortable with the fact that I need to specify ALLOW
>>> FILTERING.
>>>
>>> I guess Cassandra schema design asks the designer to write down the
>>> queries before designing the schema.
>>>
>>> For the above table definition I want to do the following queries:
>>>
>>> - Give me all sensor data for a given asset.
>>> - Give me sensor data that matches a given set of timeuuids.
>>> - Give me all sensor data for a given asset that was collected after |
>>> before | between certain event_time(s).
>>>
>>> Given these query criteria, how should I construct my schema? One
>>> thought that has occurred to me is to make three tables, with asset_id,
>>> event_time, and timeuuid respectively as the primary keys, and depending
>>> on the type of query choose the table to query. That seems like a waste
>>> of resources (disk, cpu) and also increases insert times(!), but if
>>> that's the way things need to happen in the Cassandra world, it's okay.
>>> (I am two-three weeks into learning about Cassandra.)
>>>
>>> -Subodh
>>>
>>> On Sun, Aug 31, 2014 at 6:44 PM, Laing, Michael
>>> <michael.la...@nytimes.com> wrote:
>>> > Oh, it must be late - I missed the fact that you didn't want to specify
>>> > asset_id. The above queries will still work but you have to use 'allow
>>> > filtering' - generally not a good idea. I'll look again in the morning.
>>> >
>>> >
>>> > On Sun, Aug 31, 2014 at 9:41 PM, Laing, Michael
>>> > <michael.la...@nytimes.com>
>>> > wrote:
>>> >>
>>> >> Hmm. Because the clustering key is (event_time, "timestamp"),
>>> >> event_time must be specified as well - hopefully that info is
>>> >> available to the ux.
>>> >>
>>> >> Unfortunately you will then hit another problem with your query: you
>>> >> are selecting a collection field... this will not work with IN on
>>> >> "timestamp".
>>> >>
>>> >> So you could select all the "timestamp"s for an asset_id/event_time:
>>> >>>
>>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>>> >>> 1231234;
>>> >>
>>> >>
>>> >> Or you could apply a range of "timestamp"s:
>>> >>>
>>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>>> >>> 1231234 and "timestamp" > 1d934af3-3178-11e4-ba8d-406c8f1838fa and
>>> >>> "timestamp" < 20b82021-3178-11e4-abc2-406c8f1838fa;
>>> >>
>>> >>
>>> >> BTW the secondary indices are not a good idea: high cardinality, and
>>> >> of no use in this query that I can see.
>>> >>
>>> >> ml
>>> >>
>>> >>
>>> >> On Sun, Aug 31, 2014 at 8:40 PM, Subodh Nijsure
>>> >> <subodh.nijs...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Not really. event_time is created by the sensor when it reads data,
>>> >>> and timestamp is something the server creates when inserting data
>>> >>> into the Cassandra db. At a later point in time my Django UX allows
>>> >>> users to browse this data and reference interesting data points via
>>> >>> the timestamp field. The timestamp field is my bridge between the
>>> >>> SQL and NoSQL worlds.
>>> >>>
>>> >>> Subodh
>>> >>>
>>> >>> On Aug 31, 2014 5:33 PM, "Laing, Michael" <michael.la...@nytimes.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Are event_time and timestamp essentially representing the same
>>> >>>> datetime?
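The range-of-"timestamp"s query above needs timeuuid bounds for a wall-clock interval; in CQL these are normally produced with the built-in minTimeuuid()/maxTimeuuid() functions. How such a lower bound is constructed can be sketched with the Python standard library. This is a simplification under stated assumptions: `min_timeuuid` is a made-up helper name, and it zeroes the clock-sequence and node fields rather than using the exact sentinel values Cassandra uses:

```python
import uuid
from datetime import datetime, timezone

# UUID version 1 counts time in 100-ns intervals since the Gregorian reform.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def min_timeuuid(dt):
    """A small version-1-style UUID for a given instant, usable as a
    lower bound in a "timestamp" range query (clock seq and node zeroed)."""
    ticks = int((dt - GREGORIAN_EPOCH).total_seconds() * 10_000_000)
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = 0x1000 | ((ticks >> 48) & 0x0FFF)  # set version = 1
    # 0x80 sets the RFC 4122 variant bits with clock sequence zero.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))

# Bounds for a wall-clock interval:
lower = min_timeuuid(datetime(2014, 8, 31, 16, 54, 2, tzinfo=timezone.utc))
upper = min_timeuuid(datetime(2014, 9, 1, tzinfo=timezone.utc))
# ... WHERE asset_id = '3' AND "timestamp" > <lower> AND "timestamp" < <upper>
```

Note that Cassandra orders timeuuids by their embedded timestamp, so these bounds select exactly the rows whose server-side insert time falls in the interval.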
>>> >>>>
>>> >>>> On Sunday, August 31, 2014, Subodh Nijsure
>>> >>>> <subodh.nijs...@gmail.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> I have the following database schema:
>>> >>>>>
>>> >>>>> CREATE TABLE sensor_info_table (
>>> >>>>>     asset_id text,
>>> >>>>>     event_time timestamp,
>>> >>>>>     "timestamp" timeuuid,
>>> >>>>>     sensor_reading map<text, text>,
>>> >>>>>     sensor_serial_number text,
>>> >>>>>     sensor_type int,
>>> >>>>>     PRIMARY KEY ((asset_id), event_time, "timestamp")
>>> >>>>> );
>>> >>>>>
>>> >>>>> CREATE INDEX event_time_index ON sensor_info_table (event_time);
>>> >>>>>
>>> >>>>> CREATE INDEX timestamp_index ON sensor_info_table ("timestamp");
>>> >>>>>
>>> >>>>> Now I am able to insert data into this table; however, I am unable
>>> >>>>> to do the following query, where I want to select items with
>>> >>>>> specific timeuuid values. It gives me the following error:
>>> >>>>>
>>> >>>>> SELECT * from mydb.sensor_info_table where timestamp IN (
>>> >>>>> bfdfa614-3166-11e4-a61d-b888e30f5d17,
>>> >>>>> bf4521ac-3166-11e4-87a3-b888e30f5d17);
>>> >>>>>
>>> >>>>> Bad Request: PRIMARY KEY column "timestamp" cannot be restricted
>>> >>>>> (preceding column "event_time" is either not restricted or by a
>>> >>>>> non-EQ relation)
>>> >>>>>
>>> >>>>> What do I have to do to make this work?
>>> >>>>>
>>> >>>>> For what it's worth, I am using Django for my front-end
>>> >>>>> development, and I am using the "timestamp timeuuid" field as a
>>> >>>>> unique identifier to reference a specific sensor reading from the
>>> >>>>> Django framework -- since Cassandra doesn't have a way to generate
>>> >>>>> a unique id upon insert (like old-style RDBMSs' auto-increment
>>> >>>>> fields).
>>> >>>>>
>>> >>>>>
>>> >>>>> Below is the software version info.
>>> >>>>>
>>> >>>>> show VERSION;
>>> >>>>> [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 | Thrift protocol
>>> >>>>> 19.39.0]
>>> >>>>>
>>> >>>>> I really don't understand the error message: preceding column
>>> >>>>> "event_time" is either not restricted or restricted by a non-EQ
>>> >>>>> relation.
>>> >>>>>
>>> >>>>> -Subodh Nijsure
>>> >>
>>> >>
>>> >
>>
>>
>
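One closing note on the timeuuid-as-identifier approach discussed in this thread: a version-1 UUID embeds its creation time, so the identifier that bridges Django and Cassandra can be turned back into a datetime with the Python standard library alone. A minimal sketch (`timeuuid_to_datetime` is an illustrative name, not a library function):

```python
import uuid
from datetime import datetime, timedelta, timezone

def timeuuid_to_datetime(u):
    """Recover the wall-clock time embedded in a version-1 (time) UUID."""
    assert u.version == 1, "only version-1 UUIDs carry a timestamp"
    # u.time counts 100-ns intervals since 1582-10-15 (the Gregorian reform).
    gregorian_epoch = datetime(1582, 10, 15, tzinfo=timezone.utc)
    return gregorian_epoch + timedelta(microseconds=u.time // 10)

# One of the timeuuids from this thread:
u = uuid.UUID("17830bb0-316a-11e4-800f-b888e30f5d17")
print(timeuuid_to_datetime(u))  # a UTC instant late on 2014-08-31
```

This means the server-generated "timestamp" column doubles as a record of insert time, which can be handy for debugging even when event_time is the field users query on.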