I am guessing OP => Original Poster. I had not asked about deletion, but Jack, thanks for your suggestion on the overall handling of deletion. For the type of sensor data I am collecting we really don't want to delete the data, as we want to identify data trends over the really long term (1 to 2 years).
BTW - I found a good link that describes the behavior of the WHERE clause in
Cassandra; posting it here in case someone stumbles upon this thread in the
future: http://mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

-Subodh

On Mon, Sep 1, 2014 at 8:56 AM, Jack Krupansky <j...@basetechnology.com> wrote:
> I did see a reference to deletions: “overall approaches considering
> volumes, deletion, compaction etc.” Did I merely misunderstand the
> reference? That’s all I was responding to... sorry if my misunderstanding
> added any confusion!
>
> -- Jack Krupansky
>
> From: Laing, Michael
> Sent: Monday, September 1, 2014 11:34 AM
> To: user@cassandra.apache.org
> Subject: Re: Help with select IN query in cassandra
>
> Did the OP propose that?
>
>
> On Mon, Sep 1, 2014 at 10:53 AM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>>
>> One comment on deletions – aren’t deletions kind of an anti-pattern for
>> modern data processing, such as sensor data, time series data, and social
>> media? I mean, isn’t it usually better to return a full history of the
>> data, with some aging scheme, and manage the tracking of which values are
>> “current” (or “recent”)? Shouldn’t we be looking for and promoting
>> “write once” approaches as a much stronger preference/pattern? Or maybe I
>> should say “write once and bulk delete on aging” rather than the exercise
>> in futility of doing a massive number of deletes and updates in place?
>>
>> -- Jack Krupansky
>>
>> From: Laing, Michael
>> Sent: Monday, September 1, 2014 9:33 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Help with select IN query in cassandra
>>
>> This should work for your query requirements - 2 tables with the same
>> info, because disk is cheap and writes are fast, so optimize for reads:
>>
>> CREATE TABLE sensor_asset (
>>     asset_id text,
>>     event_time timestamp,
>>     tuuid timeuuid,
>>     sensor_reading map<text, text>,
>>     sensor_serial_number text,
>>     sensor_type int,
>>     PRIMARY KEY ((asset_id), event_time)
>> );
>>
>> CREATE TABLE sensor_tuuid (
>>     asset_id text,
>>     event_time timestamp,
>>     tuuid timeuuid,
>>     sensor_reading map<text, text>,
>>     sensor_serial_number text,
>>     sensor_type int,
>>     PRIMARY KEY (tuuid)
>> );
>>
>> 1. Give me all sensor data for an asset:
>>
>> select * from sensor_asset where asset_id = <asset>;
>>
>> 2. Give me sensor data that matches a set of timeuuids:
>>
>> select * from sensor_tuuid where tuuid in (<tuuid1>, <tuuid2>, ...);
>>
>> 3. Give me all sensor data for an asset collected after | before | between
>> event_time(s):
>>
>> select * from sensor_asset where asset_id = <asset> and event_time > <ts1>;
>> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>;
>> select * from sensor_asset where asset_id = <asset> and event_time < <ts1>
>> and event_time > <ts2>;
>>
>> ***
>>
>> Many people (not me) handle sensor data, so there may be better overall
>> approaches considering volumes, deletion, compaction etc.
>>
>> But the above is simple and should make your current approach workable as
>> you iterate toward a complete solution.
>>
>> Cheers,
>> ml
>>
>>
>>
>> On Sun, Aug 31, 2014 at 11:08 PM, Subodh Nijsure
>> <subodh.nijs...@gmail.com> wrote:
>>>
>>> Thanks for your help, Michael.
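Michael's write-twice pattern above (one logical insert fanning out to two tables, each keyed for a different query) can be modeled in plain Python to make the trade-off concrete. This is an in-memory sketch, not Cassandra driver code; the table and column names follow his schema, while `insert_reading`, `by_asset`, and `by_tuuids` are illustrative names, not anything from the thread:

```python
from uuid import uuid1

# In-memory model of the write-twice pattern: each "table" is a dict
# keyed the way its Cassandra counterpart is partitioned.
sensor_asset = {}   # asset_id -> list of rows (read path: all data for an asset)
sensor_tuuid = {}   # tuuid -> row           (read path: lookup by timeuuid)

def insert_reading(asset_id, event_time, reading):
    row = {
        "asset_id": asset_id,
        "event_time": event_time,
        "tuuid": uuid1(),           # server-side unique, time-based id
        "sensor_reading": reading,
    }
    # One logical write fans out to both tables (disk is cheap, writes are fast).
    sensor_asset.setdefault(asset_id, []).append(row)
    sensor_tuuid[row["tuuid"]] = row
    return row["tuuid"]

# Query 1: all sensor data for an asset, ordered by event_time.
def by_asset(asset_id):
    return sorted(sensor_asset.get(asset_id, []), key=lambda r: r["event_time"])

# Query 2: sensor data matching a set of timeuuids.
def by_tuuids(tuuids):
    return [sensor_tuuid[t] for t in tuuids if t in sensor_tuuid]

t1 = insert_reading("asset-3", 1409500000, {"temp": "21.5"})
t2 = insert_reading("asset-3", 1409500060, {"temp": "21.7"})
```

The point of the sketch is that the duplication lives entirely on the write path; each read is a single direct lookup in the table shaped for it.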
>>>
>>> If specifying asset_id would help, I can construct queries that
>>> include asset_id.
>>>
>>> So I have been "playing" around with the PRIMARY KEY definition, and with
>>> the following table definition:
>>>
>>> CREATE TABLE sensor_info_table (
>>>     asset_id text,
>>>     event_time timestamp,
>>>     "timestamp" timeuuid,
>>>     sensor_reading map<text, text>,
>>>     sensor_serial_number text,
>>>     sensor_type int,
>>>     PRIMARY KEY ((asset_id, "timestamp"), event_time)
>>> );
>>>
>>> it does what I want. I removed the index on "timestamp" since it is now
>>> part of the primary key, and thus a query like this works:
>>>
>>> SELECT * from sigsense.sensor_info_table where asset_id='3' AND
>>> timestamp IN (
>>> 17830bb0-316a-11e4-800f-b888e30f5d17, 16ddbdfe-316a-11e4-9f50-b888e30f5d17
>>> );
>>>
>>> But now this doesn't work; it gives an error:
>>>
>>> SELECT * from sensor_info_table where asset_id='3' ;
>>>
>>> Bad Request: Partition key part timestamp must be restricted since
>>> preceding part is
>>>
>>> I am keeping the index on event_time as I sometimes need to query
>>> "give me all data since time x", i.e. something like this works:
>>>
>>> SELECT * from sensor_info_table where event_time > '2014-08-31
>>> 16:54:02-0700' ALLOW FILTERING;
>>>
>>> However, this doesn't work:
>>>
>>> SELECT * from sensor_info_table where asset_id='3' AND event_time >
>>> '2014-08-31 16:54:02-0700';
>>>
>>> Bad Request: Partition key part timestamp must be restricted since
>>> preceding part is
>>>
>>> Also, I am not comfortable with the fact that I need to specify ALLOW
>>> FILTERING.
>>>
>>> I guess Cassandra schema design asks the designer to write down the
>>> queries before designing the schema.
>>>
>>> For the above table definition I want to do the following queries:
>>>
>>> - Give me all sensor data for a given asset.
>>> - Give me sensor data that matches a given set of timeuuids.
>>> - Give me all sensor data for a given asset that was collected after |
>>> before | between certain event_time(s).
>>>
>>> Given these query criteria, how should I construct my schema? One
>>> thought that has occurred to me is to make three tables, with asset_id,
>>> event_time, and timeuuid respectively as the primary keys, and depending
>>> on the type of query choose the table to query. That seems like a waste
>>> of resources (disk, cpu) and also increases insert times(!), but if
>>> that's the way things need to happen in the Cassandra world, it's okay.
>>> (I am two-three weeks into learning about Cassandra.)
>>>
>>> -Subodh
>>>
>>> On Sun, Aug 31, 2014 at 6:44 PM, Laing, Michael
>>> <michael.la...@nytimes.com> wrote:
>>> > Oh, it must be late - I missed the fact that you didn't want to specify
>>> > asset_id. The above queries will still work but you have to use 'allow
>>> > filtering' - generally not a good idea. I'll look again in the morning.
>>> >
>>> >
>>> > On Sun, Aug 31, 2014 at 9:41 PM, Laing, Michael
>>> > <michael.la...@nytimes.com>
>>> > wrote:
>>> >>
>>> >> Hmm. Because the clustering key is (event_time, "timestamp"),
>>> >> event_time must be specified as well - hopefully that info is
>>> >> available to the ux.
>>> >>
>>> >> Unfortunately you will then hit another problem with your query: you
>>> >> are selecting a collection field... this will not work with IN on
>>> >> "timestamp".
>>> >>
>>> >> So you could select all the "timestamp"s for an asset_id/event_time:
>>> >>>
>>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>>> >>> 1231234;
>>> >>
>>> >>
>>> >> Or you could apply a range of "timestamp"s:
>>> >>>
>>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time =
>>> >>> 1231234 and "timestamp" > 1d934af3-3178-11e4-ba8d-406c8f1838fa and
>>> >>> "timestamp" < 20b82021-3178-11e4-abc2-406c8f1838fa;
>>> >>
>>> >>
>>> >> BTW the secondary indices are not a good idea: high cardinality, and
>>> >> of no use in this query that I can see.
>>> >>
>>> >> ml
>>> >>
>>> >>
>>> >> On Sun, Aug 31, 2014 at 8:40 PM, Subodh Nijsure
>>> >> <subodh.nijs...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Not really. event_time is created by the sensor when it reads data,
>>> >>> and timestamp is something the server creates when inserting data
>>> >>> into the Cassandra db. At a later point in time my Django UX allows
>>> >>> users to browse this data and reference interesting data points via
>>> >>> the timestamp field. The timestamp field is my bridge between the
>>> >>> SQL and NoSQL worlds.
>>> >>>
>>> >>> Subodh
>>> >>>
>>> >>> On Aug 31, 2014 5:33 PM, "Laing, Michael" <michael.la...@nytimes.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Are event_time and timestamp essentially representing the same
>>> >>>> datetime?
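The range-of-"timestamp"s query above needs timeuuid bounds for a wall-clock interval; in CQL these are normally produced with the built-in minTimeuuid()/maxTimeuuid() functions. How such a lower bound is constructed can be sketched with the Python standard library. This is a simplification under stated assumptions: `min_timeuuid` is a made-up helper name, and it zeroes the clock-sequence and node fields rather than using the exact sentinel values Cassandra uses:

```python
import uuid
from datetime import datetime, timezone

# UUID version 1 counts time in 100-ns intervals since the Gregorian reform.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def min_timeuuid(dt):
    """A small version-1-style UUID for a given instant, usable as a
    lower bound in a "timestamp" range query (clock seq and node zeroed)."""
    ticks = int((dt - GREGORIAN_EPOCH).total_seconds() * 10_000_000)
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = 0x1000 | ((ticks >> 48) & 0x0FFF)  # set version = 1
    # 0x80 sets the RFC 4122 variant bits with clock sequence zero.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))

# Bounds for a wall-clock interval:
lower = min_timeuuid(datetime(2014, 8, 31, 16, 54, 2, tzinfo=timezone.utc))
upper = min_timeuuid(datetime(2014, 9, 1, tzinfo=timezone.utc))
# ... WHERE asset_id = '3' AND "timestamp" > <lower> AND "timestamp" < <upper>
```

Note that Cassandra orders timeuuids by their embedded timestamp, so these bounds select exactly the rows whose server-side insert time falls in the interval.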
>>> >>>>
>>> >>>> On Sunday, August 31, 2014, Subodh Nijsure
>>> >>>> <subodh.nijs...@gmail.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> I have the following database schema:
>>> >>>>>
>>> >>>>> CREATE TABLE sensor_info_table (
>>> >>>>>     asset_id text,
>>> >>>>>     event_time timestamp,
>>> >>>>>     "timestamp" timeuuid,
>>> >>>>>     sensor_reading map<text, text>,
>>> >>>>>     sensor_serial_number text,
>>> >>>>>     sensor_type int,
>>> >>>>>     PRIMARY KEY ((asset_id), event_time, "timestamp")
>>> >>>>> );
>>> >>>>>
>>> >>>>> CREATE INDEX event_time_index ON sensor_info_table (event_time);
>>> >>>>>
>>> >>>>> CREATE INDEX timestamp_index ON sensor_info_table ("timestamp");
>>> >>>>>
>>> >>>>> Now I am able to insert data into this table; however, I am unable
>>> >>>>> to do the following query, where I want to select items with
>>> >>>>> specific timeuuid values. It gives me the following error:
>>> >>>>>
>>> >>>>> SELECT * from mydb.sensor_info_table where timestamp IN (
>>> >>>>> bfdfa614-3166-11e4-a61d-b888e30f5d17,
>>> >>>>> bf4521ac-3166-11e4-87a3-b888e30f5d17);
>>> >>>>>
>>> >>>>> Bad Request: PRIMARY KEY column "timestamp" cannot be restricted
>>> >>>>> (preceding column "event_time" is either not restricted or by a
>>> >>>>> non-EQ relation)
>>> >>>>>
>>> >>>>> What do I have to do to make this work?
>>> >>>>>
>>> >>>>> For what it's worth, I am using Django for my front-end
>>> >>>>> development, and I am using the "timestamp timeuuid" field as a
>>> >>>>> unique identifier to reference a specific sensor reading from the
>>> >>>>> Django framework -- since Cassandra doesn't have a way to generate
>>> >>>>> a unique id upon insert (like old-style RDBMSs' auto-increment
>>> >>>>> fields).
>>> >>>>>
>>> >>>>>
>>> >>>>> Below is the software version info.
>>> >>>>>
>>> >>>>> show VERSION;
>>> >>>>> [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 | Thrift protocol
>>> >>>>> 19.39.0]
>>> >>>>>
>>> >>>>> I really don't understand the error message: preceding column
>>> >>>>> "event_time" is either not restricted or restricted by a non-EQ
>>> >>>>> relation.
>>> >>>>>
>>> >>>>> -Subodh Nijsure
>>> >>
>>> >>
>>> >
>>
>>
>
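One closing note on the timeuuid-as-identifier approach discussed in this thread: a version-1 UUID embeds its creation time, so the identifier that bridges Django and Cassandra can be turned back into a datetime with the Python standard library alone. A minimal sketch (`timeuuid_to_datetime` is an illustrative name, not a library function):

```python
import uuid
from datetime import datetime, timedelta, timezone

def timeuuid_to_datetime(u):
    """Recover the wall-clock time embedded in a version-1 (time) UUID."""
    assert u.version == 1, "only version-1 UUIDs carry a timestamp"
    # u.time counts 100-ns intervals since 1582-10-15 (the Gregorian reform).
    gregorian_epoch = datetime(1582, 10, 15, tzinfo=timezone.utc)
    return gregorian_epoch + timedelta(microseconds=u.time // 10)

# One of the timeuuids from this thread:
u = uuid.UUID("17830bb0-316a-11e4-800f-b888e30f5d17")
print(timeuuid_to_datetime(u))  # a UTC instant late on 2014-08-31
```

This means the server-generated "timestamp" column doubles as a record of insert time, which can be handy for debugging even when event_time is the field users query on.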