RE: What is the best way to model my time series?

SEAN_R_DURITY Fri, 25 Mar 2016 06:44:27 -0700

You might take a look at this previous conversation on queue-type applications 
and Cassandra. Generally this is an anti-pattern for a distributed system like 
Cassandra.
https://mail-archives.apache.org/mod_mbox/cassandra-user/201603.mbox/<CANeMN=-keixxgywlnsyknhqmwfwmmy3b68pklsw55cgsm_u...@mail.gmail.com>


Sean Durity

From: K. Lawson [mailto:klawso...@gmail.com]
Sent: Friday, March 25, 2016 8:32 AM
To: user@cassandra.apache.org
Subject: What is the best way to model my time series?

While adhering to best practices, I am trying to model a time series in 
Cassandra that is compliant with the following access pattern directives:

            - Is to be both read and shrunk by a single party, and grown by 
multiple parties
            - Is to be read as a queue (in other words, its entries, from first 
to last, are to be paged through in order)
            - Is to be grown as a queue (in other words, new entries, expected 
to number between 0 and a couple of hundred per day, are always APPENDED to 
the series)
            - Is to be shrunk by removing entries as soon as the application 
has finished processing them

So far, I've come up with four solutions, listed below with their pros and 
cons, that comply with the directives given above. Is there a solution 
superior to these, and if not, which of these is best?


Solution #1:

  //Processing position markers (saved somewhere on disk)
  mostRecentProcessedItemInsertTime = 0
  mostRecentProcessedItemInsertDayStartTime = 0

  CREATE TABLE IF NOT EXISTS solution_table_1
  (
              itemInsertDayStartTime    timestamp,
              itemInsertTime            timestamp,
              itemId                    timeuuid,
              PRIMARY KEY (itemInsertDayStartTime, itemInsertTime, itemId)
  );

  //Initial row retrieval query (presumably, the position markers will be
  //appropriately updated after each retrieval)
  SELECT *
  FROM solution_table_1
  WHERE itemInsertDayStartTime IN (mostRecentProcessedItemInsertDayStartTime,
      mostRecentProcessedItemInsertDayStartTime + 86400000, ...)
  AND itemInsertTime > mostRecentProcessedItemInsertTime
  LIMIT 30

Pros:
            - Shards table data across the cluster

Cons:
            - Requires the maintenance of position markers
            - Requires retrievals that page the table data by itemInsertTime 
to explicitly list the partitions to target (which may or may not hold data)
            - Requires contacting multiple nodes to satisfy retrievals that 
page the table data by itemInsertTime
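
For Solution #1, the partition list in the IN clause has to be derived by the 
client from the saved position marker. A minimal Python sketch of that 
derivation follows, assuming timestamps are epoch milliseconds and days are 
bucketed in UTC. The function names and the max_buckets cap are hypothetical 
and not part of the original proposal.

```python
from typing import List, Optional

MS_PER_DAY = 86_400_000  # same constant as used in the IN clause


def day_start_ms(insert_time_ms: int) -> int:
    """Truncate an epoch-millisecond timestamp to the start of its UTC day."""
    return insert_time_ms - (insert_time_ms % MS_PER_DAY)


def day_buckets(marker_day_start_ms: int, now_ms: int,
                max_buckets: Optional[int] = None) -> List[int]:
    """Return the day-start values from the marker's day through today.

    max_buckets (an assumption, not in the original post) caps how many
    partitions a single query would name in its IN clause.
    """
    last_day = day_start_ms(now_ms)
    buckets = list(range(marker_day_start_ms, last_day + MS_PER_DAY, MS_PER_DAY))
    if max_buckets is not None:
        buckets = buckets[:max_buckets]
    return buckets


if __name__ == "__main__":
    marker = 0  # initial marker value from the post
    now = 2 * MS_PER_DAY + 5_000
    print(day_buckets(marker, now))
```

With the initial marker value of 0 and a real current time, an uncapped list 
would span every day since the epoch, so some cap (or a more recent starting 
marker) would likely be needed in practice.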


Solution #2:

CREATE TABLE IF NOT EXISTS solution_table_2
(
  itemInsertTime    timestamp,
  itemId            timeuuid,
  PRIMARY KEY (itemInsertTime, itemId)
);

CREATE INDEX IF NOT EXISTS ON solution_table_2 (itemInsertTime);

//Initial row retrieval query
SELECT * FROM solution_table_2 WHERE itemInsertTime > 0 LIMIT 30 ALLOW FILTERING

Pros:
            - Shards table data across the cluster
            - Enables retrievals which page table data by itemInsertTime to be 
conducted without explicitly specifying partitions to target

Cons:
            - Creates an index on a high-cardinality column
            - Requires contacting multiple nodes, as well as data filtering, 
to satisfy retrievals that page the table data by itemInsertTime


Solution #3:

CREATE TABLE IF NOT EXISTS solution_table_3
(
  itemInsertTime            timestamp,
  itemId                    timeuuid,
  itemInsertDayStartTime    timestamp,
  PRIMARY KEY (itemInsertTime, itemId)
);

CREATE INDEX IF NOT EXISTS ON solution_table_3 (itemInsertDayStartTime);

//Initial row retrieval query
SELECT * FROM solution_table_3 WHERE itemInsertDayStartTime > 0 LIMIT 30 ALLOW FILTERING

Pros:
            - Shards table data across the cluster
            - Enables retrievals which page table data by itemInsertTime to be 
conducted without explicitly specifying partitions to target
            - Creates an index on a column that is expected to have suitable 
cardinality

Cons:
            - Requires contacting multiple nodes, as well as data filtering, 
to satisfy retrievals that page the table data by itemInsertTime


Solution #4:

CREATE TABLE IF NOT EXISTS solution_table_4
(
  dummyPartitionInt    int,
  itemInsertTime       timestamp,
  itemId               timeuuid,
  PRIMARY KEY (dummyPartitionInt, itemInsertTime, itemId)
);

//Initial row retrieval query (assuming all rows are inserted with a
//dummyPartitionInt value of 0)
SELECT * FROM solution_table_4 WHERE dummyPartitionInt = 0 AND itemInsertTime > 0 LIMIT 30

Pros:
            - Enables retrieval to be satisfied with a single replica set
            - Enables retrievals which page table data by itemInsertTime to be 
conducted without explicitly specifying more than one partition to target

Cons:
            - Requires the use of a "dummy" column
            - Confines the table data (and, as a result, all operations on it) 
to a single partition

