What is the best way to model my time series?

K. Lawson Fri, 25 Mar 2016 05:32:02 -0700

While adhering to best practices, I am trying to model a time series in
Cassandra that is compliant with the following access pattern directives:


- Is to be both read and shrank by a single party, grown by multiple parties
- Is to be read as a queue (in other words, its entries, from first to
last, are to be paged through in order)
- Is to grown as a queue (in other words, new entries (the number of which
is expected to fall in the range of 0 to a couple of hundred per day) are
always APPENDED to the series)
- Is to be shrunk by way of the removal of any entries which have been
processed by the application (immediately upon completion of said
processing)

So far, I've come up with four solutions, listed below (along with their
pros and cons), that are compliant with
the directives given above; is there any solution superior to these, and if
not, which one of these is most optimal?



Solution #1:


//Processing position markers (saved somewhere on disk)
mostRecentProcessedItemInsertTime = 0
mostRecentProcessedItemInsertDayStartTime = 0

CREATE TABLE IF NOT EXISTS solution_table_1
(
itemInsertDayStartTime timestamp
itemInsertTime timestamp
itemId timeuuid
PRIMARY KEY (itemInsertDayStartTime, itemInsertTime, itemId)
);
//Initial row retrieval query (presumably, the position markers will be
appropriately updated after each retrieval)

SELECT *

FROM solution_table_1

WHERE itemInsertDayStartTime IN (mostRecentProcessedItemInsertDayStartTime,
mostRecentProcessedItemInsertDayStartTime + 86400000, ...)

AND itemInsertTime > mostRecentProcessedItemInsertTime

LIMIT 30

Pros:
- Shards table data across the cluster

Cons:
- Requires the maintenance of position markers
- Requires the explicit specification of partitions (which may or may not
have data) to target for retrievals which page the table data by
itemInsertTime
- Requires correspondence with multiple nodes to satisfy retrievals which
page the table data by itemInsertTime


Solution #2:


CREATE TABLE IF NOT EXISTS solution_table_2
(
  itemInsertTime timestamp
itemId timeuuid
PRIMARY KEY (itemInserTime, itemId)
);
CREATE INDEX IF NOT EXISTS ON solution_table_2 (itemInsertTime);

//Initial row retrieval query
SELECT * FROM solution_table_2 WHERE itemInsertTime > 0 LIMIT 30 ALLOW
FILTERING

Pros:
- Shards table data across the cluster
- Enables retrievals which page table data by itemInsertTime to be
conducted without explicitly specifying partitions to target

Cons:
- Specifies the creation of an index on a high-cardinality column
- Requires correspondence with multiple nodes, as well as data filtering,
to satisfy retrievals which page the table data by itemInsertTime
Solution #3:

CREATE TABLE IF NOT EXISTS solution_table_3
(
itemInsertTime timestamp
itemId timeuuid
itemInsertDayStartTime timestamp
PRIMARY KEY (itemInsertTime, itemId)
);
CREATE INDEX IF NOT EXISTS ON solution_table_3 (itemInsertDayStartTime);
//Initial row retrieval query
SELECT * FROM solution_table_3 WHERE itemInsertDayStartTime > 0 LIMIT 30
ALLOW FILTERING

Pros:
- Shards table data across the cluster
- Enables retrievals which page table data by itemInsertTime to be
conducted without explicitly specifying partitions to target
- Specifies the creation of an index on a column with anticipatively
suitable cardinality

Cons:
- Requires correspondence with multiple nodes, as well as data filtering,
to satisfy retrievals which page the table data by itemInsertTime
Solution #4:

CREATE TABLE IF NOT EXISTS solution_table_4
(
dummyPartitionInt int
itemInsertTime timestamp
itemId timeuuid
PRIMARY KEY (dummyPartitionInt, itemInsertTime, itemId)
);
//Initial row retrieval query (assuming all rows are inserted with a
dummyPartitionInt value of 0)
SELECT * FROM solution_table_4 WHERE dummyPartitionInt = 0 AND
itemInsertTime > 0 LIMIT 30


Pros:
- Enables retrieval to be satisfied with a single replica set
- Enables retrievals which page table data by itemInsertTime to be
conducted without explicitly specifying more than one partition to target

Cons:
- Requires the use of a "dummy" column
- Specifies the constriction of table data (and as a result, all operations
on it) to a single partition

What is the best way to model my time series?

Reply via email to