I wrote a post on this topic a while ago; it might be worth reading over:
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jji...@gmail.com> wrote:

> There's a third option, which is bucketing by time instead of by hash. This tends to perform quite well if you're using TWCS, as it makes it quite likely that a read can be served by a single SSTable.
>
> --
> Jeff Jirsa
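A minimal sketch of the time-bucketed layout Jeff describes, assuming day-sized buckets and a one-day TWCS window; the table name, bucket granularity, and window settings are illustrative assumptions, not from the thread:

    create table metrics_by_day (
        id timeuuid,
        day date,                  -- time bucket: the day the sample falls in
        timestamp timestamp,
        metricName1 bigint,
        -- metricName2 through metricName300 elided, as in the original schema
        primary key ((id, day), timestamp)
    ) with clustering order by (timestamp desc)
      and compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': '1'
      };

A read for one id and one day then hits a single partition, and with TWCS that day's data will usually sit in a single SSTable, which is the property Jeff mentions.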
> On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
>
> Folks,
> I have two alternatives for my time-series schema and wanted to weigh in on one of them.
> The query is: given an id and a timestamp, read the metrics associated with that id.
> Records are inserted every 5 minutes, and the number of ids is 2 million,
> so every 5 minutes 2 million records are written.
> Bucket range: 0 - 5K.
>
> Schema 1)
>
> create table metrics (
>     id timeuuid,
>     bucketid int,
>     date date,
>     timestamp timestamp,
>     metricName1 bigint,
>     metricName2 bigint,
>     -- metricName3 through metricName300, all bigint
>     primary key ((date, bucketid), id, timestamp)
> );
>
> bucketid is just a murmur3 hash of the id, which acts as a splitter to group ids into a partition.
>
> Pros:
> - Efficient write performance, since data is written to a minimal number of partitions.
>
> Cons:
> - While this schema works well when queried programmatically, it is inflexible if it has to be integrated with third-party BI tools like Tableau: bucketid cannot be generated from Tableau, as it is not part of the view.
>
> Schema 2)
>
> Same as above, but without bucketid and date:
>
> primary key (id, timestamp)
>
> Pros:
> - BI tools don't need to generate bucket-id lookups.
>
> Cons:
> - Too many partitions are written every 5 minutes: the 2 million records go to 2 million distinct partitions.
>
> I believe writing this data to the commit log costs the same for Schema 1 and Schema 2, but the real bottleneck could be compaction, since memtables are flushed to SSTables based on the memory settings, and each SSTable maintains a partition index with byte offsets.
> I wanted to gauge how bad the performance of Schema 2 can get with respect to writes and compaction having to do many disk seeks.
> Can compacting many SSTables, each with a very large number of partition-index entries because of the high partition count, become a bottleneck?
> Any in-depth performance explanation of Schema 2 would be very much appreciated.
>
> Thanks,
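For concreteness, a sketch of the read each schema implies; the table names, example day, and id value are illustrative, and the modulus follows from the 0 - 5K bucket range above:

    -- Schema 1: the client must derive bucketid itself (e.g. murmur3(id) mod 5000),
    -- which is exactly what a BI tool like Tableau cannot do against a plain view
    select metricName1, metricName2
    from metrics_s1
    where date = '2018-05-29'
      and bucketid = 1234
      and id = 50554d6e-29bb-11e5-b345-feff819cdc9f
      and timestamp >= '2018-05-29 08:00:00'
      and timestamp < '2018-05-29 09:00:00';

    -- Schema 2: no derived key to compute, but every id is its own partition
    select metricName1, metricName2
    from metrics_s2
    where id = 50554d6e-29bb-11e5-b345-feff819cdc9f
      and timestamp >= '2018-05-29 08:00:00'
      and timestamp < '2018-05-29 09:00:00';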