I wrote a post on this topic a while ago; it might be worth reading over:
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jji...@gmail.com> wrote:

> There's a third option, which is bucketing by time instead of by hash. This tends to perform quite well if you're using TWCS, as it makes it quite likely that a read can be served by a single SSTable.
>
> --
> Jeff Jirsa
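A minimal sketch of the time-bucketed layout Jeff describes, assuming day-sized buckets and a one-day TWCS window; the table name, bucket granularity, and window settings are illustrative assumptions, not from the thread:

    create table metrics_by_day (
        id timeuuid,
        day date,                  -- time bucket: the day the sample falls in
        timestamp timestamp,
        metricName1 bigint,
        -- metricName2 through metricName300 elided, as in the original schema
        primary key ((id, day), timestamp)
    ) with clustering order by (timestamp desc)
      and compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': '1'
      };

A read for one id and one day then hits a single partition, and with TWCS that day's data will usually sit in a single SSTable, which is the property Jeff mentions.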
> On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
>
> Folks,
> I have two alternatives for my time-series schema and wanted to weigh in on one of them.
> The query is: given an id and a timestamp, read the metrics associated with that id.
> Records are inserted every 5 minutes, and the number of ids is 2 million,
> so every 5 minutes 2 million records are written.
> Bucket range: 0 - 5K.
>
> Schema 1)
>
> create table metrics (
>     id timeuuid,
>     bucketid int,
>     date date,
>     timestamp timestamp,
>     metricName1 bigint,
>     metricName2 bigint,
>     -- metricName3 through metricName300, all bigint
>     primary key ((date, bucketid), id, timestamp)
> );
>
> bucketid is just a murmur3 hash of the id, which acts as a splitter to group ids into a partition.
>
> Pros:
> - Efficient write performance, since data is written to a minimal number of partitions.
>
> Cons:
> - While this schema works well when queried programmatically, it is inflexible if it has to be integrated with third-party BI tools like Tableau: bucketid cannot be generated from Tableau, as it is not part of the view.
>
> Schema 2)
>
> Same as above, but without bucketid and date:
>
> primary key (id, timestamp)
>
> Pros:
> - BI tools don't need to generate bucket-id lookups.
>
> Cons:
> - Too many partitions are written every 5 minutes: the 2 million records go to 2 million distinct partitions.
>
> I believe writing this data to the commit log costs the same for Schema 1 and Schema 2, but the real bottleneck could be compaction, since memtables are flushed to SSTables based on the memory settings, and each SSTable maintains a partition index with byte offsets.
> I wanted to gauge how bad the performance of Schema 2 can get with respect to writes and compaction having to do many disk seeks.
> Can compacting many SSTables, each with a very large number of partition-index entries because of the high partition count, become a bottleneck?
> Any in-depth performance explanation of Schema 2 would be very much appreciated.
>
> Thanks,
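For concreteness, a sketch of the read each schema implies; the table names, example day, and id value are illustrative, and the modulus follows from the 0 - 5K bucket range above:

    -- Schema 1: the client must derive bucketid itself (e.g. murmur3(id) mod 5000),
    -- which is exactly what a BI tool like Tableau cannot do against a plain view
    select metricName1, metricName2
    from metrics_s1
    where date = '2018-05-29'
      and bucketid = 1234
      and id = 50554d6e-29bb-11e5-b345-feff819cdc9f
      and timestamp >= '2018-05-29 08:00:00'
      and timestamp < '2018-05-29 09:00:00';

    -- Schema 2: no derived key to compute, but every id is its own partition
    select metricName1, metricName2
    from metrics_s2
    where id = 50554d6e-29bb-11e5-b345-feff819cdc9f
      and timestamp >= '2018-05-29 08:00:00'
      and timestamp < '2018-05-29 09:00:00';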