Thanks Jeff & Jonathan,
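
Jeff, just to make sure I understand the time-bucketing idea, here is a rough sketch of what it might look like for my table. The table name and the choice of one day per (id, date) partition are my own assumptions rather than anything from the thread, and the TWCS window would be tuned to match the bucket size:

    create table metrics_by_day (   -- hypothetical name
        id timeuuid,
        date date,                  -- day bucket derived from the sample timestamp
        timestamp timestamp,
        metricName1 bigint,
        metricName2 bigint,
        -- metricName3 .. metricName300 as in Schema 1
        metricName300 bigint,
        primary key ((id, date), timestamp)
    ) with compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    };

With roughly 288 samples per id per day, each partition stays small, and a read for a given id and timestamp touches a single partition, which under TWCS should usually be served from a single SSTable.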
On Tue, May 29, 2018 at 10:41 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> I wrote a post on this topic a while ago, might be worth reading over:
> http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
>
> On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jji...@gmail.com> wrote:
> >
> > There's a third option, which is doing the bucketing by time instead of by hash. That tends to perform quite well if you're using TWCS, as it makes it quite likely that a read can be served by a single SSTable.
> >
> > --
> > Jeff Jirsa
> >
> > On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
> > >
> > > Folks,
> > > I have two alternatives for the time-series schema I have, and wanted to weigh in on one of them.
> > >
> > > The query is: given an id and a timestamp, read the metrics associated with that id.
> > >
> > > Records are inserted every 5 minutes, and the number of ids is 2 million, so every 5 minutes 2 million records are written.
> > >
> > > Bucket range: 0 - 5K.
> > >
> > > Schema 1)
> > >
> > > create table (
> > >     id timeuuid,
> > >     bucketid int,
> > >     date date,
> > >     timestamp timestamp,
> > >     metricName1 bigint,
> > >     metricName2 bigint,
> > >     ...
> > >     metricName300 bigint,
> > >     primary key ((date, bucketid), id, timestamp)
> > > )
> > >
> > > bucketid is just a murmur3 hash of the id, which acts as a splitter to group ids into a partition.
> > >
> > > Pros:
> > > Efficient write performance, since data is written to a minimal number of partitions.
> > >
> > > Cons:
> > > While this schema works best when queried programmatically, it is a bit inflexible if it has to be integrated with third-party BI tools like Tableau; the bucketid cannot be generated from Tableau, since it's not part of the view, etc.
> > >
> > > Schema 2)
> > > Same as above, without bucketid and date.
> > >
> > > primary key (id, timestamp)
> > >
> > > Pros:
> > > BI tools don't need to generate bucketid lookups.
> > >
> > > Cons:
> > > Too many partitions are written every 5 minutes: 2 million records written to 2 million distinct partitions.
> > >
> > > I believe writing this data to the commit log is the same for Schema 1 and Schema 2, but the actual performance bottleneck could be compaction, since memtables are flushed to SSTables frequently depending on the memory settings, and every SSTable maintains a partition index with byte offsets.
> > >
> > > I wanted to gauge how badly the performance of Schema 2 could degrade with respect to writes and compaction having to do many disk seeks.
> > >
> > > Compacting many SSTables, each with a very large number of partition-index entries because of the high number of partitions - can this be a bottleneck?
> > >
> > > Any in-depth performance explanation of Schema 2 would be very much appreciated.
> > >
> > > Thanks,
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
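
P.S. In case it helps anyone reading the archive later, here is roughly how the read path differs between the two schemas above. The table names are hypothetical, and I'm assuming bucketid is something like murmur3(id) mod 5000 given the stated 0 - 5K bucket range:

    -- Schema 1: the client has to supply the bucket, which a BI tool like Tableau cannot derive
    select metricName1, metricName300
    from metrics_schema1
    where date = ? and bucketid = ? and id = ? and timestamp = ?;

    -- Schema 2: only id and timestamp are needed, but every row gets its own partition on write
    select metricName1, metricName300
    from metrics_schema2
    where id = ? and timestamp = ?;

The bind markers (?) stand in for values a client driver would supply through a prepared statement.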