There’s a third option: bucketing by time instead of by hash. That tends to perform quite well if you’re using TWCS, since it makes it quite likely that a read can be served by a single SSTable.
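The time-bucketing idea can be sketched as follows. This is a minimal sketch, not code from the thread: the one-day bucket width, the `time_bucket` helper, and the commented schema shape are all assumptions chosen to line up with a hypothetical 1-day TWCS compaction window.

```python
from datetime import datetime, timezone

# Assumption: one-day time buckets, chosen to match a 1-day TWCS window.
BUCKET_SECONDS = 24 * 60 * 60


def time_bucket(ts: datetime) -> int:
    """Partition-key component: epoch seconds floored to the bucket width."""
    return int(ts.timestamp()) // BUCKET_SECONDS * BUCKET_SECONDS


# Hypothetical schema using the bucket:
#   PRIMARY KEY ((id, time_bucket), timestamp)
# A read for (id, timestamp) derives the bucket from the timestamp alone,
# so a BI tool only needs this arithmetic, not a separate hash lookup, and
# with TWCS the bucket's rows tend to live in a single SSTable.

t = datetime(2018, 5, 29, 6, 49, tzinfo=timezone.utc)
print(time_bucket(t))  # every timestamp from the same UTC day maps here
```

Because all rows for one id within the window share a partition, a range read over that window stays in one partition, which is what makes the single-SSTable read likely.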
-- 
Jeff Jirsa

> On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
> 
> Folks,
> I have two alternatives for the time-series schema I have, and wanted to
> weigh in on one of them.
> 
> The query is: given an id and a timestamp, read the metrics associated
> with that id.
> 
> Records are inserted every 5 minutes, and the number of ids is 2 million,
> so every 5 minutes 2 million records are written.
> 
> Bucket range: 0 - 5K.
> 
> Schema 1)
> 
> create table (
>     id timeuuid,
>     bucketid int,
>     date date,
>     timestamp timestamp,
>     metricName1 bigint,
>     metricName2 bigint,
>     ...
>     metricName300 bigint,
>     Primary Key ((date, bucketid), id, timestamp)
> )
> 
> bucketid is just a murmur3 hash of the id, which acts as a splitter to
> group ids into a partition.
> 
> Pros:
> Efficient write performance, since data is written to a minimal number
> of partitions.
> 
> Cons:
> While this schema works best when queried programmatically, it is a bit
> inflexible if it has to be integrated with third-party BI tools like
> Tableau: the bucketid cannot be generated from Tableau, as it's not part
> of the view, etc.
> 
> Schema 2)
> Same as above, without bucketid and date.
> 
> Primary Key (id, timestamp)
> 
> Pros:
> BI tools don't need to generate bucketid lookups.
> 
> Cons:
> Too many partitions are written every 5 minutes: 2 million records
> written to 2 million distinct partitions.
> 
> I believe writing this data to the commit log is the same for Schema 1
> and Schema 2, but the actual performance bottleneck could be compaction,
> since data from the memtable is flushed to SSTables regularly based on
> the memory settings, and every SSTable maintains a partition index with
> byte offsets.
> 
> I wanted to gauge how bad the performance of Schema 2 could get with
> respect to writes and compaction having to do many disk seeks.
> Compacting many tables, but with too many partition-index entries because
> of the high number of partitions: can this be a bottleneck?
> 
> Any in-depth performance explanation of Schema 2 would be very much
> helpful.
> 
> Thanks,
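For reference, Schema 1's bucketid computation can be sketched as below. The thread says the bucket is a murmur3 hash of the id; Python's standard library has no murmur3, so this sketch substitutes `zlib.crc32` as a stand-in stable hash (an assumption for illustration only), with the 0 - 5K bucket range taken from the thread.

```python
import uuid
import zlib

NUM_BUCKETS = 5000  # "Bucket range: 0 - 5K" from the thread


def bucket_id(metric_id: uuid.UUID) -> int:
    """Stable hash of the id, reduced to the 0..4999 bucket range.

    The thread uses murmur3; zlib.crc32 stands in here because it is in
    the stdlib and, unlike Python's built-in hash(), is stable across
    processes, which a partition-key component must be.
    """
    return zlib.crc32(metric_id.bytes) % NUM_BUCKETS


mid = uuid.uuid1()  # a timeuuid, as in the schema
b = bucket_id(mid)
assert 0 <= b < NUM_BUCKETS
# The same id always lands in the same bucket, so any client that knows
# the id can reconstruct the partition key (date, bucketid); this is
# exactly the step a BI tool like Tableau cannot perform on its own.
print(b)
```

This also makes the Tableau limitation concrete: the bucket is a function of the id, not of anything visible in the queryable view, so only clients running this computation can address the partition.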