Folks, I have two alternative schemas for my time-series data and wanted to weigh in on which one to choose.
The query is: given an id and a timestamp, read the metrics associated with that id. Records are inserted every 5 minutes, and the number of ids is 2 million, so every 5 minutes 2 million records are written. Bucket range: 0 - 5K.

Schema 1)

    create table metrics (
        id timeuuid,
        bucketid int,
        date date,
        timestamp timestamp,
        metricName1 bigint,
        metricName2 bigint,
        ...
        metricName300 bigint,
        primary key ((date, bucketid), id, timestamp)
    );

(table name added here for readability; bucketid is just a murmur3 hash of the id, acting as a splitter to group ids into partitions)

Pros:
- Efficient write performance, since data is written to a minimal number of partitions.

Cons:
- Works well when queried programmatically, but is somewhat inflexible. If it has to be integrated with a third-party BI tool like Tableau, the bucketid cannot be generated from Tableau, since it is not part of the view.

Schema 2)

Same as above, but without bucketid and date:

    primary key (id, timestamp)

Pros:
- BI tools don't need to generate bucket-id lookups.

Cons:
- Too many partitions are written every 5 minutes: 2 million records land in 2 million distinct partitions.

I believe the commit-log write cost is the same for Schema 1 and Schema 2, but the actual bottleneck could be compaction: memtables are flushed to SSTables frequently (depending on memory settings), and each SSTable maintains a partition index with byte offsets. I want to gauge how bad the write/compaction performance of Schema 2 can get, given the many disk seeks involved. Can compacting many SSTables, each with a very large number of partition-index entries due to the high partition count, become a bottleneck?

Any in-depth performance explanation of Schema 2 would be very much appreciated.

Thanks,
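For concreteness, here is a sketch of the application-side bucket computation for Schema 1. The 32-bit murmur3 variant and the mod-5K step are my assumptions (the description above only says "a murmur3 hash of the id"; Cassandra's own Murmur3Partitioner uses the 128-bit variant, but any stable hash works for an application-level bucket):

```python
# Sketch: derive the Schema 1 partition key component bucketid from an id.
# Assumptions: murmur3 x86 32-bit, reduced mod 5000 ("Bucket Range: 0 - 5K").

def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python murmur3 x86 32-bit hash."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    length = len(data)
    rounded = length - (length & 3)

    # Body: process 4-byte little-endian blocks.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: 0-3 leftover bytes.
    tail = data[rounded:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization mix.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


NUM_BUCKETS = 5000  # assumed from "Bucket Range: 0 - 5K"

def bucket_id(row_id: str) -> int:
    """Deterministic bucket for a given id (e.g. the timeuuid as a string)."""
    return murmur3_32(row_id.encode()) % NUM_BUCKETS
```

The same function must run in every writer and reader so that `(date, bucketid)` can be reconstructed at query time; this is exactly the piece a BI tool like Tableau cannot evaluate, which is the inflexibility mentioned under Schema 1's cons.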