Thanks Jeff & Jonathan,
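
Jeff, just to make sure I understand the time-bucketing idea, here is a rough sketch of what it might look like for my table. The table name and the choice of one day per (id, date) partition are my own assumptions rather than anything from the thread, and the TWCS window would be tuned to match the bucket size:

    create table metrics_by_day (   -- hypothetical name
        id timeuuid,
        date date,                  -- day bucket derived from the sample timestamp
        timestamp timestamp,
        metricName1 bigint,
        metricName2 bigint,
        -- metricName3 .. metricName300 as in Schema 1
        metricName300 bigint,
        primary key ((id, date), timestamp)
    ) with compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    };

With roughly 288 samples per id per day, each partition stays small, and a read for a given id and timestamp touches a single partition, which under TWCS should usually be served from a single SSTable.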
On Tue, May 29, 2018 at 10:41 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> I wrote a post on this topic a while ago, might be worth reading over:
> http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
>
> On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jji...@gmail.com> wrote:
> >
> > There's a third option, which is doing the bucketing by time instead of by hash. That tends to perform quite well if you're using TWCS, as it makes it quite likely that a read can be served by a single SSTable.
> >
> > --
> > Jeff Jirsa
> >
> > On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
> > >
> > > Folks,
> > > I have two alternatives for the time-series schema I have, and wanted to weigh in on one of them.
> > >
> > > The query is: given an id and a timestamp, read the metrics associated with that id.
> > >
> > > Records are inserted every 5 minutes, and the number of ids is 2 million, so every 5 minutes 2 million records are written.
> > >
> > > Bucket range: 0 - 5K.
> > >
> > > Schema 1)
> > >
> > > create table (
> > >     id timeuuid,
> > >     bucketid int,
> > >     date date,
> > >     timestamp timestamp,
> > >     metricName1 bigint,
> > >     metricName2 bigint,
> > >     ...
> > >     metricName300 bigint,
> > >     primary key ((date, bucketid), id, timestamp)
> > > )
> > >
> > > bucketid is just a murmur3 hash of the id, which acts as a splitter to group ids into a partition.
> > >
> > > Pros:
> > > Efficient write performance, since data is written to a minimal number of partitions.
> > >
> > > Cons:
> > > While this schema works best when queried programmatically, it is a bit inflexible if it has to be integrated with third-party BI tools like Tableau; the bucketid cannot be generated from Tableau, since it's not part of the view, etc.
> > >
> > > Schema 2)
> > > Same as above, without bucketid and date.
> > >
> > > primary key (id, timestamp)
> > >
> > > Pros:
> > > BI tools don't need to generate bucketid lookups.
> > >
> > > Cons:
> > > Too many partitions are written every 5 minutes: 2 million records written to 2 million distinct partitions.
> > >
> > > I believe writing this data to the commit log is the same for Schema 1 and Schema 2, but the actual performance bottleneck could be compaction, since memtables are flushed to SSTables frequently depending on the memory settings, and every SSTable maintains a partition index with byte offsets.
> > >
> > > I wanted to gauge how badly the performance of Schema 2 could degrade with respect to writes and compaction having to do many disk seeks.
> > >
> > > Compacting many SSTables, each with a very large number of partition-index entries because of the high number of partitions - can this be a bottleneck?
> > >
> > > Any in-depth performance explanation of Schema 2 would be very much appreciated.
> > >
> > > Thanks,
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
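
P.S. In case it helps anyone reading the archive later, here is roughly how the read path differs between the two schemas above. The table names are hypothetical, and I'm assuming bucketid is something like murmur3(id) mod 5000 given the stated 0 - 5K bucket range:

    -- Schema 1: the client has to supply the bucket, which a BI tool like Tableau cannot derive
    select metricName1, metricName300
    from metrics_schema1
    where date = ? and bucketid = ? and id = ? and timestamp = ?;

    -- Schema 2: only id and timestamp are needed, but every row gets its own partition on write
    select metricName1, metricName300
    from metrics_schema2
    where id = ? and timestamp = ?;

The bind markers (?) stand in for values a client driver would supply through a prepared statement.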