Re: Interesting use case

kurt Greaves Wed, 08 Jun 2016 21:52:02 -0700

I would say it's probably due to the significantly larger number of
partitions when using the overwrite method, but really you should be
seeing similar performance unless one of the schemas ends up generating a
lot more disk IO.
If you're planning to read the last N values for an event at the same time,
the widerow schema would be better; otherwise, reading N events using the
overwrite schema will result in you hitting N partitions. You really need
to take into account how you're going to read the data when you design a
schema, not only how many writes you can push through.
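
To make the read-path point concrete, here is a toy in-memory model of the
eventvalue_widerow table from the quoted message. It is plain Python, not
Cassandra driver code, and the helper names (write, read_last_n) and the
sample data are hypothetical. It only illustrates that, with one partition
per (system_name, event_name) and rows ordered newest-first, reading the
last N values is a slice from the head of a single partition.

```python
from datetime import datetime, timedelta

# Toy stand-in for eventvalue_widerow: one partition per
# (system_name, event_name); rows are kept newest-first to mimic
# CLUSTERING ORDER BY (event_time DESC). Not real Cassandra code.
widerow = {}


def write(system_name, event_name, event_time, event_value):
    """Append a row to its partition and keep the partition newest-first."""
    partition = widerow.setdefault((system_name, event_name), [])
    partition.append((event_time, event_value))
    partition.sort(key=lambda row: row[0], reverse=True)


def read_last_n(system_name, event_name, n):
    """Read the n newest rows; only one partition is touched."""
    return widerow.get((system_name, event_name), [])[:n]


start = datetime(2016, 6, 8)
for i in range(5):
    write("sys1", "cpu", start + timedelta(seconds=i), bytes([i]))

latest = read_last_n("sys1", "cpu", 3)
print([value for _, value in latest])
```

In this model the five writes all land in the same partition, so the read
for the last three values never has to look anywhere else.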


On 8 June 2016 at 19:02, John Thomas <jthom...@gmail.com> wrote:

> We have a use case where we are storing event data for a given system and
> only want to retain the last N values.  Storing extra values for a while is
> fine, as long as it isn’t too long, but we must never keep fewer than N.  We
> can't use TTLs to delete the data because we can't be sure how frequently
> events will arrive and could end up losing everything.  Is there any
> built-in mechanism to accomplish this or a known pattern that we can follow?
> The events will be read and written at a pretty high frequency, so the
> solution would have to be performant and not fragile under stress.
>
> We’ve played with a schema that just has N distinct columns with one value
> in each, but have found that overwrites seem to perform much worse than wide
> rows.  The use case we tested only required that we store the most recent
> value:
>
> CREATE TABLE eventyvalue_overwrite (
>     system_name text,
>     event_name text,
>     event_time timestamp,
>     event_value blob,
>     PRIMARY KEY (system_name, event_name));
>
> CREATE TABLE eventvalue_widerow (
>     system_name text,
>     event_name text,
>     event_time timestamp,
>     event_value blob,
>     PRIMARY KEY ((system_name, event_name), event_time))
>     WITH CLUSTERING ORDER BY (event_time DESC);
>
> We tested it against the DataStax AMI on EC2 with 6 nodes, replication
> factor 3, write consistency 2, and default settings with a write-only
> workload and got 190K/s for wide row and 150K/s for overwrite.  Thinking
> through the write path, it seems the performance should be pretty similar,
> with probably smaller sstables for the overwrite schema.  Can anyone explain
> the big difference?
>
> The wide row solution is more complex in that it requires a separate
> cleanup thread that will handle deleting the extra values.  If that’s the
> path we have to follow, we’re thinking we’d add a bucket of some sort so
> that we can delete an entire partition at a time after copying some values
> forward, on the assumption that deleting the whole partition is much better
> than deleting some slice of the partition.  Is that true?  Also, is there
> any difference between setting a really short TTL and doing a delete?
>
> I know there are a lot of questions in there but we’ve been going back and
> forth on this for a while and I’d really appreciate any help you could give.
>
> Thanks,
> John
>



-- 
Kurt Greaves
k...@instaclustr.com
www.instaclustr.com
