Hello,

  The choice of compaction strategy, leveled vs. size-tiered, will affect the
amount of compaction that occurs (i.e. compaction time) more than the choice
between the two data models.  This blog post has more information on the
compaction strategies:
http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

  My recommendation would be to choose the compaction strategy based on the
article listed above and the data model based on your query access pattern,
as opposed to trying to figure out which model would be more advantageous
for compaction.  If your system can handle the disk I/O (e.g. it has SSDs),
then Leveled compaction + wide rows should give you the best read
performance.  Your queries should be satisfied with far fewer disk seeks
because of the wide row access pattern and the advantages of Leveled
compaction.
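
  As a rough sketch of that combination, the wide-row table with leveled
compaction might be declared as below.  The table name, column names, types,
and literal values are hypothetical stand-ins, not taken from your actual
schema.

```cql
-- Hypothetical schema: one partition per (customer_id, report_type),
-- one clustered row per report_date.
CREATE TABLE reports_by_customer_type (
    customer_id  text,
    report_type  text,
    report_date  timestamp,
    report_data  blob,
    PRIMARY KEY ((customer_id, report_type), report_date)
) WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

-- A date-range read stays inside a single partition and follows
-- the clustering order on report_date.
SELECT report_date, report_data
  FROM reports_by_customer_type
 WHERE customer_id = 'c123'
   AND report_type = 'daily_summary'
   AND report_date >= '2014-03-01'
   AND report_date <  '2014-04-01';
```

  Dropping the report_date restriction would return every date for that
customer and type from the same partition.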

  As for your specific question about which model is going to be "better"
for compaction, I don't know the answer.  I would think that merge sorting
more, smaller records could be somewhat "better" for compaction, but I have
no way to quantify that for you.  However, the wide row scenario sounds like
it will provide a significant advantage to your query access times.

  Hope that helps and if not, maybe someone else can provide the answer to
your specific question regarding the impacts of your model on compaction.

Thanks,

Jonathan


Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>



On Wed, Mar 26, 2014 at 2:54 PM, Donald Smith <
donald.sm...@audiencescience.com> wrote:

>  My underlying question is about the effects of the partitioning key on
> compaction.   Specifically, would having date as part of the partitioning
> key make compaction easier (because compaction wouldn't have to merge wide
> rows over multiple days)?   According to the person on irc, it wouldn't
> make much difference.
>
>
> We care mostly about read times. If read times were *all* we cared about,
> we'd use a CQL primary key of *((customer_id,type),date)*, especially
> since it lets us efficiently iterate over all dates for a given customer
> and type.  I also care about compaction time, and if the other primary key
> form decreased compaction time, I might go for it. We have terabytes of
> data.
>
>
>
> I don't think we ever have to query all types for a given customer or
> date.  That is, we are always given a specific customer and type, plus
> usually but not always a date.
>
>
>
> Thanks, Don
>
>
>
> *From:* Jonathan Lacefield [mailto:jlacefi...@datastax.com]
> *Sent:* Wednesday, March 26, 2014 11:20 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Question about how compaction and partition keys interact
>
>
>
> Don,
>
>
>
>   What is the underlying question?  Are you trying to figure out what's
> going to be faster for reads, or are you really concerned about storage?
>
>
>
>   The typical recommendation is to model tables based on query access, to
> enable the fastest read performance.
>
>
>
>   In your example, will your app's queries look for
>
>   1)  customer interactions by type by day, with the ability to
>
>            - sort by day within a type
>
>            - grab ranges of dates for a type quickly
>
>            - or pull all dates (and cell data) for a type
>
>    or
>
>  2)  customer interactions by date by type, with the ability to
>
>            - sort by type within a date
>
>            - grab ranges of types for a date quickly
>
>            - or pull all types' data for a date
>
>
>
>   We also typically recommend that partitions stay within ~100k columns
> or ~100MB per partition.  With your first scenario, wide row, you wouldn't
> hit that column count for ~273 years (at one column per day) :)
>
>
>
>   What's interesting in your modeling scenario is that, with the current
> options, you don't have the ability to easily pull all dates for a customer
> without specifying the type, specific dates, or using ALLOW FILTERING.  Did
> you ever consider partitioning simply on customer and using date and type
> as clustering keys?
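>
>   A minimal sketch of that alternative, with hypothetical table and column
> names and assumed types (none of them come from your schema), might look
> like this:
>
> ```cql
> -- Hypothetical schema: one partition per customer, with rows
> -- clustered by report_date first and report_type second.
> CREATE TABLE reports_by_customer (
>     customer_id  text,
>     report_date  timestamp,
>     report_type  text,
>     report_data  blob,
>     PRIMARY KEY (customer_id, report_date, report_type)
> );
>
> -- All dates and types for one customer, from a single partition.
> SELECT * FROM reports_by_customer WHERE customer_id = 'c123';
>
> -- All types for one customer on one date.
> SELECT * FROM reports_by_customer
>  WHERE customer_id = 'c123' AND report_date = '2014-03-26';
> ```
>
>   The order of the clustering columns matters here: with report_date
> first, restricting on report_type alone would not be an efficient slice,
> and each partition would hold every type for the customer, so it grows
> faster than in the other options.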
>
>
>
>   Hope that helps.
>
>
>
> Jonathan
>
>
> On Wed, Mar 26, 2014 at 1:22 PM, Donald Smith <
> donald.sm...@audiencescience.com> wrote:
>
> In CQL we need to decide between using *((customer_id,type),date)* as the
> CQL primary key for a reporting table, versus *((customer_id,date),type)*.
>
>
>
> We store reports for every day.  If we use *(customer_id,type)* as the
> partition key (physical key), then we have  a WIDE ROW where each date's
> data is stored in a different column. Over time, as new reports are added
> for different dates, the row will get wider and wider, and I thought that
> might cause more work for compaction.
>
>
>
> So, would a partition key of *(customer_id,date)* yield better compaction
> behavior?
>
>
>
> Again, if we use *(customer_id,type)* as the partition key, then over
> time, as new columns are added to that row for different dates, I'd think
> that compaction would have to merge new data for a given physical row from
> multiple sstables. That would make compaction expensive.  But if we use
> *(customer_id,date)* as the partition key, then new data will be added to
> *new physical rows*, and so compaction would have less work to do?
>
>
>
> My question is really about how compaction interacts with partition keys.
> Someone on the Cassandra irc channel,
> http://webchat.freenode.net/?channels=#cassandra, said that when
> partition keys overlap between sstables, there's only "slightly" more work
> to do than when they don't, for merging sstables in compaction.  So he
> thought the first form, *((customer_id,type),date)*, would be better.
>
>
>
> One advantage of the first form, *((customer_id,type),date)*, is that
> we can get all report data for all dates for a given customer and type in a
> single wide row -- and we do have an (uncommon) use case for such reports.
>
>
>
> If we used a primary key of *((customer_id,type,date))*, then the rows
> would be un-wide; that wouldn't take advantage of clustering columns and
> (like the second form) wouldn't support the (uncommon) use case mentioned
> in the previous paragraph.
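>
> For comparison, the second and third forms could be sketched roughly as
> follows. The table names, column names, and types are hypothetical and
> only illustrate the key layouts discussed above.
>
> ```cql
> -- Second form: one partition per (customer_id, report_date),
> -- clustered by report_type.
> CREATE TABLE reports_by_customer_date (
>     customer_id  text,
>     report_date  timestamp,
>     report_type  text,
>     report_data  blob,
>     PRIMARY KEY ((customer_id, report_date), report_type)
> );
>
> -- Third form: every (customer_id, report_type, report_date) is its
> -- own partition, with no clustering columns.
> CREATE TABLE reports_by_customer_type_date (
>     customer_id  text,
>     report_type  text,
>     report_date  timestamp,
>     report_data  blob,
>     PRIMARY KEY ((customer_id, report_type, report_date))
> );
>
> -- With either form, a read names one date partition at a time.
> SELECT report_data FROM reports_by_customer_type_date
>  WHERE customer_id = 'c123'
>    AND report_type = 'daily_summary'
>    AND report_date = '2014-03-26';
> ```
>
> Under these layouts, fetching all dates for a given customer and type
> would mean issuing one such read per date instead of reading a single
> wide row.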
>
>
>
> Thanks, Don
>
>
>
> *Donald A. Smith* | Senior Software Engineer
> P: 425.201.3900 x 3866
> C: (206) 819-5965
> F: (646) 443-2333
> dona...@audiencescience.com
>
