Adding New Node to an Existing Cluster

2012-12-17 Thread Adeel Akbar

Hi,

I am using a Cassandra 1.1.4 cluster with two nodes and a replication 
factor of 2. I added one new node to the existing cluster following the 
instructions at http://www.datastax.com/docs/1.1/operations/cluster_management. 
The new node is only showing 285 MB of data instead of 250+ GB. Please let 
me know whether I need to run any command to balance the data across all 
three nodes.


# /opt/apache-cassandra-1.1.4/bin/nodetool -h localhost ring
Address     DC   Rack  Status  State   Load       Effective-Ownership  Token
                                                                       111379633042792501120498209932601771854
XX.XX.XX.C  DC1  RAC1  Up      Normal  286.58 MB  55.76%               16631224681855479515247230241845664688
XX.XX.XX.B  DC1  RAC1  Up      Normal  278.23 GB  88.55%               91902851206288351623775585543017122534
XX.XX.XX.A  DC1  RAC1  Up      Normal  275.85 GB  55.69%               111379633042792501120498209932601771854


Regards,

Adeel Akbar


Re: multiget_slice SlicePredicate

2012-12-17 Thread Jason Wee
If you have something like 10k rows and fetch 100 columns per row, this is
going to choke the cluster... been there. If you really still have to use
multiget_slice, slice your keys into smaller batches before calling
multiget_slice, and check whether your cluster's pending read requests
increase. Slow down the client sending requests to the cluster if the
pending count keeps going up. :)
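
A minimal sketch of that kind of client-side batching, assuming the Hector
client (the class, column family, and batch-size names here are illustrative,
not from the thread):

import java.util.List;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.MultigetSliceQuery;

public class BatchedMultiget {
    // Fetch rows in small batches instead of one huge multiget_slice call.
    static void fetchInBatches(Keyspace keyspace, List<String> allKeys, int batchSize) {
        StringSerializer ss = StringSerializer.get();
        for (int i = 0; i < allKeys.size(); i += batchSize) {
            List<String> batch =
                allKeys.subList(i, Math.min(i + batchSize, allKeys.size()));
            MultigetSliceQuery<String, String, String> q =
                HFactory.createMultigetSliceQuery(keyspace, ss, ss, ss);
            q.setColumnFamily("MyCF");
            q.setKeys(batch.toArray(new String[0]));
            // null start/finish = whole row; the count (100) applies per row
            q.setRange(null, null, false, 100);
            q.execute();
            // throttle here (e.g. sleep) if pending reads keep climbing
        }
    }
}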


On Tue, Dec 11, 2012 at 6:15 AM, Wei Zhu  wrote:

> Well, not sure how parallel multiget really is. Some say it sends the
> requests to the different nodes in parallel, but on each node they are
> executed sequentially. I haven't bothered looking into the source code yet.
> Does anyone know for sure?
>
> I am using Hector, just copied the thrift definition from Cassandra site
> for reference.
>
> You are right, the count is for each individual row.
>
> Thanks.
> -Wei
>
>   --
> From: "Hiller, Dean"
> To: user@cassandra.apache.org; Wei Zhu <wz1...@yahoo.com>
> Sent: Monday, December 10, 2012 1:13 PM
> Subject: Re: multiget_slice SlicePredicate
>
> What's wrong with multiget? Parallel performance is great from multiple
> disks, so usually that is a good thing.
>
> Also, something looks wrong: since you have list<binary> keys, I would
> expect the Map to be Map<binary, List<ColumnOrSuperColumn>>.
>
> Are you sure you have that correct?  If you set the count to 100, it should
> be 100 columns per row, but it never hurts to run the code and verify.
>
> Later,
> Dean
> PlayOrm Developer
>
>
> From: Wei Zhu <wz1...@yahoo.com>
> Reply-To: user@cassandra.apache.org, Wei Zhu <wz1...@yahoo.com>
> Date: Monday, December 10, 2012 2:07 PM
> To: Cassandra usergroup <user@cassandra.apache.org>
> Subject: multiget_slice SlicePredicate
>
> I know it's probably not a good idea to use multiget, but for my use case
> it's the only choice.
>
> I have question regarding the SlicePredicate argument of the multiget_slice
>
>
> The SlicePredicate takes a slice_range, which takes start, finish and
> count. I suppose start and finish apply to each individual row. How about
> count: is it an accumulative column count across all the rows, or per
> individual row? If I set count to 100, is it 100 columns per row, or total?
>
> Thanks for your reply,
> -Wei
>
> multiget_slice
>
> map<binary, list<ColumnOrSuperColumn>> multiget_slice(list<binary> keys,
> ColumnParent column_parent, SlicePredicate predicate,
> ConsistencyLevel consistency_level)


[RELEASE CANDIDATE] Apache Cassandra 1.2.0-rc1 released

2012-12-17 Thread Sylvain Lebresne
The Cassandra team is pleased to announce the first release candidate for
the
future Apache Cassandra 1.2.0.

Let me first stress that this is not the final release yet and as such is
*not*
ready for production use.

This release is getting very close to a final version but may still contain
bugs. Any testing of this release will help make 1.2.0 final a better
release and would thus be greatly appreciated. If you encounter any problem
during your testing, please report[3,4] it. Be sure to take a look at the
change log[1] and the release notes[2] to see where Cassandra 1.2 differs
from the previous series.

Apache Cassandra 1.2.0-rc1[5] is available as usual from the cassandra
website (http://cassandra.apache.org/download/) and a debian package is
available using the 12x branch (see
http://wiki.apache.org/cassandra/DebianPackaging).
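
For the Debian package, the apt source lines follow the pattern documented
on that wiki page (shown here for convenience; see the page for the signing
key steps):

deb http://www.apache.org/dist/cassandra/debian 12x main
deb-src http://www.apache.org/dist/cassandra/debian 12x main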

Thank you for your help in testing and have fun with it.

[1]: http://goo.gl/s7z8s (CHANGES.txt)
[2]: http://goo.gl/mSQr6 (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: user@cassandra.apache.org
[5]:
http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-1.2.0-rc1


Re: Adding New Node to an Existing Cluster

2012-12-17 Thread Tomas Nunez
These tokens don't seem balanced at all. You should use "nodetool move" to
move those tokens to some balanced values, for instance:

C: 0
B: 56713727820156407428984779325531226112
A: 113427455640312814857969558651062452224
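
For reference, evenly spaced RandomPartitioner tokens are just
i * 2^127 / N for node i of N nodes. A minimal sketch in Java (the last few
digits may differ from float-based token generators such as the one used
above; any evenly spaced values work):

import java.math.BigInteger;

public class TokenGen {
    public static void main(String[] args) {
        int nodes = 3;
        // RandomPartitioner token space is [0, 2^127)
        BigInteger range = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodes; i++) {
            // token for node i: i * 2^127 / nodes
            System.out.println(range.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(nodes)));
        }
    }
}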




2012/12/17 Adeel Akbar 

>  Hi,
>
> I am using a Cassandra 1.1.4 cluster with two nodes and a replication
> factor of 2. I added one new node to the existing cluster following the
> instructions at http://www.datastax.com/docs/1.1/operations/cluster_management.
> The new node is only showing 285 MB of data instead of 250+ GB. Please let
> me know whether I need to run any command to balance the data across all
> three nodes.
>
> # /opt/apache-cassandra-1.1.4/bin/nodetool -h localhost ring
> Address     DC   Rack  Status  State   Load       Effective-Ownership  Token
>                                                                        111379633042792501120498209932601771854
> XX.XX.XX.C  DC1  RAC1  Up      Normal  286.58 MB  55.76%               16631224681855479515247230241845664688
> XX.XX.XX.B  DC1  RAC1  Up      Normal  278.23 GB  88.55%               91902851206288351623775585543017122534
> XX.XX.XX.A  DC1  RAC1  Up      Normal  275.85 GB  55.69%               111379633042792501120498209932601771854
>
> Regards,
>
> Adeel Akbar
>



-- 
Tomàs Núñez
IT-Sysprod
www.groupalia.com
Tel. + 34 93 159 31 00  Fax. + 34 93 396 18 52
Llull, 95-97, 2º planta, 08005 Barcelona
Skype: tomas.nunez.groupalia
tomas.nu...@groupalia.com

Re: Read operations resulting in a write?

2012-12-17 Thread Mike

Thank you Aaron, this was very helpful.

Could it be an issue that this optimization does not really take effect 
until the memtable with the hoisted data is flushed?  In my simple 
example below, the same row is updated and multiple selects of the same 
row will result in multiple writes to the memtable. It seems it may be 
possible (although unlikely) that, if you go from a write-mostly to a 
read-mostly scenario, you could get into a state where you are stuck 
rewriting to the same memtable, and the memtable is not flushed because 
it absorbs the over-writes.  I can foresee this especially if you are 
reading the same rows repeatedly.


I also noticed from the codepaths that if row caching is enabled, this 
optimization will not occur.  We made some changes this weekend to make 
this column family more suitable to row caching and enabled it with a 
small cache.  Our initial results are that it seems to have corrected 
the write counts and has increased performance quite a bit.  However, 
are there any hidden gotchas because this optimization is not 
occurring?  https://issues.apache.org/jira/browse/CASSANDRA-2503 
mentions a "compaction is behind" problem.  Any history on that?  I 
couldn't find too much information on it.


Thanks,
-Mike

On 12/16/2012 8:41 PM, aaron morton wrote:



1) Am I reading things correctly?

Yes.
If you do a read/slice by name and more than min-compaction-threshold 
SSTables were read, the data is re-written so that the next read uses 
fewer SSTables.


2) What is really happening here?  Essentially minor compactions can 
occur between 4 and 32 memtable flushes.  Looking through the code, 
this seems to only affect a couple of types of select statements (when 
selecting a specific column on a specific key being one of them). 
During the time between these two values, every "select" statement 
will perform a write.

Yup, only for reading a row where the column names are specified.
Remember minor compaction when using SizeTiered compaction (the 
default) works on buckets of the same size.


Imagine a row that had been around for a while and had fragments in 
more than Min Compaction Threshold sstables. Say it is 3 SSTables in 
the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 
SSTable reads. If this row is read it will get hoisted back up.


But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st 
tier, it will not be hoisted.


There are a few short circuits in the SliceByName read path. One of 
them is to end the search when we know that no other SSTables contain 
columns that should be considered. So if the 4 columns you read 
frequently are hoisted into the 1st bucket your reads will get handled 
by that one bucket.


It's not every select. Just those that touched more than 
min-compaction-threshold SSTables.



3) Is this desired behavior?  Is there something else I should be 
looking at that could be causing this behavior?

Yes.
https://issues.apache.org/jira/browse/CASSANDRA-2503

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/12/2012, at 12:58 PM, Michael Theroux wrote:



Hello,

We have an unusual situation that I believe I've reproduced, at least 
temporarily, in a test environment.  I also think I see where this 
issue is occurring in the code.


We have a specific column family that is under heavy read and write 
load on a nightly basis.   For the purposes of this description, I'll 
refer to this column family as "Bob".  During this nightly 
processing, sometimes Bob is under very heavy write load, other times 
it is under very heavy read load.


The application is such that when something is written to Bob, a 
write is made to one of two other tables.  We've witnessed a 
situation where the write count on Bob far outstrips the write count 
on either of the other tables, by a factor of 3->10.  This is based 
on the WriteCount available on the column family JMX MBean.  We have 
not been able to find where in our code this is happening, and we 
have gone as far as tracing our CQL calls to determine that the 
relationship between Bob and the other tables is what we expect.


I brought up a test node to experiment, and see a situation where, 
when a "select" statement is executed, a write will occur.


In my test, I perform the following (switching between nodetool and 
cqlsh):


update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush

Then, for a period of time (before a minor compaction occurs), a 
select statement that selects specific columns will cause the write 
count of the column family to increase:


select about,changed,data from bob where key='';

This situation will continue until a minor compaction occurs.

Re: State of Cassandra and Java 7

2012-12-17 Thread Brian Tarbox
I was using jre-7u9-linux-x64  which was the latest at the time.

I'll confess that I did not file any bugs... at the time, the advice from
both the Cassandra and Zookeeper lists was to stay away from Java 7 (and my
boss had had enough of my reporting that "the problem was Java 7" for me
to spend a lot more time getting the details).

Brian


On Sun, Dec 16, 2012 at 4:54 AM, Sylvain Lebresne wrote:

> On Sat, Dec 15, 2012 at 7:12 PM, Michael Kjellman  > wrote:
>
>> What "issues" have you ran into? Actually curious because we push 1.1.5-7
>> really hard and have no issues whatsoever.
>>
>>
> A related question is "which version of Java 7 did you try?" The
> first releases of Java 7 were apparently famous for having many issues, but
> it seems the more recent updates are much more stable.
>
> --
> Sylvain
>
>
>> On Dec 15, 2012, at 7:51 AM, "Brian Tarbox" 
>> wrote:
>>
>> We've reverted all machines back to Java 6 after running into numerous
>> Java 7 issues...some running Cassandra, some running Zookeeper, others just
>> general problems.  I don't recall any other major language release being
>> such a mess.
>>
>>
>> On Fri, Dec 14, 2012 at 5:07 PM, Bill de hÓra  wrote:
>>
>>> "At least that would be one way of defining "officially supported".
>>>
>>> Not quite, because, Datastax is not Apache Cassandra.
>>>
>>> "the only issue related to Java 7 that I know of is CASSANDRA-4958, but
>>> that's osx specific (I wouldn't advise using osx in production anyway) and
>>> it's not directly related to Cassandra anyway so you can easily use the
>>> beta version of snappy-java as a workaround if you want to. So that non
>>> blocking issue aside, and as far as we know, Cassandra supports Java 7. Is
>>> it rock-solid in production? Well, only repeated use in production can
>>> tell, and that's not really in the hand of the project."
>>>
>>> Exactly right. If enough people use Cassandra on Java 7 and enough people
>>> file bugs about Java 7 and enough people work on bugs for Java 7, then
>>> Cassandra will eventually work well enough on Java 7.
>>>
>>> Bill
>>>
>>> On 14 Dec 2012, at 19:43, Drew Kutcharian  wrote:
>>>
>>> > In addition, the DataStax official documentation states: "Versions
>>> earlier than 1.6.0_19 should not be used. Java 7 is not recommended."
>>> >
>>> > http://www.datastax.com/docs/1.1/install/install_rpm
>>> >
>>> >
>>> >
>>> > On Dec 14, 2012, at 9:42 AM, Aaron Turner 
>>> wrote:
>>> >
>>> >> Does Datastax (or any other company) support Cassandra under Java 7?
>>> >> Or will they tell you to downgrade when you have some problem, because
>>> >> they don't support C* running on 7?
>>> >>
>>> >> At least that would be one way of defining "officially supported".
>>> >>
>>> >> On Fri, Dec 14, 2012 at 2:22 AM, Sylvain Lebresne <
>>> sylv...@datastax.com> wrote:
>>> >>> What kind of official statement do you want? As far as I can be
>>> considered
>>> >>> an official voice of the project, my statement is: "various people
>>> run in
>>> >>> production with Java 7 and it seems to work".
>>> >>>
>>> >>> Or to answer the initial question, the only issue related to Java 7
>>> that I
>>> >>> know of is CASSANDRA-4958, but that's osx specific (I wouldn't
>>> advise using
>>> >>> osx in production anyway) and it's not directly related to Cassandra
>>> anyway
>>> >>> so you can easily use the beta version of snappy-java as a
>>> workaround if you
>>> >>> want to. So that non blocking issue aside, and as far as we know,
>>> Cassandra
>>> >>> supports Java 7. Is it rock-solid in production? Well, only repeated
>>> use in
>>> >>> production can tell, and that's not really in the hand of the
>>> project. We do
>>> >>> obviously encourage people to try Java 7 as much as possible and
>>> report any
>>> >>> problem they may run into, but I would have though this goes without
>>> saying.
>>> >>>
>>> >>>
>>> >>> On Fri, Dec 14, 2012 at 4:05 AM, Rob Coli 
>>> wrote:
>>> 
>>>  On Thu, Dec 13, 2012 at 11:43 AM, Drew Kutcharian 
>>> wrote:
>>> > With Java 6 begin EOL-ed soon
>>> > (https://blogs.oracle.com/java/entry/end_of_public_updates_for),
>>> what's the
>>> > status of Cassandra's Java 7 support? Anyone using it in
>>> production? Any
>>> > outstanding *known* issues?
>>> 
>>>  I'd love to see an official statement from the project, due to the
>>>  sort of EOL issues you're referring to. Unfortunately previous
>>>  requests on this list for such a statement have gone unanswered.
>>> 
>>>  The non-official response is that various people run in production
>>>  with Java 7 and it seems to work. :)
>>> 
>>>  =Rob
>>> 
>>>  --
>>>  =Robert Coli
>>>  AIM>ALK - rc...@palominodb.com
>>>  YAHOO - rcoli.palominob
>>>  SKYPE - rcoli_palominodb
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Aaron Turner
>>> >> http://synfin.net/ Twitter: @synfinatic
>>> >> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for
>>> Unix & W

Data Model Review

2012-12-17 Thread Adam Venturella
My use case is capturing some information about Instagram photos from the
API. I have 2 use cases. One, I need to capture all of the media data for
an account, and two, I need to be able to privately annotate that data. There
is some nuance in this (multiple HTTP queries, for example), but ignoring
that, and assuming I have obtained all of the data surrounding an account's
photos, here is how I was thinking of storing that information for use case
1.

ColumnFamily: InstagramPhotos

Row Key: 

Columns:
Column Name: 
Column Value: JSON representing the data for the individual photo (filter,
comments, likes etc, not the binary photo data).



So the idea would be to keep adding columns to the row that contain the
serialized data (in JSON), with timestamps as the column names. Timestamps
as the column names, I figure, should help to perform range queries,
where the 1st column inserted has the earliest timestamp and the last
column inserted the most recent. I could probably also use TimeUUIDs here,
since I will have things ordered prior to inserting.
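
A minimal sketch of that write path, assuming the Hector client (the
keyspace, column family, and method names here are illustrative):

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class PhotoStore {
    // One row per account; column name = photo timestamp, value = JSON blob.
    static void storePhoto(Keyspace keyspace, String accountId,
                           long createdAt, String photoJson) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert(accountId, "InstagramPhotos",
            HFactory.createColumn(createdAt, photoJson,
                                  LongSerializer.get(), StringSerializer.get()));
    }
}

Reads for a time window would then be a slice query on that row with
start/finish timestamps.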

The question here: does this approach make sense? Is it common to store
JSON in columns like this? I know there are super columns as well, so I
could use those instead of JSON, I suppose. The extra level of indexing
would probably be useful to query specific photos for use case 2. I have
heard it is best to avoid the use of super columns for now. I have no
information to back that claim up other than some time spent in the IRC, so
feel free to debunk that statement if it is false.

So that is use case one, use case two covers the private annotations.

I figured here:

ColumnFamily: InstagramAnnotations
row key:  Canonical Media Id

Column Name: TimeUUID
Column Value: JSON representing an annotation/internal comment
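
A corresponding sketch for the annotations, again assuming Hector
(TimeUUIDUtils is Hector's helper for time-based UUIDs; the rest of the
names are illustrative):

import java.util.UUID;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.serializers.UUIDSerializer;
import me.prettyprint.cassandra.utils.TimeUUIDUtils;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class AnnotationStore {
    // One row per photo; TimeUUID column names keep annotations time-ordered.
    static void annotate(Keyspace keyspace, String mediaId, String annotationJson) {
        UUID now = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert(mediaId, "InstagramAnnotations",
            HFactory.createColumn(now, annotationJson,
                                  UUIDSerializer.get(), StringSerializer.get()));
    }
}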


Writing out the above, I can actually see where I might need to tighten some
things up around how I store the photos. I am clearly missing an obvious
connection between the InstagramPhotos and the InstagramAnnotations; maybe
super columns would help with the photos instead of JSON? Otherwise I would
need to build an index row where I tie the canonical photo id to a
timestamp (column name) in InstagramPhotos. I could also try to figure
out how to make a TimeUUID of my own that can double as the media's
canonical id, or look further at Instagram's canonical id for photos and see
if it already counts up, in which case I could use that in place of a
timestamp.

Anyway, I figured I would see if anyone might help flush out other
potential pitfalls in the above. I am definitely new to Cassandra, and I am
using this project as a way to learn more about assembling systems
with it.


Re: Adding New Node to an Existing Cluster

2012-12-17 Thread aaron morton
Use nodetool move and change the tokens one at a time to the values suggested 
by Tomas. 

for background 
http://www.datastax.com/docs/1.1/references/nodetool#nodetool-move

Each move will take some time; you will see the node state change from UP to 
MOVING in the output from nodetool ring. When it's back to UP, the move is done. 

*Note:* before making the change you should increase your RF to 3 and run a 
repair with nodetool (using the -pr option). This will make sure your data is 
fully distributed to all nodes. You should also make your code use the QUORUM 
CL.
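
A rough sequence of the commands involved (host names and move order are
illustrative; run each move only after the previous node is back to
Up/Normal in nodetool ring):

nodetool -h XX.XX.XX.C move 0
nodetool -h XX.XX.XX.B move 56713727820156407428984779325531226112
nodetool -h XX.XX.XX.A move 113427455640312814857969558651062452224

# after raising RF to 3, on each node:
nodetool -h <node> repair -pr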

Hope that helps. 

Aaron

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/12/2012, at 11:53 PM, Adeel Akbar  wrote:

> Dear Tomas,
> 
> I didn't use token values in the configuration. The new node automatically 
> picked a token when it joined the cluster. Previously nodes A and B served 
> the data, and now I want all three nodes (A+B+C) to serve it.
> 
> Thanks & Regards
> 
> Adeel Akbar
> 
> On 12/17/2012 3:47 PM, Tomas Nunez wrote:
>> These tokens don't seem balanced at all. You should use "nodetool move" to 
>> move those tokens to some balanced values, for instance:
>> 
>> C: 0
>> B: 56713727820156407428984779325531226112
>> A: 113427455640312814857969558651062452224
>> 
>> 
>> 
>> 
>> 2012/12/17 Adeel Akbar 
>> Hi,
>> 
>> I am using a Cassandra 1.1.4 cluster with two nodes and a replication
>> factor of 2. I added one new node to the existing cluster following the
>> instructions at http://www.datastax.com/docs/1.1/operations/cluster_management.
>> The new node is only showing 285 MB of data instead of 250+ GB. Please let
>> me know whether I need to run any command to balance the data across all
>> three nodes.
>> 
>> # /opt/apache-cassandra-1.1.4/bin/nodetool -h localhost ring
>> Address     DC   Rack  Status  State   Load       Effective-Ownership  Token
>>                                                                        111379633042792501120498209932601771854
>> XX.XX.XX.C  DC1  RAC1  Up      Normal  286.58 MB  55.76%               16631224681855479515247230241845664688
>> XX.XX.XX.B  DC1  RAC1  Up      Normal  278.23 GB  88.55%               91902851206288351623775585543017122534
>> XX.XX.XX.A  DC1  RAC1  Up      Normal  275.85 GB  55.69%               111379633042792501120498209932601771854
>> 
>> Regards,
>> 
>> Adeel Akbar
>> 
>> 
>> 
>> -- 
>> 
>> www.groupalia.com
>> Tomàs Núñez
>> IT-Sysprod
>> Tel. + 34 93 159 31 00 
>> Fax. + 34 93 396 18 52
>> Llull, 95-97, 2º planta, 08005 Barcelona
>> Skype: tomas.nunez.groupalia
>> tomas.nu...@groupalia.com
> 



Re: Read operations resulting in a write?

2012-12-17 Thread aaron morton
> Could it be an issue that this optimization does not really take effect until 
> the memtable with the hoisted data is flushed? 
No.
The read path in collectTimeOrderedData() reads from the memtable first. It 
then reads the SSTable meta data (maxTimestamp) and checks if the candidate 
columns are both 1) all the columns in the query and 2) the only possible values.

So immediately after the columns are hoisted, a read will touch the memtable 
and the sstable meta data (always in memory) for the most recent sstable.

> In my simple example below, the same row is updated and multiple selects of 
> the same row will result in multiple writes to the memtable.
With some overlapping reads this would be possible; once one of them has 
completed, subsequent operations would read from the memtable only. 
 
> It seems it maybe possible (although unlikely) where, if you go from a 
> write-mostly to a read-mostly scenario, you could get into a state where you 
> are stuck rewriting to the same memtable, and the memtable is not flushed 
> because it absorbs the over-writes.
The memtable would still be flushed due to other CFs generating memory pressure 
and/or the commit log checkpointing. 
Also, reads go to the memtable first. 

> However, are there any hidden gotcha's there because this optimization is not 
> occurring?  
Not that I can think of; the optimisation is not occurring because all the 
work collectTimeOrderedData() does to read from disk has already been done.
You're good to go, assuming you have narrow rows or a good feel for how big 
they will get. 

> https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction 
> is behind" problem.  Any history on that?I couldn't find too much 
> information on it.
I assume it means cases where minor compaction cannot keep up, e.g. it has been 
throttled down, or a concurrent repair / upgradesstables is slowing things down. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/12/2012, at 4:01 AM, Mike  wrote:

> Thank you Aaron, this was very helpful.
> 
> Could it be an issue that this optimization does not really take effect until 
> the memtable with the hoisted data is flushed?  In my simple example below, 
> the same row is updated and multiple selects of the same row will result in 
> multiple writes to the memtable.  It seems it may be possible (although 
> unlikely) that, if you go from a write-mostly to a read-mostly scenario, you 
> could get into a state where you are stuck rewriting to the same memtable, 
> and the memtable is not flushed because it absorbs the over-writes.  I can 
> foresee this especially if you are reading the same rows repeatedly.
> 
> I also noticed from the codepaths that if row caching is enabled, this 
> optimization will not occur.  We made some changes this weekend to make this 
> column family more suitable to row caching and enabled it with a 
> small cache.  Our initial results are that it seems to have corrected the 
> write counts and has increased performance quite a bit.  However, are there 
> any hidden gotchas because this optimization is not occurring?  
> https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction 
> is behind" problem.  Any history on that?  I couldn't find too much 
> information on it.
> 
> Thanks,
> -Mike
> 
> On 12/16/2012 8:41 PM, aaron morton wrote:
>> 
>>> 1) Am I reading things correctly?
>> Yes. 
>> If you do a read/slice by name and more than min-compaction-threshold 
>> SSTables were read, the data is re-written so that the next read uses fewer SSTables.
>> 
>>> 2) What is really happening here?  Essentially minor compactions can occur 
>>> between 4 and 32 memtable flushes.  Looking through the code, this seems to 
>>> only affect a couple of types of select statements (when selecting a specific 
>>> column on a specific key being one of them). During the time between these 
>>> two values, every "select" statement will perform a write.
>> Yup, only for reading a row where the column names are specified.
>> Remember minor compaction when using SizeTiered compaction (the default) 
>> works on buckets of the same size. 
>> 
>> Imagine a row that had been around for a while and had fragments in more 
>> than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd tier 
>> and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads. If 
>> this row is read it will get hoisted back up. 
>> 
>> But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, 
>> it will not be hoisted. 
>> 
>> There are a few short circuits in the SliceByName read path. One of them is 
>> to end the search when we know that no other SSTables contain 
>> columns that should be considered. So if the 4 columns you read frequently 
>> are hoisted into the 1st bucket your reads will get handled by that one 
>> bucket. 
>> 
>> It's not every select. Just those that touched more than 
>> min-compaction-threshold SSTables.

Re: Read operations resulting in a write?

2012-12-17 Thread Edward Capriolo
Is there a way to turn this on and off through configuration? I am not
necessarily sure I would want this feature. Also, it is confusing that these
writes show up in JMX and look like user-generated write operations.
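
For what it's worth, one way to see the effect without a JMX client is to
watch the column family's write count via nodetool while issuing only reads:

nodetool -h localhost cfstats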


On Mon, Dec 17, 2012 at 10:01 AM, Mike  wrote:

>  Thank you Aaron, this was very helpful.
>
> Could it be an issue that this optimization does not really take effect
> until the memtable with the hoisted data is flushed?  In my simple example
> below, the same row is updated and multiple selects of the same row will
> result in multiple writes to the memtable.  It seems it may be possible
> (although unlikely) that, if you go from a write-mostly to a read-mostly
> scenario, you could get into a state where you are stuck rewriting to the
> same memtable, and the memtable is not flushed because it absorbs the
> over-writes.  I can foresee this especially if you are reading the same
> rows repeatedly.
>
> I also noticed from the codepaths that if row caching is enabled, this
> optimization will not occur.  We made some changes this weekend to make
> this column family more suitable to row caching and enabled it
> with a small cache.  Our initial results are that it seems to have corrected
> the write counts and has increased performance quite a bit.  However, are
> there any hidden gotchas because this optimization is not
> occurring?  https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions
> a "compaction is behind" problem.  Any history on that?  I couldn't find
> too much information on it.
>
> Thanks,
> -Mike
>
> On 12/16/2012 8:41 PM, aaron morton wrote:
>
>
>   1) Am I reading things correctly?
>
> Yes.
> If you do a read/slice by name and more than min-compaction-threshold
> SSTables were read, the data is re-written so that the next read uses fewer SSTables.
>
>  2) What is really happening here?  Essentially minor compactions can
> occur between 4 and 32 memtable flushes.  Looking through the code, this
> seems to only affect a couple of types of select statements (when selecting a
> specific column on a specific key being one of them). During the time
> between these two values, every "select" statement will perform a write.
>
> Yup, only for reading a row where the column names are specified.
> Remember minor compaction when using SizeTiered compaction (the default)
> works on buckets of the same size.
>
>  Imagine a row that had been around for a while and had fragments in more
> than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd
> tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads.
> If this row is read it will get hoisted back up.
>
>  But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st
> tier, it will not be hoisted.
>
>  There are a few short circuits in the SliceByName read path. One of them
> is to end the search when we know that no other SSTables contain columns
> that should be considered. So if the 4 columns you read frequently are
> hoisted into the 1st bucket your reads will get handled by that one bucket.
>
>  It's not every select. Just those that touched more than
> min-compaction-threshold SSTables.
>
>
>  3) Is this desired behavior?  Is there something else I should be
> looking at that could be causing this behavior?
>
> Yes.
> https://issues.apache.org/jira/browse/CASSANDRA-2503
>
>  Cheers
>
>
>-
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
>  @aaronmorton
> http://www.thelastpickle.com
>
>  On 15/12/2012, at 12:58 PM, Michael Theroux  wrote:
>
>  Hello,
>
>  We have an unusual situation that I believe I've reproduced, at least
> temporarily, in a test environment.  I also think I see where this issue is
> occurring in the code.
>
>  We have a specific column family that is under heavy read and write load
> on a nightly basis.   For the purposes of this description, I'll refer to
> this column family as "Bob".  During this nightly processing, sometimes Bob
> is under very heavy write load, other times it is under very heavy read load.
>
>  The application is such that when something is written to Bob, a write
> is made to one of two other tables.  We've witnessed a situation where the
> write count on Bob far outstrips the write count on either of the other
> tables, by a factor of 3->10.  This is based on the WriteCount available on
> the column family JMX MBean.  We have not been able to find where in our
> code this is happening, and we have gone as far as tracing our CQL calls to
> determine that the relationship between Bob and the other tables is what
> we expect.
>
>  I brought up a test node to experiment, and see a situation where, when
> a "select" statement is executed, a write will occur.
>
>  In my test, I perform the following (switching between nodetool and
> cqlsh):
>
>  update bob set 'about'='coworker' where key='';
> nodetool flush
>  update bob set 'about'='coworker' where key='';
> nodetool flush
>  update bob set 'about'='coworker' where key='';
> nodetool