Re: Reduce Cassandra GC

2013-06-17 Thread Joel Samuelsson
> If you are talking about 1.2.x then I also have memory problems on the
idle cluster: java memory constantly grows slowly up to the limit, then
spends a long time in GC. I have never seen such behaviour on 1.0.x and
1.1.x, where on an idle cluster the java memory stays at the same value.

No I am running Cassandra 1.1.8.

> Can you paste your GC config?

I believe the relevant configs are these:
# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

I haven't changed anything in the environment config up until now.

> Also can you take a heap dump at 2 diff points so that we can compare it?

I can't access the machine at all during the stop-the-world freezes. Was
that what you wanted me to try?

> Uncomment the following in "cassandra-env.sh".
Done. Will post results as soon as I get a new stop-the-world gc.
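For anyone following along, the block being uncommented is presumably the GC
logging section of cassandra-env.sh; on 1.1.x it looks roughly like the lines
below once enabled (treat the log path as a placeholder for wherever you point
-Xloggc):

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

With these in place, stop-the-world pauses show up in the GC log with their
duration and the state of each generation before and after the collection.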

> If you are unable to find a JIRA, file one

Unless this turns out to be a problem on my end, I will.


Re: Reduce Cassandra GC

2013-06-17 Thread Joel Samuelsson
Just got a very long GC again. What am I to look for in the logging I just
enabled?


2013/6/17 Joel Samuelsson 

> > If you are talking about 1.2.x then I also have memory problems on the
> idle cluster: java memory constantly grows slowly up to the limit, then
> spends a long time in GC. I have never seen such behaviour on 1.0.x and
> 1.1.x, where on an idle cluster the java memory stays at the same value.
>
> No I am running Cassandra 1.1.8.
>
> > Can you paste your GC config?
>
> I believe the relevant configs are these:
> # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>
> I haven't changed anything in the environment config up until now.
>
> > Also can you take a heap dump at 2 diff points so that we can compare
> it?
>
> I can't access the machine at all during the stop-the-world freezes. Was
> that what you wanted me to try?
>
> > Uncomment the following in "cassandra-env.sh".
> Done. Will post results as soon as I get a new stop-the-world gc.
>
> > If you are unable to find a JIRA, file one
>
> Unless this turns out to be a problem on my end, I will.
>


Re: Changing replication factor

2013-06-17 Thread Vegard Berget
Hi,
Thank you for the information. I have increased the rf, and I think the
increase we have seen in cpu load etc. is due to the counter cf's, which are
almost write-only (reads a few times a day). The load increase is noticeable,
but no problem. Repair went fine.

But I noticed that when I increased rf for a counter column family and (for
some completely different reasons) took one node down, and after that ran
repair, I would get multiple lines in system.log:

"invalid counter shard detected; (X, Y, Z) and (X, Y, Z2) differ only in
count; will pick highest to self-heal; this indicates a bug or corruption
generated a bad counter shard"

I guess this is because, while the node was down, the counters got out of
sync and Cassandra needs to just pick the highest? In my case this will be
(more or less) correct, since the sync problem happened because of a downed
node, which means _all_ increments happened on the other node and that node
will have the correct number? I am just curious, as some minor errors in the
counters would be no problem for us.
.vegard,
- Original Message -
From: user@cassandra.apache.org
To: "Vegard Berget"
Cc:
Sent: Fri, 14 Jun 2013 17:20:26 -0700
Subject: Re: Changing replication factor

 On Mon, Jun 10, 2013 at 6:04 AM, Vegard Berget  wrote:
 > If one increases the replication factor of a keyspace and then do a repair,
 > how will this affect the performance of the affected nodes? Could we risk
 > the nodes being (more or less) unresponsive while repair is going on?

 Repair is a relatively heavyweight activity (the heaviest a cassandra
 node can do!) which requires significant headroom in terms of CPU,
 heap memory and disk space. It is possible that nodes could become
 unavailable transiently during the repair, but unless they are already
 very busy they should not become completely unresponsive. For one
 thing, both compaction and streaming respect throttles which are
 designed to minimize the impact of the streaming/compaction workload
 resulting from repair.
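 As a rough sketch of the sequence being discussed (keyspace name, host and
 throttle value are placeholders, and the cassandra-cli syntax is from memory,
 so verify it against your version):

 # 1) Raise the replication factor, e.g. from cassandra-cli:
 #      update keyspace MyKeyspace with strategy_options = {replication_factor:3};
 # 2) Optionally lower the compaction throttle first (MB/s; 0 = unthrottled):
 nodetool -h 127.0.0.1 setcompactionthroughput 16
 # 3) Repair each node so the new replicas actually receive their data:
 nodetool -h 127.0.0.1 repair MyKeyspace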

 > The nodes I am speaking of contains ~100gb of data.

 This is a relatively small amount of data per node, which makes the
 impact of Repair less severe.

 > Also, some of the keyspaces I am considering increasing the replication
 > factor for contain Counter Column Families (currently rf:1). I think I have
 > read that adding replication to counter cfs will affect performance
 > negatively, is this correct?

 Per Sylvain (one of the primary authors of the Counters codebase) [1]:

 "
 For counters, it's a little bit different. At RF=3, for each insert,
 one node is doing a write *and* a read, while the two other nodes are
 only doing a write. So given that the time a read takes is non
 negligible, you should see some improvement at RF=3 compared to RF=1
 because each node gets 1/3 of the reads (involved in the counter
 write) it would get if it was the only replica. Now if the write time
 were negligible compared to the read time, then yes you would see
 roughly a 3x increase. But while writes are still faster than reads in
 Cassandra, reads are now fairly fast too (but all this depends on
 other factors like how much the caches help, etc...), so it will
 likely be less than a 3x increase. Should be noticeable though.
 "

 I interpret the above to mean that RF=3 is actually slightly *faster*
 for Counters than RF=1.

 =Rob

 [1]
http://mail-archives.apache.org/mod_mbox/cassandra-user/201110.mbox/%3ccakkz8q0thzzsbu2370mx6jpeec3lh17pjmv1kojggauajup...@mail.gmail.com%3E



Re: Large number of files for Leveled Compaction

2013-06-17 Thread Hiller, Dean
My bet is 5MB is the low end since many people go with the default. We upped
it to 10MB because, at that time, no one knew what a good size was and the
default was only 5MB.

Dean

From: Franc Carter <franc.car...@sirca.org.au>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Sunday, June 16, 2013 11:37 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Wei Zhu <wz1...@yahoo.com>
Subject: Re: Large number of files for Leveled Compaction

On Mon, Jun 17, 2013 at 3:28 PM, Wei Zhu <wz1...@yahoo.com> wrote:
default value of 5MB is way too small in practice. Too many files in one 
directory is not a good thing. It's not clear what should be a good number. I 
have heard people are using 50MB, 75MB, even 100MB. Do your own test to find a 
"right" number.

Interesting - 50MB is the low end of what people are using - 5MB is a lot 
lower. I'll try a 50MB setting.

cheers


-Wei


From: "Franc Carter" 
mailto:franc.car...@sirca.org.au>>
To: user@cassandra.apache.org
Sent: Sunday, June 16, 2013 10:15:22 PM
Subject: Re: Large number of files for Leveled Compaction




On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali 
mailto:mainalima...@gmail.com>> wrote:
Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller 
sstables into larger ones. With LeveledCompaction, the sstables are always 
of a fixed size, but they are grouped into different levels.

You can refer to this page 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra 
for details of how LeveledCompaction works.


Yes, but it seems I've misinterpreted that page ;-(

I took this paragraph

In figure 3, new sstables are added to the first level, L0, and immediately 
compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are 
promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted 
with the sstables in L2 with which they overlap. As more data is added, leveled 
compaction results in a situation like the one shown in figure 4.

to mean that once a level fills up it gets compacted into a higher level

cheers

Cheers
Manoj


On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter <franc.car...@sirca.org.au> wrote:
On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali <mainalima...@gmail.com> wrote:
With LeveledCompaction, each sstable size is fixed and is defined by 
sstable_size_in_mb in the compaction configuration of the CF definition; the 
default value is 5MB. In your case, you may not have defined your own value, 
which is why each of your sstables is 5MB. And if your dataset is huge, you 
will see a lot of sstables.


Ok, it seems I have (at least) an incomplete understanding. I realise that 
the minimum size is 5MB, but I thought compaction would merge these into a 
smaller number of larger sstables?

thanks


Cheers

Manoj


On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter <franc.car...@sirca.org.au> wrote:

Hi,

We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it 
may be a win for us.

The first step of testing was to push a fairly large slab of data into the 
Column Family - we did this much faster (> x100) than we would in a production 
environment. This has left the Column Family with about 140,000 files in the 
Column Family directory which seems way too high. On two of the nodes the 
CompactionStats show 2 outstanding tasks and on a third node there are over 
13,000 outstanding tasks. However from looking at the log activity it looks 
like compaction has finished on all nodes.

Is this number of files expected/normal ?

cheers

--

Franc Carter | Systems architect | Sirca Ltd

franc.car...@sirca.org.au | www.sirca.org.au
Tel: +61 2 8355 2514
Level 4, 55 Harrington St, The Rocks NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215


nodetool ring showing different 'Load' size

2013-06-17 Thread Rodrigo Felix
Hi,

   I've been running a benchmark on Cassandra and I'm facing a problem
regarding the size of the database.
   I performed a load phase and then, when running nodetool ring, I got the
following output:

ubuntu@domU-12-31-39-0E-11-F1:~/cassandra$ bin/nodetool ring
Address         DC          Rack   Status  State    Load      Effective-Ownership  Token
                                                                                    85070591730234615865843651857942052864
10.192.18.3     datacenter1 rack1  Up      Normal   2.07 GB   50.00%               0
10.85.135.169   datacenter1 rack1  Up      Normal   2.09 GB   50.00%               85070591730234615865843651857942052864

   After that I executed, for about one hour, a workload with scan and
insert queries. Then, after the workload finished, I ran nodetool ring
again and got the following:

ubuntu@domU-12-31-39-0E-11-F1:~/cassandra$ bin/nodetool ring
Address         DC          Rack   Status  State    Load      Effective-Ownership  Token
                                                                                    85070591730234615865843651857942052864
10.192.18.3     datacenter1 rack1  Up      Normal   1.07 GB   50.00%               0
10.85.135.169   datacenter1 rack1  Up      Normal   2.15 GB   50.00%               85070591730234615865843651857942052864

   Any idea why a node had its size reduced if no record was removed? No
machine was added or removed during this workload.
   Is this related to any kind of compression? If yes, is there a command
to confirm that?
   I also faced a problem where a node had its size increased from about
2gb to about 4gb. In that scenario, I both added and removed nodes
during the workload depending on the load (CPU).
   Thanks in advance for any help.


Att.

*Rodrigo Felix de Almeida*
LSBD - Universidade Federal do Ceará
Project Manager
MBA, CSM, CSPO, SCJP


State of Cassandra-Shuffle (1.2.x)

2013-06-17 Thread Ben Boule
A bit of background:

We are in Beta, we have a very small (2 node) cluster that we created with 
1.2.1.  Being new to this we did not enable vnodes, and we got bit hard by the 
default token generation in production after setting up lots of development & 
QA clusters without running into the problem.   We ended up with like 97.5% of 
the tokens belonging to one of the two nodes.   The good thing is even one 
Cassandra node is doing OK right now with our load.   The bad thing of course 
is we still would rather it be balanced.   There is only about 120GB of data.

We would like to upgrade this cluster to vnodes. We first tried doing this on 
1.2.1; it did not work due to the bug where the shuffle job inserted a 
corrupted row into the system.range_xfers column family. Last week I talked 
to several people at the summit and it was recommended we try this with 1.2.5.

I have a test cluster I am trying to run this procedure on. I set it up with 1 
token per node, then upgraded it to vnodes, then upgraded it to 1.2.5 with no 
problems on Friday, and let it run over the weekend. All appeared to be well 
when I left: there were something like 500 total relocations generated, it had 
chugged through ~100 of them after an hour or so, and it looked like it was 
heading towards being balanced.

@ip-10-10-1-160:/var/lib/cassandra/data/Keyspace1/Standard1/snapshots# 
nodetool status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load   Tokens  Owns   Host ID  
 Rack
UN  10.10.1.161  1.02 GB254 66.8%  6d500bc6-95fb-47a3-afb5-c283c4f3de03 
 rack1
UN  10.10.1.160  1.1 GB 258 33.2%  186d99b8-9fde-4e50-959a-6fba6098fba6 
 rack1

When I came in to work today (Monday), there were 189 relocations to go, and 
this is what the status looks like.

@ip-10-10-1-160:/var/lib/cassandra/data/Keyspace1/Standard1/snapshots# 
nodetool status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load   Tokens  Owns   Host ID  
 Rack
UN  10.10.1.161  48.11 GB   231 38.7%  6d500bc6-95fb-47a3-afb5-c283c4f3de03 
 rack1
UN  10.10.1.160  34.5 GB281 61.3%  186d99b8-9fde-4e50-959a-6fba6098fba6 
 rack1

An hour later and now it looks like this:

-@ip-10-10-1-160:/tmp# nodetool status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load   Tokens  Owns   Host ID  
 Rack
UN  10.10.1.161  11.61 GB   231 38.7%  6d500bc6-95fb-47a3-afb5-c283c4f3de03 
 rack1
UN  10.10.1.160  931.45 MB  281 61.3%  186d99b8-9fde-4e50-959a-6fba6098fba6 
 rack1

I did notice that it had fallen behind on compaction while this was running.

-@ip-10-10-1-161:~$ nodetool compactionstats
pending tasks: 6
  compaction typekeyspace   column family   completed   
total  unit  progress
   Compaction   Keyspace1   Standard1  1838641124  
5133428315 bytes35.82%
   Compaction   Keyspace1   Standard1  2255463423  
5110283630 bytes44.14%
Active compaction remaining time :   0h06m06s

The reduction in disk space did seem to correspond with about half of the 
compaction jobs finishing.   It seems to bounce up and down as it runs, 
consuming huge amounts of space and then freeing it up.

My question is what can we expect out of this job?  Should it really be 
working?   Do we need to expect it to waste 70-100x disk space while it runs?   
Are there compaction options we can set ahead of time to minimize the penalty 
here?  What is the expected extra space consumed while it runs, what is the 
expected extra space consumed when it is done?  Note that in my test cluster, I 
used a keyspace created by cassandra-stress, it uses the default compaction 
settings, which is SizeTiered and whatever the default thresholds are.   In our 
real cluster, we did configure compaction.

Our original plan when the job didn't work against 1.2.1 was to bring up a new 
cluster along side the old one, that was pre-configured for vNodes, and then 
migrate our data out of the old cluster into the new cluster.  Obviously this 
requires us to write our own software to do the migration.   We are going to 
size up the new cluster as well and update the schema, so it's not a total 
waste, but we would have liked to be able to balance the load on the original 
cluster in the mean time.

Any advice?  We are planning to migrate to 2.0 later this summer but probably 
don't want to build it from the beta source ourself right now.

Thank you,
Ben Boule

Re: nodetool ring showing different 'Load' size

2013-06-17 Thread Eric Stevens
Load is the size of the storage on disk, as I understand it.  This can
fluctuate during normal usage even if records are not being added or
removed; a node's load may drop once a compaction completes, for example.
During compaction itself, especially if you use the Size Tiered Compaction
strategy (the default), load may temporarily double for a column family.
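
If you want to confirm that compaction is what is moving the number around,
a couple of read-only checks (the host is a placeholder, and the field names
are from the 1.2-era cfstats output as I remember it):

# any compactions currently running on the node whose Load changed?
nodetool -h 10.192.18.3 compactionstats

# per-CF disk usage; compare "Space used (live)" with "Space used (total)"
nodetool -h 10.192.18.3 cfstats | grep -E "Column Family:|Space used"

For compression, the CF definition (e.g. "describe" in cassandra-cli or cqlsh)
shows whether sstable_compression is enabled for that column family.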


On Mon, Jun 17, 2013 at 11:33 AM, Rodrigo Felix <
rodrigofelixdealme...@gmail.com> wrote:

> Hi,
>
>I've been running a benchmark on Cassandra and I'm facing a problem
> regarding to the size of the database.
>I performed a load phase and then, when running nodetool ring, I got
> the following output:
>
> *ubuntu@domU-12-31-39-0E-11-F1:~/cassandra$ bin/nodetool ring *
> *Address DC  RackStatus State   Load
>  Effective-Ownership Token   *
> *
>85070591730234615865843651857942052864  *
> *10.192.18.3 datacenter1 rack1   Up Normal  2.07 GB
> 50.00%  0   *
> *10.85.135.169   datacenter1 rack1   Up Normal  2.09 GB
> 50.00%  85070591730234615865843651857942052864*
>
>After that I executed, for about one hour, a workload with scan and
> insert queries. Then, after finishing the workload execution, I run again
> nodetool ring and got the following:
>
> *ubuntu@domU-12-31-39-0E-11-F1:~/cassandra$ bin/nodetool ring *
> *Address DC  RackStatus State   Load
>  Effective-Ownership Token   *
> *
>85070591730234615865843651857942052864  *
> *10.192.18.3 datacenter1 rack1   Up Normal  1.07 GB
> 50.00%  0   *
> *10.85.135.169   datacenter1 rack1   Up Normal  2.15 GB
> 50.00%  85070591730234615865843651857942052864*
>
>Any idea why a node had its size reduced if no record was removed? No
> machine or added or removed during this workload.
>Is this related to any kind of compression? If yes, is there a command
> to confirm that?
>I also faced a problem where a node has its size increased from about
> 2gb to about 4gb. In this last scenario, I both added and removed nodes
> during the workload depending on the load (CPU).
>Thanks in advance for any help.
>
>
> Att.
>
> *Rodrigo Felix de Almeida*
> LSBD - Universidade Federal do Ceará
> Project Manager
> MBA, CSM, CSPO, SCJP
>


Re: Large number of files for Leveled Compaction

2013-06-17 Thread Eric Stevens
At the DataStax Cassandra Summit 2013 last week, Al Tobey from Ooyala
recommended sstable_size_in_mb be set to 256MB unless you have a fairly
small data set.  The talk was "Extreme Cassandra Optimization," and it was
superbly informative; I highly recommend it once DataStax gets the videos
online.
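
For reference, a sketch of how the size can be changed on an existing CF --
the keyspace/CF names are placeholders, and the cassandra-cli syntax below is
from memory for 1.1/1.2, so double-check it before running:

# put the statements in a file and feed them to cassandra-cli
#   use MyKeyspace;
#   update column family MyCF
#     with compaction_strategy = 'LeveledCompactionStrategy'
#     and compaction_strategy_options = {sstable_size_in_mb: 160};
cassandra-cli -h 127.0.0.1 -f change_sstable_size.txt

As far as I know, existing SSTables keep their old size until they are
rewritten by compaction, so the file count only comes down gradually.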


On Mon, Jun 17, 2013 at 1:35 AM, Wei Zhu  wrote:

> Correction, the largest I heard is 256MB SSTable size.
>
> --
> *From: *"Wei Zhu" 
> *To: *user@cassandra.apache.org
> *Sent: *Sunday, June 16, 2013 10:28:25 PM
>
> *Subject: *Re: Large number of files for Leveled Compaction
>
> default value of 5MB is way too small in practice. Too many files in one
> directory is not a good thing. It's not clear what should be a good number.
> I have heard people are using 50MB, 75MB, even 100MB. Do your own test to
> find a "right" number.
>
> -Wei
>
> --
> *From: *"Franc Carter" 
> *To: *user@cassandra.apache.org
> *Sent: *Sunday, June 16, 2013 10:15:22 PM
> *Subject: *Re: Large number of files for Leveled Compaction
>
>
>
> On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali wrote:
>
>> Not in the case of LeveledCompaction. Only SizeTieredCompaction merges
>> smaller sstables into large ones. With the LeveledCompaction, the sstables
>> are always of fixed size but they are grouped into different levels.
>>
>> You can refer to this page
>> http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on
>> details of how LeveledCompaction works.
>>
>>
> Yes, but it seems I've misinterpreted that page ;-(
>
> I took this paragraph
>
> In figure 3, new sstables are added to the first level, L0, and
>> immediately compacted with the sstables in L1 (blue). When L1 fills up,
>> extra sstables are promoted to L2 (violet). Subsequent sstables generated
>> in L1 will be compacted with the sstables in L2 with which they overlap. As
>> more data is added, leveled compaction results in a situation like the one
>> shown in figure 4.
>>
>
> to mean that once a level fills up it gets compacted into a higher level
>
> cheers
>
>
>> Cheers
>> Manoj
>>
>>
>> On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter 
>> wrote:
>>
>>> On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali 
>>> wrote:
>>>
 With LeveledCompaction, each sstable size is fixed and is defined by
 sstable_size_in_mb in the compaction configuration of CF definition and
 default value is 5MB. In you case, you may have not defined your own value,
 that is why your each sstable is 5MB. And if you dataset is huge, you will
 see a lot of sstable counts.

>>>
>>>
>>> Ok, seems like I do have (at least) an incomplete understanding. I
>>> realise that the minimum size is 5MB, but I thought compaction would merge
>>> these into a smaller number of larger sstables ?
>>>
>>> thanks
>>>
>>>
 Cheers

 Manoj


 On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter >>> > wrote:

>
> Hi,
>
> We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks
> like it may be a win for us.
>
> The first step of testing was to push a fairly large slab of data into
> the Column Family - we did this much faster (> x100) than we would in a
> production environment. This has left the Column Family with about 140,000
> files in the Column Family directory which seems way too high. On two of
> the nodes the CompactionStats show 2 outstanding tasks and on a third node
> there are over 13,000 outstanding tasks. However from looking at the log
> activity it looks like compaction has finished on all nodes.
>
> Is this number of files expected/normal ?
>
> cheers
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>  
>
> franc.car...@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 8355 2514
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>

>>>
>>>
>>> --
>>>
>>> *Franc Carter* | Systems architect | Sirca Ltd
>>>  
>>>
>>> franc.car...@sirca.org.au | www.sirca.org.au
>>>
>>> Tel: +61 2 8355 2514
>>>
>>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>>
>>> PO Box H58, Australia Square, Sydney NSW 1215
>>>
>>>
>>>
>>
>
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>  
>
> franc.car...@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 8355 2514
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>
>
>


Re: opscentrer is spying

2013-06-17 Thread Robert Coli
On Sun, Jun 16, 2013 at 5:46 PM, Radim Kolar  wrote:
> in case you do not know yet, opscenter is sending certain data about your
> cassandra instalation back to datastax.
>
> This fact is not visibly presented to user, its same spyware crap like
> EHCache.

Could you expand on this? What information do you see being sent, and
how are you seeing it being transmitted to Datastax?

=Rob


Re: State of Cassandra-Shuffle (1.2.x)

2013-06-17 Thread Robert Coli
On Mon, Jun 17, 2013 at 8:37 AM, Ben Boule  wrote:
> We are in Beta, we have a very small (2 node) cluster that we created with
> 1.2.1.

https://issues.apache.org/jira/browse/CASSANDRA-5525

May be relevant?

What RF is this cluster? Given beta and cluster size and data size
this small, I would probably just dump and reload instead of trying to
make shuffle work.

http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra
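
For the dump-and-reload route described in that post, the rough shape is to
snapshot on the old cluster and stream the SSTables into the new one with
sstableloader (addresses and paths are placeholders):

# on each old node: take a snapshot so you copy a consistent set of SSTables
nodetool -h 127.0.0.1 snapshot MyKeyspace

# copy the snapshot files into a directory laid out as <dir>/MyKeyspace/MyCF/
# on a machine that can reach the new cluster, then stream them in:
sstableloader -d 10.0.0.1,10.0.0.2 /path/to/MyKeyspace/MyCF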

> Being new to this we did not enable vnodes, and we got bit hard by
> the default token generation in production after setting up lots of
> development & QA clusters without running into the problem.

It will perhaps be some consolation to hear that this insane
misfeature of automatic token assignment by range bisection is finally
going away in Cassandra 2.0.

> Any advice?  We are planning to migrate to 2.0 later this summer but
> probably don't want to build it from the beta source ourself right now.

Eric Evans gave a talk at the summit during which he attempted to
communicate that people probably shouldn't use shuffle. Given that he
is the one who wrote the shuffle patch, this seems like a meaningful
data point... :)

=Rob


Re: Changing replication factor

2013-06-17 Thread Robert Coli
On Mon, Jun 17, 2013 at 5:33 AM, Vegard  Berget  wrote:
> "invalid counter shard detected; (X, Y, Z) and (X, Y, Z2) differ only in
> count; will pick highest to self-heal; this indicates a bug or corruption
> generated a bad counter shard"

https://issues.apache.org/jira/browse/CASSANDRA-4417
and
https://issues.apache.org/jira/browse/CASSANDRA-4071

tl;dr - nobody fully understands in what case they are created (though
there are a few likely candidates) or how to fix them without
potential loss of counter accuracy, and nobody seems to be working on
a solution at this time.

https://issues.apache.org/jira/browse/CASSANDRA-5026

Reduces the frequency of the log messages.

https://issues.apache.org/jira/browse/CASSANDRA-4775

Is the ticket where Counters 2.0 design is occurring.

=Rob


Re: Pycassa xget not parsing composite column name properly

2013-06-17 Thread Tyler Hobbs
That looks correct, and I just double checked that xget behaves normally
for me for that case.  What does it actually print?  Can you try not
unpacking the tuple in your inner for-loop and print that?

Also, there's a pycassa mailing list (pycassa-disc...@googlegroups.com)
that would be a better location for this conversation.


On Sun, Jun 16, 2013 at 5:52 AM, Oleg Dulin  wrote:

> I have a column family defined as:
>
> create column family LSItemIdsByFieldValueIndex_Integer
>  with column_type = 'Standard'
>  and comparator =
> 'CompositeType(org.apache.cassandra.db.marshal.IntegerType,org.apache.cassandra.db.marshal.UTF8Type)'
>  and default_validation_class = 'UTF8Type'
>  and key_validation_class = 'UTF8Type';
>
> This snippet of code:
>
>    result = searchIndex.get_range(column_count=1)
>    for key, columns in result:
>        print '\t', key
>        indexData = searchIndex[indexCF].xget(key)
>        for name, value in indexData:
>            print name
>
> does not correctly print column name as parsed into a tuple of two parts.
>
> Am I doing something wrong here ?
>
>
>
> --
> Regards,
> Oleg Dulin
> http://www.olegdulin.com
>
>
>


-- 
Tyler Hobbs
DataStax 


Re: Is there anyone who implemented time range partitions with column families?

2013-06-17 Thread Robert Coli
On Wed, May 29, 2013 at 9:33 AM, Hiller, Dean  wrote:
> QUESTION: I am assuming 10 compactions should be enough to put enough load
> on the disk/cpu/ram etc. etc. or do you think I should go with 100CF's.
> 98% of our data is all in this one CF.

Compaction can only really efficiently multi-thread with some
relatively tight relationship to the number of cores available... my
hunch would be that the number of shards you are looking for in
PlayObjectRockin'Mapper (TM) is closer to 10 than 100.

=Rob


Re: Uneven CPU load on a 4 node cluster

2013-06-17 Thread Andreas Wagner

Hi all,

I'm experiencing very similar effects. Did you (or anyone for that 
matter) have/solve this issue?


I have a 3 node cluster with vnodes having the same #tokens (256). 
In fact, all nodes are configured identically and share similar/same 
hardware. Cassandra.yaml settings are fairly standard - nothing fancy.


According to "nodetool status" command everything is perfectly balanced. 
Running "cassandra-stress -d node_ip1,node_ip2,node_ip3" causes a heavy 
load on node_ip1, while node_ip2/3 are almost idle. Data, however, seems 
to be distributed evenly. I even get "UnavailableException" for some 
keys to be inserted on node_ip1.


I also tried a second run with the scheduling set to "roundrobin" and 
made use of the standard throttling option. Unfortunately, nothing changed.


Could someone please provide some pointers and/or insights into what I'm 
doing wrong?


Thanks so much!
Andreas


Re: opscentrer is spying

2013-06-17 Thread Nick Bailey
OpsCenter collects anonymous usage data and reports it back to DataStax.
For example, number of nodes, keyspaces, column families, etc. Stat
reporting isn't required to run OpsCenter however. To turn this feature
off, see the docs here (stat_reporter):

http://www.datastax.com/docs/opscenter/configure/configure_opscenter_adv#stat-reporter-interval
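
If the option names in those docs still match your version, disabling it is a
small config change; the conf path below assumes a package install and the
section/option names are taken from the linked page, so verify both first:

cat >> /etc/opscenter/opscenterd.conf <<'EOF'
[stat_reporter]
interval = 0
EOF
sudo service opscenterd restart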

On Mon, Jun 17, 2013 at 11:44 AM, Robert Coli  wrote:

> On Sun, Jun 16, 2013 at 5:46 PM, Radim Kolar  wrote:
> > in case you do not know yet, opscenter is sending certain data about your
> > cassandra instalation back to datastax.
> >
> > This fact is not visibly presented to user, its same spyware crap like
> > EHCache.
>
> Could you expand on this? What information do you see being sent, and
> how are you seeing it being transmitted to Datastax?
>
> =Rob
>


multi-dc clusters with 'local' ips and no vpn

2013-06-17 Thread Chris Burroughs
Cassandra makes the totally reasonable assumption that the entire
cluster is in one routable address space.  We unfortunately had a
situation where:
 * nodes can talk to each other in the same dc on an internal address,
but not talk to each other over their external 1:1 NAT address.
 * nodes can talk to nodes in the other dc over the external address,
but there is no usable shared internal address space they can talk over

In case anyone else finds themselves in the same situation we have what
we think is a working solution in pre-production.  CASSANDRA-5630
handles the "reconnect trick" to prefer the local ip when in the same
DC.  And some iptables rules allow the local nodes to do the initial
gossiping with each other before that switch.

for each node in the same dc:
    'iptables -t nat -A OUTPUT -j DNAT -p tcp --dst %s --dport 7000 -o eth0 --to-destination %s' % (ext_ip, local_ip)
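
Spelled out as a (hypothetical) shell loop, with placeholder address pairs for
each same-DC peer's external and internal IPs:

while read ext_ip local_ip; do
  iptables -t nat -A OUTPUT -p tcp --dst "$ext_ip" --dport 7000 \
    -o eth0 -j DNAT --to-destination "$local_ip"
done <<'EOF'
203.0.113.11 10.0.0.11
203.0.113.12 10.0.0.12
EOF

Port 7000 here is the storage_port from cassandra.yaml; if you run encrypted
internode traffic on a different port (ssl_storage_port), that would need the
same treatment.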


Re: index_interval

2013-06-17 Thread Robert Coli
On Mon, May 13, 2013 at 9:19 PM, Bryan Talbot  wrote:
> Can the index sample storage be treated more like key cache or row cache
> where the total space used can be limited to something less than all
> available system ram, and space is recycled using an LRU (or configurable)
> algorithm?

Treating it with LRU doesn't seem to make that much sense, but there are
seemingly trivial ways to prune an Index Sample [1], like
delete-every-other-key.

Brief conversation with driftx suggests a lack of enthusiasm for the
scale of win potential from active pruning of the Index Sample,
especially given the relative size of bloom filters compared to the
Index Sample.

However if you are interested in this as a potential improvement, feel
free to file a JIRA! :D

=Rob

[1] New terminology "Partition Summary" per jbellis keynote @ summit2013
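
For anyone landing on this thread looking for the knob itself: in 1.x the
sampling density is set by index_interval in cassandra.yaml (default 128), and
raising it shrinks the in-memory sample at the cost of a bit more disk work per
lookup. A sketch, values illustrative only and the yaml path assuming a package
install:

# cassandra.yaml
#   index_interval: 128    # default: sample every 128th primary index entry
#   index_interval: 512    # sparser sample -> less heap, slightly slower reads
grep index_interval /etc/cassandra/cassandra.yaml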



Custom 1.2 Authentication plugin will not work unless user is in system_auth.users column family

2013-06-17 Thread Bao Le
Hi,
 
  We have a custom authenticator that works well with Cassandra 1.1.5.
When upgrading to C* 1.2.5, authentication failed. It turns out that in 
ClientState.login, a call is made to Auth.isExistingUser(user.getName())
if the AuthenticatedUser is not the anonymous user. This isExistingUser method 
queries system_auth.users and, if it cannot find the name there, throws an 
exception.

  If our authentication model involves exchanging data on the fly and not 
relying on pre-created users, how do we bypass this check? Should we 
add a method on IAuthenticator to specify whether user look-up is needed or not?

Bao

Re: Custom 1.2 Authentication plugin will not work unless user is in system_auth.users column family

2013-06-17 Thread Dave Brosius
It seems to me that isExistingUser should be pushed down to the 
IAuthenticator implementation.


Perhaps you should add a ticket to 
https://issues.apache.org/jira/browse/CASSANDRA


On 06/17/2013 05:12 PM, Bao Le wrote:

Hi,

  We have a custom  authenticator that works well with Cassandra 1.1.5.
When upgrading to C* 1.2.5, authentication failed. Turn out that in 
ClientState.login, we make a call to Auth.isExistingUser(user.getName())
if the AuthenticatedUser is not Anonymous user. This isExistingUser 
method does a query on system_auth.users and if it cannot find the 
name there, throw an exception.


  If our authentication model involves exchanging data on the fly and 
not relying on pre-created users, how do we bypass this check? Should we
add a method on IAuthenticator to specify whether user look-up is 
needed or not?


Bao







Re: Reduce Cassandra GC

2013-06-17 Thread Takenori Sato
Find "promotion failure". Bingo if it happened at the time.

Otherwise, post the relevant portion of the log here. Someone may find a
hint.
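
A couple of greps along those lines (log locations are assumptions; point them
at wherever your GC log and system.log actually live):

# promotion / concurrent mode failures around the time of the pause
grep -n -i -E "promotion fail|concurrent mode failure" /var/log/cassandra/gc*.log

# long pauses as reported by Cassandra itself
grep GCInspector /var/log/cassandra/system.log | tail -20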


On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson
wrote:

> Just got a very long GC again. What am I to look for in the logging I just
> enabled?
>
>
> 2013/6/17 Joel Samuelsson 
>
>> > If you are talking about 1.2.x then I also have memory problems on the
>> idle cluster: java memory constantly grows slowly up to the limit, then
>> spends a long time in GC. I have never seen such behaviour on 1.0.x and
>> 1.1.x, where on an idle cluster the java memory stays at the same value.
>>
>> No I am running Cassandra 1.1.8.
>>
>> > Can you paste your GC config?
>>
>> I believe the relevant configs are these:
>> # GC tuning options
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>
>> I haven't changed anything in the environment config up until now.
>>
>> > Also can you take a heap dump at 2 diff points so that we can compare
>> it?
>>
>> I can't access the machine at all during the stop-the-world freezes. Was
>> that what you wanted me to try?
>>
>> > Uncomment the following in "cassandra-env.sh".
>> Done. Will post results as soon as I get a new stop-the-world gc.
>>
>> > If you are unable to find a JIRA, file one
>>
>> Unless this turns out to be a problem on my end, I will.
>>
>
>


Node failing to decomission (vnodes and 1.2.5)

2013-06-17 Thread David McNelis
I have a node in my ring (1.2.5) that, when it was set up, had the wrong
number of vnodes assigned (double the amount it should have had).

As a result, and because we can't reduce the number of vnodes on a machine
(at least at this point), I need to decommission the node.

The problem is that we've tried running decommission several times.  In
each instance we'll have a lot of streams to other nodes for a period, and
then eventually, netstats will tell us:

nodetool -h localhost netstats
Mode: LEAVING
 Nothing streaming to /10.x.x.1
 Nothing streaming to /10.x.x.2
 Nothing streaming to /10.x.x.3
Not receiving any streams.
Pool NameActive   Pending  Completed
Commandsn/a 0 955991
Responses   n/a 02947860

I also am not seeing anything in the nodes log files to suggest errors
during streaming or leaving.

Then the node will stay in this leaving state for... well, we gave up after
several days of no more activity and retried several times.  Each time we
"gave up" on it, we restarted the service and it was no longer listed as
Leaving, just active.  Even when in a "leaving" state, the size of data on
the node continued to grow.

What suggestions does anyone have on getting this node removed from my ring
so I can rebuild it with the correct number of tokens, before I end up with
a disk space issue from too many vnodes?
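
One fallback that gets suggested for a stuck decommission -- a sketch only, not
something I have verified on 1.2.5, with the host ID taken from nodetool
status -- is to stop the node and remove it from a live peer instead:

# on the stuck node: stop Cassandra
sudo service cassandra stop

# on any live node: remove the dead node by Host ID
nodetool removenode <host-id>

# check progress; force is a last resort if streams hang
nodetool removenode status
nodetool removenode force

# afterwards, repair the remaining nodes to restore replica counts
nodetool repair

The node can then be wiped and re-bootstrapped with the num_tokens you actually
want.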