Nodes Flapping in the Ring

2011-11-10 Thread Jake Maizel
We have a new 6-node cluster running 0.6.13 (due to some client-side issues
we need to be on 0.6.x for the time being) that we are injecting data into,
and we have run into some issues with nodes going down and then coming back
up quickly in the ring.  All nodes are affected and we have ruled out the
network layer.

It happens on all nodes and seems related to GC or memtable flushes.  We had
things stable, but after a series of data migrations we saw some swapping,
so we tuned the max heap down. This helped with the swapping, but the
flapping still persists.

The systems have 6 cores and 24 GB of RAM; max heap is at 12G.  We are using
the parallel GC collector for throughput.
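
One thing worth correlating before anything else: a stop-the-world full
collection on a 12G heap can easily pause the JVM for longer than the gossip
failure detector tolerates, so peers mark the node down and then back up once
gossip resumes, which looks exactly like flapping. A few standard HotSpot
flags alongside the existing -XX:+PrintGCDetails make those pauses visible in
a log you can line up against the down/up events (the log path is just an
example):

  -Xloggc:/var/log/cassandra/gc.log \
  -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime \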

Our run file for starting Cassandra looks like this:

exec 2>&1

ulimit -n 262144

cd /opt/cassandra-0.6.13

exec chpst -u cassandra java \
  -ea \
  -Xms4G \
  -Xmx12G \
  -XX:TargetSurvivorRatio=90 \
  -XX:+PrintGCDetails \
  -XX:+AggressiveOpts \
  -XX:+UseParallelGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=128 \
  -XX:MaxTenuringThreshold=0 \
  -Djava.rmi.server.hostname=10.20.3.155 \
  -Dcom.sun.management.jmxremote.port=8080 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcassandra-foreground=yes \
  -Dstorage-config=/etc/cassandra \
  -cp '/etc/cassandra:/opt/cassandra-0.6.13/lib/*' \
  org.apache.cassandra.thrift.CassandraDaemon <&-

Our storage-conf.xml looks like this for the memory/disk settings:

  mmap
  4
  64

  32
  64

  16

  64
  256
  0.3
  60

  12
  32

  periodic
  1

  864000

  true


Any thoughts on this would be really appreciated.

-- 
Jake Maizel
Head of Network Operations
Soundcloud

Mail & GTalk: j...@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE


Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?

2011-11-10 Thread Benoit Perroud
Thanks for the answer.
I saw the move to sun.misc.
In what sense is allocateDirect broken?

Thanks,

Benoit.


2011/11/9 Jonathan Ellis :
> allocateDirect is broken for this purpose, but we removed the JNA
> dependency using sun.misc.Unsafe instead:
> https://issues.apache.org/jira/browse/CASSANDRA-3271
>
> On Wed, Nov 9, 2011 at 5:54 AM, Benoit Perroud  wrote:
>> Hi,
>>
>> I wonder if you have already discussed about ByteBuffer.allocateDirect
>> alternative to JNA memory allocation ?
>>
>> If so, do someone mind send me a pointer ?
>>
>> Thanks !
>>
>> Benoit.
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
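
For readers wondering what "broken for this purpose" means in practice: a
direct ByteBuffer has no explicit free (its native memory is reclaimed only
when the buffer object itself is garbage collected), while sun.misc.Unsafe
supports deterministic allocate and free, which a cache that constantly
evicts entries needs. A minimal sketch, assuming a Sun/Oracle JDK that
exposes sun.misc.Unsafe; this is not the actual Cassandra implementation:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton reflectively (no JNA required).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(1024);  // explicit allocation
        unsafe.putByte(addr, (byte) 42);
        System.out.println(unsafe.getByte(addr)); // prints 42
        unsafe.freeMemory(addr);                  // deterministic free

        // By contrast, ByteBuffer.allocateDirect(1024) offers no free():
        // the native memory lives until the GC collects the buffer object.
    }
}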



-- 
sent from my Nokia 3210


OOM : key caches, mmap ?

2011-11-10 Thread Alain RODRIGUEZ
Hi,

I faced a similar issue to the one described here:
http://comments.gmane.org/gmane.comp.db.cassandra.user/11184.

I was running Cassandra 1.0.0 on a 3-node cluster of t1.micro instances
from Amazon EC2.

I have no errors in the Cassandra logs, but there is an OOM in
/var/log/kern.log which took one of my nodes down.

After some searching I see two possible reasons for it:

1 - I may be keeping too many keys cached (200,000, which is the default). I
read that this cache is now kept off-heap. My heap is about 300M, the
HEAP_NEWSIZE is about 75M, and the system uses about 150M, all of this out
of 600M.

What happens when I try to cache a row key and my memory is already full?
How should I choose the number of keys cached? I saw that it is hard to
calculate the amount of memory used by this cache. Should I start by
caching 2,000 rows per CF and update my schema, progressively increasing
the number of keys cached, if I see that it works fine?
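
A cheap way to run that experiment: per-CF cache sizes can be changed at
runtime from cassandra-cli (the CF name here is just a placeholder), e.g.

    update column family MyCF with keys_cached=2000;

and the key cache hit rate and size can be watched between steps with
nodetool cfstats.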

2 - I have read a lot about mmap without clearly understanding whether I
should keep it on "auto" or "mmap_index_only". In any case, I didn't find
the "disk_access_mode" option in my cassandra.yaml, which makes me think it
was removed and I have no choice but to keep the "auto" value.

Thank you,

Alain


Re: Data model for counting uniques online

2011-11-10 Thread Silviu Matei
what about keeping a record per device and recording there that you've seen
it, and only incrementing the counters (or a different set of counters)
based on that?
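
A minimal Hector-style sketch of that read-before-increment idea (the column
families "SeenDevices" and "UniqueCounters" are made-up names, and the
check-then-write is not atomic, so two concurrent writers can still
double-count a device; for an "at least X uniques" threshold that is usually
acceptable):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class UniqueHitCounter {
    private static final StringSerializer SS = StringSerializer.get();

    public static void recordHit(Keyspace ks, String counterKey, String deviceId) {
        // Has this device already contributed to this counter?
        HColumn<String, String> seen = HFactory
                .createColumnQuery(ks, SS, SS, SS)
                .setColumnFamily("SeenDevices")
                .setKey(counterKey)
                .setName(deviceId)
                .execute()
                .get();

        if (seen == null) {
            Mutator<String> m = HFactory.createMutator(ks, SS);
            // Mark the device as seen; a TTL on this column would give a
            // natural way to prune old device entries.
            m.addInsertion(counterKey, "SeenDevices",
                    HFactory.createStringColumn(deviceId, ""));
            // Bump the unique-device counter (0.8+ counter API).
            m.addCounter(counterKey, "UniqueCounters",
                    HFactory.createCounterColumn("uniques", 1L));
            m.execute();
        }
    }
}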


On Wed, Nov 9, 2011 at 6:09 PM, Philippe  wrote:

> Hello, I'd like to get some ideas on how to model counting uniques with
> cassandra.
> My use-case is that I have various counters that I increment based on data
> received from multiple devices. I'd like to be able to know if at least X
> unique devices contributed to a counter value.
> I've thought of the idea of building a key per counter and then using
> columns whose names would be the device ids, but it might grow very very
> large and I don't see how to prune it efficiently.
> Any ideas ?
>
> Thanks
>


RE: propertyfilesnitch problem

2011-11-10 Thread Shu Zhang
At first, I was also thinking that one or more nodes in the cluster were broken 
or not responding. But through nodetool cfstats, it looks like all the nodes 
are working as expected, and pings give me the expected inter-node latencies. 
Also, the scores calculated by the dynamic snitch in the steady state seem to 
correspond to how we configured the network topology. 

We're not timing out, but comparing periods when the dynamic snitch has 
appropriate scores and when it doesn't, the latency of LOCAL_QUORUM operations 
gets bumped up from ~10ms to ~100ms. QUORUM operations remain at ~100ms 
regardless of dynamic snitch settings. We maintain a consistent load through 
the tests and there are no feedback mechanisms.

Thanks,
Shu

From: sc...@scode.org [sc...@scode.org] On Behalf Of Peter Schuller 
[peter.schul...@infidyne.com]
Sent: Wednesday, November 09, 2011 11:07 PM
To: user@cassandra.apache.org
Subject: Re: propertyfilesnitch problem

> 2. With the same setup, after each period as defined by 
> dynamic_snitch_reset_interval_in_ms, the LOCAL_QUORUM performance greatly 
> degrades before drastically improving again within a minute.

This part sounds to me like one or more nodes in the cluster are
either broken and not responding at all, or overloaded. Restarts will
tend to temporarily cause additional pressure on nodes (particularly
I/O due to cache eviction issues).

Because the dynamic snitch won't ever know that the node is slow
(after a reset) until requests start actually timing out, it can be up
to rpc_timeout seconds before it gets snitched away. That sounds like
what you're seeing: on every reset, an rpc_timeout period of poor
latency for clients.

Is rpc_timeout 60 seconds?

> 4. With dynamic snitch turned on, QUORUM operations' performance is about the 
> same as using LOCAL_QUORUM when the dynamic snitch is off or the first minute 
> after a restart with the snitch turned on.

This is strange, unless it is coincidental.

Can you be more specific about the performance characteristics you're
seeing when degraded? For example:

* High latency, or timeouts?
* Are you getting Unavailable exceptions?
* Are you maintaining the same overall throughput, or is there a
feedback mechanism such that the request rate decreases when queries
have high latency?
* Which data points are you using to consider something degraded?
What's matching in the QUORUM and LOCAL_QUORUM w/o dynsnitch cases?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


indexes from CassandraSF

2011-11-10 Thread Guy Incognito

hi,

I've been looking at the model below from Ed Anuff's presentation at 
Cassandra SF (http://www.slideshare.net/edanuff/indexing-in-cassandra).  
A couple of questions:


1) Isn't there still a chance that two concurrent updates may end up 
with the index containing two entries for the given user, only one of 
which would match the actual value in the Users CF?


2) What happens if your batch fails partway through the update?  If I 
understand correctly there are no guarantees about ordering when a batch 
is executed, so isn't it possible that e.g. the previous value entries in 
Users_Index_Entries have already been deleted, and then the batch fails 
before the entries in Indexes are deleted, i.e. the mechanism has 'lost' 
those values?  I assume this can be addressed by not deleting the old 
entries until the batch has succeeded (i.e. putting the previous-entry 
deletion into a separate, subsequent batch); this at least lets you retry 
at a later time.


Perhaps I'm missing something?

SELECT {"location"}..{"location", *}
FROM Users_Index_Entries WHERE KEY = ;

BEGIN BATCH

DELETE {"location", ts1}, {"location", ts2}, ...
FROM Users_Index_Entries WHERE KEY = ;

DELETE {, , ts1}, {, , ts2}, ...
FROM Indexes WHERE KEY = "Users_By_Location";

UPDATE Users_Index_Entries SET {"location", ts3} = 
WHERE KEY=;

UPDATE Indexes SET {, , ts3) = null
WHERE KEY = "Users_By_Location";

UPDATE Users SET location = 
WHERE KEY = ;

APPLY BATCH
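
A minimal Hector-style sketch of the two-batch variant from question 2 (CF
names follow the slides; the string-concatenated column names stand in for
real composite columns, and error handling is elided). Batch 2 runs only
after batch 1 succeeds, and can be retried later if it fails:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class TwoBatchIndexUpdate {
    private static final StringSerializer SS = StringSerializer.get();

    public static void updateLocation(Keyspace ks, String userKey,
                                      String oldLocation, long oldTs,
                                      String newLocation, long newTs) {
        // Batch 1: write the new index entries and the new value.
        Mutator<String> add = HFactory.createMutator(ks, SS);
        add.addInsertion(userKey, "Users_Index_Entries",
                HFactory.createStringColumn("location:" + newTs, newLocation));
        add.addInsertion("Users_By_Location", "Indexes",
                HFactory.createStringColumn(newLocation + ":" + userKey + ":" + newTs, ""));
        add.addInsertion(userKey, "Users",
                HFactory.createStringColumn("location", newLocation));
        add.execute();

        // Batch 2: only now delete the superseded entries; if this fails,
        // the index merely holds a stale extra entry that can be cleaned
        // up on a later retry (or lazily on read).
        Mutator<String> del = HFactory.createMutator(ks, SS);
        del.addDeletion(userKey, "Users_Index_Entries", "location:" + oldTs, SS);
        del.addDeletion("Users_By_Location", "Indexes",
                oldLocation + ":" + userKey + ":" + oldTs, SS);
        del.execute();
    }
}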



1.0.2 Assertion Error

2011-11-10 Thread Dan Hendry
Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to
1.0.2 (it's been running fine for a few hours now):

INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237)
Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes,
478913 ops)

ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java
(line 133) Fatal exception in thread Thread[FlushWriter:9,5,main]

java.lang.AssertionError: CF size changed during serialization: was 4
initially but 3 written
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177)
    at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264)
    at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47)
    at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

After taking a quick peek at the code, it looks like the numbers ("4
initially but 3 written") refer to the number of columns (in the memtable?).
Given the byte size of the memtable being flushed, there are *certainly*
more than 4 columns. Besides this error, there are no other unexpected or
unusual log entries and the node seems to be behaving normally.

Thoughts?

Dan Hendry



Bug?

2011-11-10 Thread Ian Danforth
All,

 In 0.8.6 I got myself into a bit of a fix. First I tried to drop a column
family. This failed because I didn't have JNA installed (known and
documented). To fix this I drained the node, stopped the process, installed
JNA, and restarted C*.

Unfortunately this led to an inconsistency in schema versions. Apparently
the schema was successfully modified locally during the drop command even
though the snapshot failed. HH then began to fail in system.log:

ERROR [HintedHandoff:4] 2011-11-10 19:24:35,628
AbstractCassandraDaemon.java (line 139) Fatal exception in thread
Thread[HintedHandoff:4,1,main]
java.lang.RuntimeException: java.lang.RuntimeException: Could not reach
schema agreement with /10.252.229.235 in 6ms
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.lang.RuntimeException: Could not reach schema agreement
with /10.252.229.235 in 6ms
at
org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:293)
at
org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:304)
at
org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89)
at
org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:397)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)



This edge case might be prevented if the schema were updated as the final
step of a DROP.

Thoughts on whether this is worth submitting as a bug?

Ian


RE: 1.0.2 Assertion Error

2011-11-10 Thread Dan Hendry
Just happened again, seems to be with the same column family (at least on a
flusher thread for which the last activity was flushing a memtable for that
CF).

It also looks like MemtablePostFlusher tasks are blocked (and not getting
cleared), as evidenced by tpstats:

Pool Name            Active   Pending   Completed   Blocked   All time blocked
MemtablePostFlusher       1        18          16         0                  0

Dan

From: Dan Hendry [mailto:dan.hendry.j...@gmail.com] 
Sent: November-10-11 14:49
To: 'user@cassandra.apache.org'
Subject: 1.0.2 Assertion Error

Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to
1.0.2 (it's been running fine for a few hours now):

INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237)
Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes,
478913 ops)

ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java
(line 133) Fatal exception in thread Thread[FlushWriter:9,5,main]

java.lang.AssertionError: CF size changed during serialization: was 4
initially but 3 written
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177)
    at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264)
    at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47)
    at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

After taking a quick peek at the code, it looks like the numbers ("4
initially but 3 written") refer to the number of columns (in the memtable?).
Given the byte size of the memtable being flushed, there are *certainly*
more than 4 columns. Besides this error, there are no other unexpected or
unusual log entries and the node seems to be behaving normally.

Thoughts?

Dan Hendry



Re: 1.0.2 Assertion Error

2011-11-10 Thread Sylvain Lebresne
That would be a bug (as any assertion error would be), likely some
race condition. Could you open a ticket?
The fact that this blocks the MemtablePostFlusher is unfortunately
related. Restarting the node would fix it, but we need to make that
more solid too.

--
Sylvain

On Thu, Nov 10, 2011 at 9:04 PM, Dan Hendry  wrote:
> Just happened again, seems to be with the same column family (at least on a
> flusher thread for which the last activity was flushing a memtable for that
> CF).
>
>
>
> It also looks like MemtablePostFlusher tasks blocked (and not getting
> cleared) as evidenced by tpstats:
>
>
>
> Pool Name            Active   Pending   Completed   Blocked   All time blocked
> MemtablePostFlusher       1        18          16         0                  0
>
>
>
> Dan
>
>
>
> From: Dan Hendry [mailto:dan.hendry.j...@gmail.com]
> Sent: November-10-11 14:49
> To: 'user@cassandra.apache.org'
> Subject: 1.0.2 Assertion Error
>
>
>
> Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to
> 1.0.2 (its been running fine for a few hours now):
>
>
>
> INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237)
> Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes,
> 478913 ops)
>
> ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java
> (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main]
>
> java.lang.AssertionError: CF size changed during serialization: was 4
> initially but 3 written
>
>     at
> org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94)
>
>     at
> org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112)
>
>     at
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177)
>
>     at
> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264)
>
>     at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47)
>
>     at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289)
>
>     at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>
>    at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>
>     at java.lang.Thread.run(Thread.java:662)
>
>
>
>
>
> After taking a quick peek at the code, it looks like the numbers (“4
> initially but 3 written”) refer to the number of columns (in the memtable?).
> Given the byte size of the memtable being flushed, there are *certainly*
> more than 4 columns. Besides this error, there are no other unexpected or
> unusual log entries and the node seems to be behaving normally.
>
>
>
> Thoughts?
>
>
>
> Dan Hendry


Data retrieval inconsistent

2011-11-10 Thread Subrahmanya Harve
I am facing an issue in a 0.8.7 cluster -

- I have two clusters in two DCs (rather, one cross-DC cluster) and two
keyspaces. I have configured one keyspace to replicate data to the other DC
and the other keyspace not to replicate over to the other DC.
Basically this is the way I ran the keyspace creation -
create keyspace K1 with
placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and
strategy_options = [{replication_factor:1}];
create keyspace K2 with
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = [{DC1:2, DC2:2}];

I had to do this because I expect that K1 will get a large volume of data
and I do not want it wired over to the other DC.

I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing
an issue where sometimes I get the data and other times I do not see the
data. Does anyone know what could be going on here?

A second, larger question: I am migrating from 0.7.4 to 0.8.7. I can see
that there are large changes in the yaml file, but a specific question
I had was - how do I configure disk_access_mode like it used to be in 0.7.4?

One observation I have made is that some nodes of the cross-DC cluster are
at different system times. This is something to fix, but could this be why
data is sometimes retrieved and other times not? Or is there something else
to it?

Would appreciate a quick response.


Re: Data retrieval inconsistent

2011-11-10 Thread Edward Capriolo
On Thu, Nov 10, 2011 at 3:27 PM, Subrahmanya Harve <
subrahmanyaha...@gmail.com> wrote:

> I am facing an issue in 0.8.7 cluster -
>
> - I have two clusters in two DCs (rather one cross dc cluster) and two
> keyspaces. But i have only configured one keyspace to replicate data to the
> other DC and the other keyspace to not replicate over to the other DC.
> Basically this is the way i ran the keyspace creation  -
>create keyspace K1 with
> placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and
> strategy_options = [{replication_factor:1}];
>create keyspace K2 with
> placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
> and strategy_options = [{DC1:2, DC2:2}];
>
> I had to do this because i expect that K1 will get a large volume of data
> and i do not want this wired over to the other DC.
>
> I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing
> an issue where sometimes i get the data and other times i do not see the
> data. Does anyone know what could be going on here?
>
> A second larger question is  - i am migrating from 0.7.4 to 0.8.7 , i can
> see that there are large changes in the yaml file, but a specific question
> i had was - how do i configure disk_access_mode like it used to be in
> 0.7.4?
>
> One observation i have made is that some nodes of the cross dc cluster are
> at different system times. This is something to fix but could this be why
> data is sometimes retrieved and other times not? Or is there some other
> thing to it?
>
> Would appreciate a quick response.
>

disk_access_mode is no longer required; Cassandra attempts to detect 64-bit
platforms and sets it correctly on its own.

If your systems are not running NTP and as a result their clocks are not in
sync, this can cause issues (with TTL columns). Your client machines must
have correct time as well.


RE: 1.0.2 Assertion Error

2011-11-10 Thread Dan Hendry
https://issues.apache.org/jira/browse/CASSANDRA-3482 

I restarted the node and the problem has cropped up again. 

Is it possible to downgrade back to 0.8? Is there any way to convert 'h'
version SSTables to the old 'g' version? Any other data file changes to be
aware of?

Dan

-Original Message-
From: Sylvain Lebresne [mailto:sylv...@datastax.com] 
Sent: November-10-11 15:23
To: user@cassandra.apache.org
Subject: Re: 1.0.2 Assertion Error

That would be a bug (as any assertion error would be), likely some
race condition.
Could you open a ticket?
The fact that this block the MemtablePostFlusher is unfortunately
related. Restarting the
node would fix but we need to make that more solid too.

--
Sylvain

On Thu, Nov 10, 2011 at 9:04 PM, Dan Hendry  wrote:
> Just happened again, seems to be with the same column family (at least on a
> flusher thread for which the last activity was flushing a memtable for that
> CF).
>
> It also looks like MemtablePostFlusher tasks are blocked (and not getting
> cleared), as evidenced by tpstats:
>
> Pool Name            Active   Pending   Completed   Blocked   All time blocked
> MemtablePostFlusher       1        18          16         0                  0
>
> Dan
>
> From: Dan Hendry [mailto:dan.hendry.j...@gmail.com]
> Sent: November-10-11 14:49
> To: 'user@cassandra.apache.org'
> Subject: 1.0.2 Assertion Error
>
> Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to
> 1.0.2 (it's been running fine for a few hours now):
>
> INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237)
> Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes,
> 478913 ops)
>
> ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java
> (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main]
>
> java.lang.AssertionError: CF size changed during serialization: was 4
> initially but 3 written
>     at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94)
>     at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112)
>     at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177)
>     at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264)
>     at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47)
>     at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
>
> After taking a quick peek at the code, it looks like the numbers ("4
> initially but 3 written") refer to the number of columns (in the memtable?).
> Given the byte size of the memtable being flushed, there are *certainly*
> more than 4 columns. Besides this error, there are no other unexpected or
> unusual log entries and the node seems to be behaving normally.
>
> Thoughts?
>
> Dan Hendry



Re: Data retrieval inconsistent

2011-11-10 Thread Jeremiah Jordan
I am pretty sure that the way you have K1 configured, it will be placed 
across both DCs as if you had one large ring.  If you want it only in DC1 
you need to say DC1:1, DC2:0.
If you are writing and reading at ONE, you are not guaranteed to get the 
data if RF > 1.  If RF = 2 and you write with ONE, your data could be 
written to server 1 and then read from server 2 before it gets over there.


The difference in server times will only really matter for TTLs.  Most 
everything else works off comparing user-supplied timestamps.
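
Concretely, reads are only guaranteed to see the most recent write when
the number of replicas read plus the number written exceeds the
replication factor, i.e. R + W > RF. With RF = 2, ONE/ONE gives
1 + 1 = 2, which is not greater than 2, so stale reads are possible;
QUORUM/QUORUM gives 2 + 2 = 4 > 2 and reads become consistent.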


-Jeremiah

On 11/10/2011 02:27 PM, Subrahmanya Harve wrote:


I am facing an issue in 0.8.7 cluster -

- I have two clusters in two DCs (rather one cross dc cluster) and two 
keyspaces. But i have only configured one keyspace to replicate data 
to the other DC and the other keyspace to not replicate over to the 
other DC. Basically this is the way i ran the keyspace creation  -
create keyspace K1 with 
placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and 
strategy_options = [{replication_factor:1}];
create keyspace K2 with 
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' 
and strategy_options = [{DC1:2, DC2:2}];


I had to do this because i expect that K1 will get a large volume of 
data and i do not want this wired over to the other DC.


I am writing the data at CL=ONE and reading the data at CL=ONE. I am 
seeing an issue where sometimes i get the data and other times i do 
not see the data. Does anyone know what could be going on here?


A second larger question is  - i am migrating from 0.7.4 to 0.8.7 , i 
can see that there are large changes in the yaml file, but a specific 
question i had was - how do i configure disk_access_mode like it used 
to be in 0.7.4?


One observation i have made is that some nodes of the cross dc cluster 
are at different system times. This is something to fix but could this 
be why data is sometimes retrieved and other times not? Or is there 
some other thing to it?


Would appreciate a quick response.


Not all nodes see the complete ring

2011-11-10 Thread Timothy Smith
I'm curious if anyone has ever seen this happen or has any idea how it
would happen.  I have a 10-node cluster with 5 nodes in each data
center running 0.6 (we're working on the upgrade now).  I had several
nodes with forgotten deletes, so I failed the nodes and bootstrapped
them back into the cluster one at a time.  Everything seemed fine, but
now I'm noticing that all systems in one of my data centers see all 10
nodes while all systems in the other data center see just 9.  I'm
figuring now is the time to fail the node that only half the other
nodes can see, but what would cause this to happen?

-Tim Smith

DC1
10.x.x.45   Up    162065212751151145161126595807335373   |<--|
10.x.x.44   Up   5452449782323250074504667089218893518   |   ^
10.x.x.43   Up   8114257989534302620064490155463988554   v   |
10.x.x.46   Up  21422567192334579300859480282267974118   |   ^
10.x.x.60   Up  54861697885175209049354960363878287097   v   |
10.x.x.69   Up 154328840302872203985032035664154382201   |   ^
10.x.x.62   Up 156995951391754654763374624484548356765   v   |
10.x.x.61   Up 158321671343439891722659169797597266747   |   ^
10.x.x.47   Up 159657096314745030420102789742477598562   |-->|

DC2
10.x.x.45   Up    162065212751151145161126595807335373   |<--|
10.x.x.44   Up   5452449782323250074504667089218893518   |   ^
10.x.x.43   Up   8114257989534302620064490155463988554   v   |
10.x.x.46   Up  21422567192334579300859480282267974118   |   ^
10.x.x.60   Up  54861697885175209049354960363878287097   v   |
10.x.x.71   Up 149032703168324939856639604911542585192   |   ^
10.x.x.69   Up 154328840302872203985032035664154382201   v   |
10.x.x.62   Up 156995951391754654763374624484548356765   |   ^
10.x.x.61   Up 158321671343439891722659169797597266747   v   |
10.x.x.47   Up 159657096314745030420102789742477598562   |-->|


Re: Data retrieval inconsistent

2011-11-10 Thread Subrahmanya Harve
Thanks Ed and Jeremiah for that useful info.
"I am pretty sure the way you have K1 configured it will be placed across
both DC's as if you had large ring.  If you want it only in DC1 you need to
say DC1:1, DC2:0."
In fact I do want K1 to be available across both DCs as if I had a large
ring; I just do not want it to replicate across DCs. I also did try it the
way you said (DC1:1, DC2:0), but won't that mean that all my data goes into
DC1, irrespective of whether it arrives at nodes in DC1 or DC2, thereby
creating a "hot DC"? Since the volume of data for this case is huge, that
might create a load imbalance on DC1. (Am I missing something?)


On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan <
jeremiah.jor...@morningstar.com> wrote:

> I am pretty sure the way you have K1 configured it will be placed across
> both DC's as if you had large ring.  If you want it only in DC1 you need to
> say DC1:1, DC2:0.
> If you are writing and reading at ONE you are not guaranteed to get the
> data if RF > 1.  If RF = 2, and you write with ONE, you data could be
> written to server 1, and then read from server 2 before it gets over there.
>
> The differing on server times will only really matter for TTL's.  Most
> everything else works off comparing user supplied times.
>
> -Jeremiah
>
>
> On 11/10/2011 02:27 PM, Subrahmanya Harve wrote:
>
>>
>> I am facing an issue in 0.8.7 cluster -
>>
>> - I have two clusters in two DCs (rather one cross dc cluster) and two
>> keyspaces. But i have only configured one keyspace to replicate data to the
>> other DC and the other keyspace to not replicate over to the other DC.
>> Basically this is the way i ran the keyspace creation  -
>>create keyspace K1 with placement_strategy='org.**
>> apache.cassandra.locator.**SimpleStrategy' and strategy_options =
>> [{replication_factor:1}];
>>create keyspace K2 with placement_strategy='org.**
>> apache.cassandra.locator.**NetworkTopologyStrategy' and strategy_options
>> = [{DC1:2, DC2:2}];
>>
>> I had to do this because i expect that K1 will get a large volume of data
>> and i do not want this wired over to the other DC.
>>
>> I am writing the data at CL=ONE and reading the data at CL=ONE. I am
>> seeing an issue where sometimes i get the data and other times i do not see
>> the data. Does anyone know what could be going on here?
>>
>> A second larger question is  - i am migrating from 0.7.4 to 0.8.7 , i can
>> see that there are large changes in the yaml file, but a specific question
>> i had was - how do i configure disk_access_mode like it used to be in 0.7.4?
>>
>> One observation i have made is that some nodes of the cross dc cluster
>> are at different system times. This is something to fix but could this be
>> why data is sometimes retrieved and other times not? Or is there some other
>> thing to it?
>>
>> Would appreciate a quick response.
>>
>


changing ownership when nodes have same token

2011-11-10 Thread Feng Qu
Hello, 

I notice that when starting a new node with the same configuration (cluster 
name, seeds, token, etc.) as an existing ring member, the new node will take 
over ownership from the existing ring member. Is this expected behavior? I 
would like to see Cassandra prevent the new node from joining the ring, since 
in this case the new node is mis-configured.
 
Feng

Re: Data retrieval inconsistent

2011-11-10 Thread Jeremiah Jordan
No, that is what I thought you wanted.  I was thinking your machines in 
DC1 had extra disk space or something...


(I stopped replying to the dev list)

On 11/10/2011 04:09 PM, Subrahmanya Harve wrote:


Thanks Ed and Jeremiah for that useful info.
"I am pretty sure the way you have K1 configured it will be placed across
both DC's as if you had large ring.  If you want it only in DC1 you 
need to

say DC1:1, DC2:0."
Infact i do want K1 to be available across both DCs as if i had a large
ring. I just do not want them to replicate over across DCs. Also i did try
doing it like you said DC1:1, DC2:0 but wont that mean that, all my data
goes into DC1 irrespective of whether the data is getting into the 
nodes of

DC1 or DC2, thereby creating a "hot DC"? Since the volume of data for this
case is huge, that might create a load imbalance on DC1? (Am i missing
something?)


On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan <
jeremiah.jor...@morningstar.com> wrote:

> I am pretty sure the way you have K1 configured it will be placed across
> both DC's as if you had large ring.  If you want it only in DC1 you 
need to

> say DC1:1, DC2:0.
> If you are writing and reading at ONE you are not guaranteed to get the
> data if RF > 1.  If RF = 2, and you write with ONE, you data could be
> written to server 1, and then read from server 2 before it gets over 
there.

>
> The differing on server times will only really matter for TTL's.  Most
> everything else works off comparing user supplied times.
>
> -Jeremiah
>
>
> On 11/10/2011 02:27 PM, Subrahmanya Harve wrote:
>
>>
>> I am facing an issue in 0.8.7 cluster -
>>
>> - I have two clusters in two DCs (rather one cross dc cluster) and two
>> keyspaces. But i have only configured one keyspace to replicate 
data to the

>> other DC and the other keyspace to not replicate over to the other DC.
>> Basically this is the way i ran the keyspace creation  -
>>create keyspace K1 with placement_strategy='org.**
>> apache.cassandra.locator.**SimpleStrategy' and strategy_options =
>> [{replication_factor:1}];
>>create keyspace K2 with placement_strategy='org.**
>> apache.cassandra.locator.**NetworkTopologyStrategy' and 
strategy_options

>> = [{DC1:2, DC2:2}];
>>
>> I had to do this because i expect that K1 will get a large volume 
of data

>> and i do not want this wired over to the other DC.
>>
>> I am writing the data at CL=ONE and reading the data at CL=ONE. I am
>> seeing an issue where sometimes i get the data and other times i do 
not see

>> the data. Does anyone know what could be going on here?
>>
>> A second larger question is  - i am migrating from 0.7.4 to 0.8.7 , 
i can
>> see that there are large changes in the yaml file, but a specific 
question
>> i had was - how do i configure disk_access_mode like it used to be 
in 0.7.4?

>>
>> One observation i have made is that some nodes of the cross dc cluster
>> are at different system times. This is something to fix but could 
this be
>> why data is sometimes retrieved and other times not? Or is there 
some other

>> thing to it?
>>
>> Would appreciate a quick response.
>>
>



Mass deletion -- slowing down

2011-11-10 Thread Maxim Potekhin

Hello,

My data load comes in batches representing one day in the life of a large 
computing facility. I index the data by the day it was produced, to be able 
to quickly pull data for a specific day within the last year or two. There 
are 6 other indexes.

When it comes to retiring the data, I intend to delete it for the oldest 
date and after that add a fresh batch of data, so I control the disk space. 
Therein lies a problem -- and it may be Pycassa-related, so I also filed an 
issue on GitHub -- when I select by 'DATE=blah' and then do a batch remove, 
it works fine for a while, and then after a few thousand deletions (done in 
batches of 1000) it grinds to a halt, i.e. I can no longer iterate the 
result, which manifests as a timeout error.

Is that a behavior seen before? Cassandra version is 0.8.6, Pycassa 1.3.0.

TIA,

Maxim



Is there a way to get only keys with get_indexed_slices?

2011-11-10 Thread Maxim Potekhin


Is there a way to get only keys with get_indexed_slices?
Looking at the code, it's not possible, but -- is there some way anyhow?
I don't want to extract any data, just a list of matching keys.

TIA,

Maxim



Re: Data retrieval inconsistent

2011-11-10 Thread Subrahmanya Harve
Thanks.
I'm gonna try and use QUORUM to read and/or write and see if data is
returned consistently.


On Thu, Nov 10, 2011 at 3:00 PM, Jeremiah Jordan <
jeremiah.jor...@morningstar.com> wrote:

>  No, that is what I thought you wanted.  I was thinking your machines in
> DC1 had extra disk space or something...
>
> (I stopped replying to the dev list)
>
>
> On 11/10/2011 04:09 PM, Subrahmanya Harve wrote:
>
> Thanks Ed and Jeremiah for that useful info.
> "I am pretty sure the way you have K1 configured it will be placed across
> both DC's as if you had large ring.  If you want it only in DC1 you need to
> say DC1:1, DC2:0."
> Infact i do want K1 to be available across both DCs as if i had a large
> ring. I just do not want them to replicate over across DCs. Also i did try
> doing it like you said DC1:1, DC2:0 but wont that mean that, all my data
> goes into DC1 irrespective of whether the data is getting into the nodes of
> DC1 or DC2, thereby creating a "hot DC"? Since the volume of data for this
> case is huge, that might create a load imbalance on DC1? (Am i missing
> something?)
>
>
> On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan <
> jeremiah.jor...@morningstar.com> wrote:
>
> > I am pretty sure the way you have K1 configured it will be placed across
> > both DC's as if you had large ring.  If you want it only in DC1 you need
> to
> > say DC1:1, DC2:0.
> > If you are writing and reading at ONE you are not guaranteed to get the
> > data if RF > 1.  If RF = 2, and you write with ONE, you data could be
> > written to server 1, and then read from server 2 before it gets over
> there.
> >
> > The differing on server times will only really matter for TTL's.  Most
> > everything else works off comparing user supplied times.
> >
> > -Jeremiah
> >
> >
> > On 11/10/2011 02:27 PM, Subrahmanya Harve wrote:
> >
> >>
> >> I am facing an issue in 0.8.7 cluster -
> >>
> >> - I have two clusters in two DCs (rather one cross dc cluster) and two
> >> keyspaces. But i have only configured one keyspace to replicate data to
> the
> >> other DC and the other keyspace to not replicate over to the other DC.
> >> Basically this is the way i ran the keyspace creation  -
> >>create keyspace K1 with placement_strategy='org.**
> >> apache.cassandra.locator.**SimpleStrategy' and strategy_options =
> >> [{replication_factor:1}];
> >>create keyspace K2 with placement_strategy='org.**
> >> apache.cassandra.locator.**NetworkTopologyStrategy' and strategy_options
>
> >> = [{DC1:2, DC2:2}];
> >>
> >> I had to do this because i expect that K1 will get a large volume of
> data
> >> and i do not want this wired over to the other DC.
> >>
> >> I am writing the data at CL=ONE and reading the data at CL=ONE. I am
> >> seeing an issue where sometimes i get the data and other times i do not
> see
> >> the data. Does anyone know what could be going on here?
> >>
> >> A second larger question is  - i am migrating from 0.7.4 to 0.8.7 , i
> can
> >> see that there are large changes in the yaml file, but a specific
> question
> >> i had was - how do i configure disk_access_mode like it used to be in
> 0.7.4?
> >>
> >> One observation i have made is that some nodes of the cross dc cluster
> >> are at different system times. This is something to fix but could this
> be
> >> why data is sometimes retrieved and other times not? Or is there some
> other
> >> thing to it?
> >>
> >> Would appreciate a quick response.
> >>
> >
>
>


is that possible to add more data structure(key-list) in cassandra?

2011-11-10 Thread Yan Chunlu
I think Cassandra is doing a great job as a key-value data store; it has
saved me tremendous work on maintaining data consistency and service
availability. But I think it would be great if it could support more data
structures, such as key-list. Currently I am using key-value to save the
list, and it seems not very efficient. Redis has a good story here, but it
is not easy to scale.

Maybe this is the wrong place and the wrong question; I am only curious
whether there is already a solution for this. Thanks a lot!


Efficient map reduce over ranges of Cassandra data

2011-11-10 Thread Edward Capriolo
Hey all,

I know there are several tickets in the pipe that should make it possible
to use secondary indexes to run map reduce jobs that do not have to ingest
the entire dataset, such as:

https://issues.apache.org/jira/browse/CASSANDRA-1600

I had ended up creating a sharded secondary index in user space (I just
call it ordered buckets), described here:

http://www.slideshare.net/edwardcapriolo/casbase-presentation/27

Looking at the ordered buckets implementation I realized it is a perfect
candidate for "efficient map reduce" since it is easy to split.

A unit test of that implementation is here:

https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java

With this you can currently do efficient map reduce on Cassandra data,
while waiting for other integrated solutions to come along.


range slice with TimeUUID column names

2011-11-10 Thread footh
I am using Hector to do a range query on a column family that uses TimeUUIDs 
as column names.  However, I'm not sure how to create the "range".  I figured 
I'd create some UUIDs using the com.eaio.uuid library with timestamps for the 
range I was interested in.  When trying this, I don't get any results back, even 
though I am sure there are UUIDs in the time range.  Is this the correct way to 
do this?  Here's how I'm creating my range UUIDs, where 'start' and 'finish' are 
millisecond timestamps from java.util.Date:

import com.eaio.uuid.UUID;
import com.eaio.uuid.UUIDGen;

// start/finish are millisecond timestamps, e.g. from java.util.Date.getTime()
UUID startId = new UUID(UUIDGen.createTime(start), UUIDGen.getClockSeqAndNode());
UUID finishId = new UUID(UUIDGen.createTime(finish), UUIDGen.getClockSeqAndNode());
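
One thing worth double-checking is the unit passed in: UUIDGen.createTime
expects a millisecond timestamp and converts to the UUID 100-nanosecond
epoch internally, so passing an already-converted value will land far
outside the intended range. If you would rather not build the boundary
UUIDs by hand, Hector ships a helper for this (assuming a reasonably
recent Hector version on the classpath):

import me.prettyprint.cassandra.utils.TimeUUIDUtils;

java.util.UUID startId  = TimeUUIDUtils.getTimeUUID(start);   // millisecond timestamp
java.util.UUID finishId = TimeUUIDUtils.getTimeUUID(finish);  // millisecond timestamp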



Re: 1.0.2 Assertion Error

2011-11-10 Thread Radim Kolar

On 10.11.2011 22:18, Dan Hendry wrote:

Is it possible to downgrade back to 0.8? Is there any way to convert 'h'
version SSTables to the old 'g' version? Any other data file changes to be
aware of?
Try adding a 0.8 node to the cluster and decommissioning the 1.0 node;
maybe 0.8 will understand streams from 1.0.


Re: is that possible to add more data structure(key-list) in cassandra?

2011-11-10 Thread Radim Kolar

On 11.11.2011 5:58, Yan Chunlu wrote:
I think cassandra is doing great job on key-value data store, it saved 
me tremendous work on maintain the data consistency and service 
availability.But I think it would be great if it could support 
more data structures such as key-list, currently I am using key-value 
save the list, it seems not very efficiency. Redis has a good point on 
this but it is not easy to scale.


Maybe it is a wrong place and wrong question, only curious if there is 
already solution about this, thanks a lot!

Use supercolumns, unless your lists are very large.
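
A minimal Hector-style sketch of that layout, with one row per list and one
super column per list entry (the super column family "Lists" and all names
are illustrative; entries sort by the CF's comparator rather than by
insertion order):

import java.util.Arrays;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HSuperColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class KeyListSketch {
    private static final StringSerializer SS = StringSerializer.get();

    public static void addEntry(Keyspace ks, String listKey,
                                String entryId, String value) {
        // One super column per list entry, holding the entry's fields.
        HSuperColumn<String, String, String> entry = HFactory.createSuperColumn(
                entryId,
                Arrays.asList(HFactory.createStringColumn("value", value)),
                SS, SS, SS);

        Mutator<String> m = HFactory.createMutator(ks, SS);
        m.insert(listKey, "Lists", entry);
    }
}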


configurable bloom filters (like hbase)

2011-11-10 Thread Radim Kolar
I have a problem with a large CF (about 200 billion entries per node). 
While I can configure index_interval to lower memory requirements, I 
still have to stick with huge bloom filters.


Ideal would be to have bloom filters configurable like in HBase. 
Cassandra's standard is about a 1.05% false-positive rate, but in my case I 
would be fine even with a 20% false-positive rate. The data is not often 
read back; most of it will never be read before it expires via TTL.
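
The memory argument is easy to make concrete: with an optimal number of hash
functions, a Bloom filter needs about m/n = ln(1/p) / (ln 2)^2 bits per entry
for false-positive rate p. That is roughly 9.6 bits per entry at p = 1% but
only about 3.4 bits at p = 20%, so a relaxed per-CF setting would cut filter
memory by nearly two thirds.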


Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?

2011-11-10 Thread Maciej Miklas
I would like to know this as well. It should actually behave similarly,
plus there would be no dependencies on sun.misc packages.

Regards,
Maciej

On Thu, Nov 10, 2011 at 1:46 PM, Benoit Perroud  wrote:

> Thanks for the answer.
> I saw the move to sun.misc.
> In what sense allocateDirect is broken ?
>
> Thanks,
>
> Benoit.
>
>
> 2011/11/9 Jonathan Ellis :
> > allocateDirect is broken for this purpose, but we removed the JNA
> > dependency using sun.misc.Unsafe instead:
> > https://issues.apache.org/jira/browse/CASSANDRA-3271
> >
> > On Wed, Nov 9, 2011 at 5:54 AM, Benoit Perroud 
> wrote:
> >> Hi,
> >>
> >> I wonder if you have already discussed about ByteBuffer.allocateDirect
> >> alternative to JNA memory allocation ?
> >>
> >> If so, do someone mind send me a pointer ?
> >>
> >> Thanks !
> >>
> >> Benoit.
> >>
> >
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
> >
>
>
>
> --
> sent from my Nokia 3210
>


configurable index_interval per keyspace

2011-11-10 Thread Radim Kolar
It would be good to have index_interval configurable per keyspace, 
preferably in cassandra.yaml, because I use it to tune nodes that are 
running out of memory without noticeably affecting performance.