Sorting by timestamp

2010-05-29 Thread Erik
Hi, 

I have a list of posts I'm trying to insert into Cassandra.  Each post has a
timestamp already (in the past) that is not necessarily unique.  I'm trying to
insert these posts into Cassandra so that when I retrieve them, they are ordered
chronologically.  When I insert them, I can't guarantee that the order of
insertion matches the actual timestamp order of the posts.

Can someone help me find a solution for this?

Thanks.

Erik
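
A minimal sketch of the usual answer, expressed in today's CQL terms (this
thread predates CQL): make the post timestamp part of the primary key as a
clustering column, and add a UUID tiebreaker since the timestamps are not
unique. Keyspace, table, and column names below are illustrative, not from the
thread.

-- sketch --
#!/usr/bin/env python
# Assumes the DataStax cassandra-driver and a reachable local node.
from datetime import datetime
from uuid import uuid4
from cassandra.cluster import Cluster

session = Cluster(["localhost"]).connect("blog")   # hypothetical keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        author text,
        created timestamp,
        post_id uuid,        -- tiebreaker: timestamps are not unique
        body text,
        PRIMARY KEY (author, created, post_id)
    )
""")
# Insertion order is irrelevant; rows are clustered by (created, post_id).
session.execute(
    "INSERT INTO posts (author, created, post_id, body) VALUES (%s, %s, %s, %s)",
    ("erik", datetime(2010, 5, 29, 12, 0), uuid4(), "an old post"))
for row in session.execute(
        "SELECT created, body FROM posts WHERE author = %s", ("erik",)):
    print(row.created, row.body)   # chronological, regardless of insert order
-- end of sketch --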
 



Re: Datafile Corruption

2019-08-14 Thread Forkalsrud, Erik
The dmesg command will usually show information about hardware errors.

An example from a spinning disk:
sd 0:0:10:0: [sdi] Unhandled sense code
sd 0:0:10:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:10:0: [sdi] Sense Key : Medium Error [current]
Info fld=0x6fc72
sd 0:0:10:0: [sdi] Add. Sense: Unrecovered read error
sd 0:0:10:0: [sdi] CDB: Read(10): 28 00 00 06 fc 70 00 00 08 00


Also, you can read the file, like "cat /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db > /dev/null".
If you get an error message, it's probably a hardware issue.
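
A scripted version of that check, as a sketch: walk the data directory and
read every file, reporting any I/O error. The directory path is an assumption
taken from the example above; point it at your own data_file_directories.

-- sketch --
#!/usr/bin/env python
# Equivalent of "cat <file> > /dev/null" for every sstable file.
import os

DATA_DIR = "/data/ssd2/data"   # assumption; adjust to your layout

for dirpath, _, filenames in os.walk(DATA_DIR):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                while f.read(1 << 20):   # read and discard 1 MiB chunks
                    pass
        except IOError as e:
            print("read error (possible hardware issue): %s: %s" % (path, e))
-- end of sketch --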

- Erik -


From: Philip Ó Condúin 
Sent: Thursday, August 8, 2019 09:58
To: user@cassandra.apache.org 
Subject: Re: Datafile Corruption

Hi Jon,

Good question, I'm not sure if we're using NVMe. I don't see /dev/nvme, but we
could still be using it.
We're using Cisco UCS C220 M4 SFF, so I'm just going to check the spec.

Our kernel is the following; we're using Red Hat, so I'm told we can't upgrade
the version until the next major release anyway.
root@cass 0 17:32:28 ~ # uname -r
3.10.0-957.5.1.el7.x86_64

Cheers,
Phil

On Thu, 8 Aug 2019 at 17:35, Jon Haddad <j...@jonhaddad.com> wrote:
Any chance you're using NVMe with an older Linux kernel?  I've seen a *lot* of
filesystem errors from using older CentOS versions.  You'll want to be using a
version > 4.15.

On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin <philipocond...@gmail.com> wrote:
@Jeff - If it was hardware that would explain it all, but do you think it's
possible to have every server in the cluster with a hardware issue?
The data is sensitive and the customer would lose their mind if I sent it
off-site, which is a pity because I could really do with the help.
The corruption is occurring irregularly on every server, instance, and column
family in the cluster.  Out of 72 instances, we are getting maybe 10 corrupt
files per day.
We are using vnodes (256) and it is happening in both DCs.

@Asad - internode compression is set to ALL on every server.  I have checked
the packets for the private interconnect and I can't see any dropped packets;
there are dropped packets for other interfaces, but not for the private ones.
I will get the network team to double-check this.
The corruption is only on the application schema; we are not getting corruption
on any system or cass keyspaces.  Corruption is happening in both DCs.  We are
getting corruption for the 1 application schema we have across all tables in
the keyspace; it's not limited to one table.
I'm not sure why the app team decided to not use default compression, I must
ask them.



I have been checking the /var/log/messages today going back a few weeks and can 
see a serious amount of broken pipe errors across all servers and instances.
Here is a snippet from one server but most pipe errors are similar:

Jul  9 03:00:08  cassandra: INFO  02:00:08 Writing Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072 ops, 0%/0% of on/off-heap limit)
Jul  9 03:00:13  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:19  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:22  cassandra: ERROR 02:00:22 Got an IOException during write!
Jul  9 03:00:22  cassandra: java.io.IOException: Broken pipe
Jul  9 03:00:22  cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165) ~[libthrift-0.9.2.jar:0.9.2]
Jul  9 03:00:22  cassandra: at com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104) ~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112) ~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at com.thinkaurelius.thrift.Message.write(Message.java:222) ~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598) [thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569) [thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423) [thrift-server-0.3.7.

Lots of hints, but only on a few nodes

2016-05-10 Thread Erik Forsberg
I have this situation where a few (like, 3-4 out of 84) nodes misbehave. 
Very long GC pauses, dropping out of cluster etc.


This happens while loading data (via CQL), and analyzing metrics it 
looks like on these few nodes, a lot of hints are being generated close 
to the time when they start to misbehave.


Since this is Cassandra 2.0.13, which has a less than optimal hints 
implementation, large numbers of hints are a GC troublemaker.


Again looking at metrics, it looks like hints are being generated for a 
large number of nodes, so it doesn't look like the destination nodes are 
at fault. So, I'm confused.


Any Hints (pun intended) on what could cause a few nodes to generate 
more hints than the rest of the cluster?


Regards,
\EF


Extending a partially upgraded cluster - supported

2016-05-18 Thread Erik Forsberg

Hi!

I have a 2.0.13 cluster which I need to do two things with:

* Extend it
* Upgrade to 2.1.14

I'm pondering in what order to do things. Is it a supported operation to 
extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1 
where not all sstables have been upgraded?


If I do that, will the sstables written on the new nodes be in the 2.1 
format when I add them? Or will they be written in the 2.0 format so 
I'll have to run upgradesstables anyway?


The cleanup I do on the existing nodes, will write the new 2.1 format, 
right?


There might be other reasons not to do this, one being that it's seldom 
wise to do many operations at once. So please enlighten me on how bad an 
idea this is :-)


Thanks,
\EF


Re: Extending a partially upgraded cluster - supported

2016-05-18 Thread Erik Forsberg



On 2016-05-18 20:19, Jeff Jirsa wrote:

You can’t stream between versions, so in order to grow the cluster, you’ll need 
to be entirely on 2.0 or entirely on 2.1. If you go to 2.1 first, be sure you 
run upgradesstables before you try to extend the cluster.


OK. I was sure you can't stream between a 2.0 node and a 2.1 node, but 
if I understand you correctly you can't stream between two 2.1 nodes 
unless the sstables on the source node have been upgraded to "ka", i.e. 
the 2.1 sstable version?


Looks like it's extend first, upgrade later, given that we're a bit 
close on disk capacity.


Thanks,
\EF









upgradesstables/cleanup/compaction strategy change

2016-05-23 Thread Erik Forsberg

Hi!

I have a 2.0.13 cluster which I have just extended, and I'm now looking 
into upgrading it to 2.1.


* The cleanup after the extension is partially done.
* I'm also looking into changing a few tables into Leveled Compaction 
Strategy.


In the interest of speeding things up by avoiding unnecessary rewrites 
of data, I'm pondering if I can:


1. Upgrade to 2.1, then run cleanup instead of upgradesstables, getting 
cleanup + upgrade of the sstable format to ka at the same time?


2. Upgrade to 2.1, then change compaction strategy and get LCS + upgrade 
of sstable format to ka at the same time?


Comments on that?

Thanks,
\EF


RE: Doing rolling upgrades on two DCs

2016-08-06 Thread Forkalsrud, Erik

You didn't mention which version you're upgrading to, but generally you 
shouldn't add or remove nodes while the nodes in your cluster are not all on 
the same version. So if you're going to add nodes, you should do that before 
you upgrade the first node, or after you have upgraded all of them.

Shifting traffic away from DC1 may not be necessary, but shouldn't hurt.


- EF

_
From: Timon Wong [timon86.w...@gmail.com]
Sent: Saturday, August 6, 2016 02:03
To: user@cassandra.apache.org
Subject: Doing rolling upgrades on two DCs

Hello, newbie here.

Currently we are using Cassandra v2.2.1 (two DCs, named DC1 and DC2, 
respectively), and recently we have hit two issues:

1. Cannot add new nodes, due to CASSANDRA-10961
2. Cannot repair nodes, due to CASSANDRA-10501

We learned that we can do a rolling upgrade. Since we are only using DC1 for 
now (DC2 is just for replicas), we have the following steps in mind:

1. Back up;
2. Upgrade DC2 (following this guide: 
https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandraDetails.html);
3. Do some checks (client compatibility, etc.);
4. Add nodes to DC2 (using the new Cassandra);
5. Shift our traffic to DC2;
6. Upgrade DC1.

Is that possible, or did we miss something?

Thanks in advance!

Timon


Re: Doing rolling upgrades on two DCs

2016-08-08 Thread Erik Forkalsrud


The low-risk approach would be to change one thing at a time, but I don't 
think there's a technical reason you couldn't change the commit log and data 
directory locations while updating to a new version.

- EF

On 08/08/2016 07:58 AM, Timon Wong wrote:

Thank you!
We would do a patch level upgrade, from 2.2.1 to 2.2.7 (the latest 
stable). Besides, we want to adjust both `commitlog_directory` and 
`data_file_directories` settings (separate them into two better IOPS 
volumes), and we assumed they should be done before or after upgrade?








How are writes handled while adding nodes to cluster?

2015-10-06 Thread Erik Forsberg
Hi!

How are writes handled while I'm adding a node to a cluster, i.e. while
the new node is in JOINING state?

Are they queued up as hinted handoffs, or are they being written to the
joining node?

In the former case I guess I have to make sure my max_hint_window_in_ms
is long enough for the node to become NORMAL or hints will get dropped
and I must do repair. Am I right?

Thanks,
\EF


A few misbehaving nodes

2016-04-19 Thread Erik Forsberg

Hi!

I have this problem where 3 of my 84 nodes misbehave with overly long GC 
times, leading to them being marked as DN.


This happens when I load data to them using CQL from a hadoop job, so 
quite a lot of inserts at a time. The CQL loading job is using 
TokenAwarePolicy with fallback to DCAwareRoundRobinPolicy. Cassandra 
java driver version 2.1.7.1 is in use.


My other observation is that around the time the GC starts to work like 
crazy, there is a lot of outbound network traffic from the troublesome 
nodes. If a healthy node has around 25 Mbit/s in, 25 Mbit/s out, an 
unhealthy sees 25 Mbit/s in, 200 Mbit/s out.


So, something is iffy with these 3 nodes, but I have some trouble 
finding out exactly what makes them differ.


This is Cassandra 2.0.13 (yes, old) using vnodes. Keyspace is using 
NetworkTopologyStrategy with replication 2, in one datacenter.


One thing I know I'm doing wrong is that I have slightly differing 
numbers of hosts in each of my 6 chassis (one of them has 15 nodes, one 
has 13, the remaining have 14). Could what I'm seeing here be the 
effect of that?


Other ideas on what could be wrong? Some kind of vnode imbalance? How 
can I diagnose that? What metrics should I be looking at?


Thanks,
\EF




Re: A few misbehaving nodes

2016-04-21 Thread Erik Forsberg



On 2016-04-19 15:54, sai krishnam raju potturi wrote:

hi;
   do we see any hung processes, like repairs, on those 3 nodes?  What 
does "nodetool netstats" show?


No hung process from what I can see.

root@cssa02-06:~# nodetool tpstats
Pool Name                    Active   Pending   Completed   Blocked  All time blocked
ReadStage                         0         0     1530227         0                 0
RequestResponseStage              0         0    19230947         0                 0
MutationStage                     0         0    37059234         0                 0
ReadRepairStage                   0         0       80178         0                 0
ReplicateOnWriteStage             0         0           0         0                 0
GossipStage                       0         0       43003         0                 0
CacheCleanupExecutor              0         0           0         0                 0
MigrationStage                    0         0           0         0                 0
MemoryMeter                       0         0         267         0                 0
FlushWriter                       0         0         202         0                 5
ValidationExecutor                0         0         212         0                 0
InternalResponseStage             0         0           0         0                 0
AntiEntropyStage                  0         0         427         0                 0
MemtablePostFlusher               0         0         669         0                 0
MiscStage                         0         0         212         0                 0
PendingRangeCalculator            0         0          70         0                 0
CompactionExecutor                0         0        1206         0                 0
commitlog_archiver                0         0           0         0                 0
HintedHandoff                     0         1         113         0                 0


Message type   Dropped
RANGE_SLICE  1
READ_REPAIR  0
PAGED_RANGE  0
BINARY   0
READ   219
MUTATION 3
_TRACE   0
REQUEST_RESPONSE 2
COUNTER_MUTATION 0

root@cssa02-06:~# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 75317
Mismatch (Blocking): 0
Mismatch (Background): 11
Pool Name                    Active   Pending  Completed
Commands                        n/a         1   19248846
Responses                       n/a         0   19875699

\EF


Dynamic Column Families in CQLSH v3

2012-08-23 Thread Erik Onnen
Hello All,

Attempting to create what the Datastax 1.1 documentation calls a
Dynamic Column Family
(http://www.datastax.com/docs/1.1/ddl/column_family#dynamic-column-families)
via CQLSH.

This works in v2 of the shell:

"create table data ( key varchar PRIMARY KEY) WITH comparator=LongType;"

When defined this way via v2 shell, I can successfully switch to v3
shell and query the CF fine.

The same syntax in v3 yields:

"Bad Request: comparator is not a valid keyword argument for CREATE TABLE"

The 1.1 documentation indicates that comparator is a valid option for
at least ALTER TABLE:

http://www.datastax.com/docs/1.1/configuration/storage_configuration#comparator

This leads me to believe that the correct way to create a dynamic
column family is to create a table with no named columns and alter the
table later but that also does not work:

"create table data (key varchar PRIMARY KEY);"

yields:

"Bad Request: No definition found that is not part of the PRIMARY KEY"

So, my question is, how do I create a Dynamic Column Family via the CQLSH v3?

Thanks!
-erik
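
For reference, the CQL3 shape of such a dynamic column family is a COMPACT
STORAGE table in which the comparator becomes a clustering column and the cell
value a regular column. A sketch below, using today's Python driver purely for
illustration (the thread itself used cqlsh); keyspace and table names are made
up.

-- sketch --
#!/usr/bin/env python
from cassandra.cluster import Cluster

session = Cluster(["localhost"]).connect("ks")   # hypothetical keyspace
# CQL3 equivalent of "comparator=LongType" on a dynamic column family:
session.execute("""
    CREATE TABLE data (
        key varchar,
        column1 bigint,   -- plays the role of the LongType comparator
        value text,
        PRIMARY KEY (key, column1)
    ) WITH COMPACT STORAGE
""")
-- end of sketch --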


Re: best way to clean up a column family? 60Gig of dangling data

2013-02-28 Thread Erik Forkalsrud


Have you tried to call (via JMX) 
org.apache.cassandra.db.CompactionManager.forceUserDefinedCompaction() 
and give it the name of your SSTable file?


It's a trick I use to aggressively get rid of expired data, i.e. if I 
have a column family where all data is written with a TTL of 30 days, 
any SSTable files with last modified time of more than 30 days ago will 
have only expired data, so I call the above function to compact those 
files one by one.  In your case it sounds like it's not expired data, 
but data that belongs on other nodes that you want to get rid of.  I'm 
not sure if compaction will drop data that doesn't fall within the nodes 
key range, but if it does this method should have the effect you're after.



- Erik -
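
A sketch of how the candidate files for that trick can be picked out: list the
-Data.db files whose last-modified time is older than the TTL, then hand each
name to forceUserDefinedCompaction through your JMX tool of choice. Directory
and TTL below are assumptions.

-- sketch --
#!/usr/bin/env python
# Finds sstables that should contain only expired data, assuming every
# column in the CF was written with the same TTL.
import os
import time

DATA_DIR = "/var/lib/cassandra/data/myks/mycf"   # assumption
TTL_DAYS = 30                                    # assumption

cutoff = time.time() - TTL_DAYS * 86400
for name in sorted(os.listdir(DATA_DIR)):
    if name.endswith("-Data.db"):
        if os.path.getmtime(os.path.join(DATA_DIR, name)) < cutoff:
            print("candidate for forceUserDefinedCompaction:", name)
-- end of sketch --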


On 02/27/2013 08:51 PM, Hiller, Dean wrote:

Okay, we had 6 nodes of 130Gig and it was slowly increasing.  Through our 
operations to modify bloomfilter fp chance, we screwed something up as trying 
to relieve memory pressures was tough.  Anyways, somehow, this caused nodes 1, 
2, and 3 to jump to around 200Gig and our incoming data stream is completely 
constant at around 260 points/second.

Sooo, we know this dangling data (around 60 gigs) is in one single column 
family.  Nodes 1, 2, and 3 are for the first token range according to 
ringdescribe.  It is almost like the issue is now replicated to the other two 
nodes.  Is there any way we can go about debugging this and release the 60 
gigs of disk space?

Also, the upgradesstables when memory is already close to max is not working 
too well.  Can we do this instead(ie. Is it safe?)?

  1.  Bring down the node
  2.  Move all the *Index.db files to another directory
  3.  Start the node and run upgradesstables

We know this relieves a ton of memory out of the gate for us.  We are trying to 
get memory back down by a gig, then upgrade to 1.2.2 and switch to leveled 
compaction as we have ZERO I/o really going on most of the time and really just 
have this bad bad memory bottleneck(iostat shows nothing typically as we are 
bottlenecked by memory).

Thanks,
Dean




Re: Unable to connect to C* nodes except one node (same configuration)

2017-07-25 Thread Erik Forkalsrud

On 07/25/2017 05:13 AM, Junaid Nasir wrote:

listen_address: 10.128.1.1
rpc_address: 10.128.1.1


Are these the values on all three nodes?

If so, try with empty values:

listen_address:
rpc_address:

or make sure each node has its own IP address configured.




sstableloader and ttls

2014-08-16 Thread Erik Forsberg
Hi!

If I use sstableloader to load data to a cluster, and the source
sstables contain some columns where the TTL has expired, i.e. the
sstable has not yet been compacted - will those entries be properly
removed on the destination side?

Thanks,
\EF


Running sstableloader from live Cassandra server

2014-08-16 Thread Erik Forsberg
Hi!

I'm looking into moving some data from one Cassandra cluster to another,
both of them running Cassandra 1.2.13 (or maybe some later 1.2 version
if that helps me avoid some fatal bug). Sstableloader will probably be
the right thing for me, and given the size of my tables, I will want to
run the sstableloader on the source cluster, but at the same time, that
source cluster needs to keep running to serve data to clients.

If I understand the docs right, this means I will have to:

1. Bring up a new network interface on each of my source nodes. No
problem, I have an IPv6 /64 to choose from :-)

2. Put a cassandra.yaml in the classpath of the sstableloader that
differs from the one in /etc/cassandra/conf, i.e. the one used by the
source cluster's cassandra, with the following:

* listen_address set to my new interface.
* rpc_address set to my new interface.
* rpc_port set as on the destination cluster (i.e. 9160)
* cluster_name set as on the destination cluster.
* storage_port as on the destination cluster (i.e. 7000)

Given the above I should be able to run sstableloader on the nodes of my
source cluster, even with source cluster cassandra daemon running.

Am I right, or did I miss anything?

Thanks,
\EF
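
As a sketch, the override cassandra.yaml from step 2 could be generated like
this (requires PyYAML); every value below is illustrative and must be replaced
with the destination cluster's settings and the new local interface:

-- sketch --
#!/usr/bin/env python
import os
import yaml

override = {
    "listen_address": "2001:db8::10",     # the new interface (assumption)
    "rpc_address": "2001:db8::10",
    "rpc_port": 9160,                     # as on the destination cluster
    "storage_port": 7000,                 # as on the destination cluster
    "cluster_name": "DestinationCluster", # as on the destination cluster
}

os.makedirs("sstableloader-conf", exist_ok=True)
with open("sstableloader-conf/cassandra.yaml", "w") as f:
    yaml.safe_dump(override, f, default_flow_style=False)
# Then put sstableloader-conf/ ahead of /etc/cassandra/conf on the
# sstableloader classpath so it picks up this file instead.
-- end of sketch --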


LeveledCompaction, streaming bulkload, and lots of small sstables

2014-08-18 Thread Erik Forsberg
Hi!

I'm bulkloading via streaming from Hadoop to my Cassandra cluster. This
results in a rather large set of relatively small (~1MiB) sstables as
the number of mappers that generate sstables on the hadoop cluster is high.

With SizeTieredCompactionStrategy, the cassandra cluster would quickly
compact all these small sstables into decently sized sstables.

With LeveledCompactionStrategy however, it takes a much longer time. I
have multithreaded_compaction: true, but it is only taking on 32
sstables at a time in one single compaction task, so when it starts with
~1500 sstables, it takes quite some time. I'm not running out of I/O.

Is there some configuration knob I can tune to make this happen faster?
I'm getting a bit confused by the description for min_sstable_size,
bucket_high, bucket_low etc - and I'm not sure if they apply in this case.

I'm pondering options for decreasing the number of sstables being
streamed from the hadoop side, but if that is possible remains to be seen.

Thanks!
\EF


Re: LeveledCompaction, streaming bulkload, and lots of small sstables

2014-08-20 Thread Erik Forsberg
On 2014-08-18 19:52, Robert Coli wrote:
> On Mon, Aug 18, 2014 at 6:21 AM, Erik Forsberg <forsb...@opera.com> wrote:
> 
> Is there some configuration knob I can tune to make this happen faster?
> I'm getting a bit confused by the description for min_sstable_size,
> bucket_high, bucket_low etc - and I'm not sure if they apply in this
> case.
> 
> 
> You probably don't want to use multi-threaded compaction, it is removed
> upstream.
> 
> nodetool setcompactionthroughput 0
> 
> Assuming you have enough IO headroom etc.

OK. I disabled multithreaded and gave it a bit more throughput to play
with, but I still don't think that's the full story.

What I see is the following case:

1) My hadoop cluster is bulkloading around 1000 sstables to the
Cassandra cluster.

2) Cassandra will start compacting.

With SizeTiered, I would see multiple ongoing compactions on the CF in
question, each taking on 32 sstables and compacting to one, all of them
running at the same time.

With Leveled, I see only one compaction, taking on 32 sstables
compacting to one. When that finished, it will start another one. So
it's essentially a serial process, and it takes a much longer time than
what it does with SizeTiered. While this compaction is ongoing, read
performance is not very good.

http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
mentions LCS is parallelized in Cassandra 1.2, but maybe that patch
doesn't cover my use case (although I realize that my use case is maybe
a bit weird)

So my question is if this is something I can tune? I'm running 1.2.18
now, but am strongly considering upgrade to 2.0.X.

Regards,
\EF




Running out of disk at bootstrap in low-disk situation

2014-09-20 Thread Erik Forsberg
Hi!

We have unfortunately managed to put ourselves in a situation where we are
really close to full disks on our existing 27 nodes.

We are now trying to add 15 more nodes, but running into problems with out
of disk space on the new nodes while joining.

We're using vnodes, on Cassandra 1.2.18 (yes, I know that's old, and I'll
upgrade as soon as I'm out of this problematic situation).

I've added all the 15 nodes, with some time inbetween - definitely more
than the 2-minute rule. But it seems like compaction is not keeping up with
the incoming data. Or at least that's my theory.

What are the recommended settings to avoid this problem? I have now set
compaction throughput to 0 for unlimited compaction bandwidth, hoping that
will help (will it?)

Will it help to lower the streaming throughput too? I'm unsure about the
latter since from observation it seems that compaction will not start until
it has finished streaming from a node. With 27 nodes sharing the incoming
bandwidth, all of them will take equally long time to finish and then the
compaction can occur. I guess I could limit streaming bandwidth on some of
the source nodes too. Or am I completely wrong here?

Other ideas most welcome.

Regards,
\EF


Restart joining node

2014-09-20 Thread Erik Forsberg
Hi!

On the same subject as before - due to full disk during bootstrap, my
joining nodes are stuck. What's the correct procedure here, will a plain
restart of the node do the right thing, i.e. continue where bootstrap
stopped, or is it better to clean the data directories before new start of
daemon?

Regards,
\EF


Working with legacy data via CQL

2014-11-11 Thread Erik Forsberg
Hi!

I have some data in a table created using thrift. In cassandra-cli, the
'show schema' output for this table is:

create column family Users
  with column_type = 'Standard'
  and comparator = 'AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'LexicalUUIDType'
  and column_metadata = [
{column_name : 'date_created',
validation_class : LongType},
{column_name : 'active',
validation_class : IntegerType,
index_name : 'Users_active_idx_1',
index_type : 0},
{column_name : 'email',
validation_class : UTF8Type,
index_name : 'Users_email_idx_1',
index_type : 0},
{column_name : 'username',
validation_class : UTF8Type,
index_name : 'Users_username_idx_1',
index_type : 0},
{column_name : 'default_account_id',
validation_class : LexicalUUIDType}];

From cqlsh, it looks like this:

[cqlsh 4.1.1 | Cassandra 2.0.11 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh:test> describe table Users;

CREATE TABLE "Users" (
  key 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  column1 ascii,
  active varint,
  date_created bigint,
  default_account_id 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  email text,
  username text,
  value text,
  PRIMARY KEY ((key), column1)
) WITH COMPACT STORAGE;

CREATE INDEX Users_active_idx_12 ON "Users" (active);

CREATE INDEX Users_email_idx_12 ON "Users" (email);

CREATE INDEX Users_username_idx_12 ON "Users" (username);

Now, when I try to extract data from this using cqlsh or the
python-driver, I have no problems getting data for the columns which are
actually UTF8, but for those where column_metadata has been set to
something else, there's trouble. Example using the python driver:

-- snip --

In [8]: u = uuid.UUID("a6b07340-047c-4d4c-9a02-1b59eabf611c")

In [9]: sess.execute('SELECT column1,value from "Users" where key = %s
and column1 = %s', [u, 'username'])
Out[9]: [Row(column1='username', value=u'uc6vf')]

In [10]: sess.execute('SELECT column1,value from "Users" where key = %s
and column1 = %s', [u, 'date_created'])
---
UnicodeDecodeErrorTraceback (most recent call last)
 in ()
> 1 sess.execute('SELECT column1,value from "Users" where key = %s
and column1 = %s', [u, 'date_created'])

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc
in execute(self, query, parameters, timeout, trace)
   1279 future = self.execute_async(query, parameters, trace)
   1280 try:
-> 1281 result = future.result(timeout)
   1282 finally:
   1283 if trace:

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc
in result(self, timeout)
   2742 return PagedResult(self, self._final_result)
   2743 elif self._final_exception:
-> 2744 raise self._final_exception
   2745 else:
   2746 raise OperationTimedOut(errors=self._errors,
last_host=self._current_host)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 6:
unexpected end of data

-- snap --

cqlsh gives me similar errors.

Can I tell the python driver to parse some column values as integers, or
is this an unsupported case?

For sure this is an ugly table, but I have data in it, and I would like
to avoid having to rewrite all my tools at once, so if I could support
it from CQL that would be great.

Regards,
\EF


Re: Working with legacy data via CQL

2014-11-12 Thread Erik Forsberg
On 2014-11-11 19:40, Alex Popescu wrote:
> On Tuesday, November 11, 2014, Erik Forsberg <forsb...@opera.com> wrote:
> 
> You'll have better chances to get an answer about the Python driver on
> its own mailing list:
> https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user

As I said, this also happens when using cqlsh:

cqlsh:test> SELECT column1,value from "Users" where key =
a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created';

 column1  | value
--+--
 date_created | '\x00\x00\x00\x00Ta\xf3\xe0'

(1 rows)

Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value')
as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected
end of data

So let me rephrase: How do I work with data where the table has metadata
that makes some columns differ from the main validation class? From
cqlsh, or the python driver, or any driver?

Thanks,
\EF
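
One client-side workaround, as a sketch: treat the cell as raw bytes and
unpack it yourself. A LongType value is simply a big-endian signed 64-bit
integer; the byte literal below is the value cqlsh printed above.

-- sketch --
#!/usr/bin/env python
import struct
from datetime import datetime

raw = b'\x00\x00\x00\x00Ta\xf3\xe0'      # bytes cqlsh failed to decode as text
(created,) = struct.unpack('>q', raw)    # LongType == big-endian signed int64
print(created, datetime.utcfromtimestamp(created))   # epoch seconds -> UTC
-- end of sketch --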


Re: Working with legacy data via CQL

2014-11-17 Thread Erik Forsberg
On 2014-11-15 01:24, Tyler Hobbs wrote:
> What version of cassandra did you originally create the column family
> in?  Have you made any schema changes to it through cql or
> cassandra-cli, or has it always been exactly the same?

Oh that's a tough question given that the cluster has been around since
2011. So CF was probably created in Cassandra 0.7 or 0.8 via thrift
calls from pycassa, and I don't think there has been any schema changes
to it since.

Thanks,
\EF




Re: Working with legacy data via CQL

2014-11-17 Thread Erik Forsberg
On 2014-11-17 09:56, Erik Forsberg wrote:
> On 2014-11-15 01:24, Tyler Hobbs wrote:
>> What version of cassandra did you originally create the column family
>> in?  Have you made any schema changes to it through cql or
>> cassandra-cli, or has it always been exactly the same?
> 
> Oh that's a tough question given that the cluster has been around since
> 2011. So CF was probably created in Cassandra 0.7 or 0.8 via thrift
> calls from pycassa, and I don't think there has been any schema changes
> to it since.

Actually, I don't think it matters. I created a minimal repeatable set
of Python code (see below). Running that against a 2.0.11 server
(creating a fresh keyspace and CF, inserting some data with
thrift/pycassa, then trying to extract the data that has a different
validation class), the python-driver and cqlsh bail out.

cqlsh example after running the below script:

cqlsh:badcql> select * from "Users" where column1 = 'default_account_id'
ALLOW FILTERING;

value "\xf9\x8bu}!\xe9C\xbb\xa7=\xd0\x8a\xff';\xe5" (in col 'value')
can't be deserialized as text: 'utf8' codec can't decode byte 0xf9 in
position 0: invalid start byte

cqlsh:badcql> select * from "Users" where column1 = 'date_created' ALLOW
FILTERING;

value '\x00\x00\x00\x00Ti\xe0\xbe' (in col 'value') can't be
deserialized as text: 'utf8' codec can't decode bytes in position 6-7:
unexpected end of data


So the question remains - how do I work with this data from cqlsh and /
or the python driver?

Thanks,
\EF

--repeatable example--
#!/usr/bin/env python

# Run this in a virtualenv with pycassa and cassandra-driver installed via pip
import pycassa
import cassandra
import calendar
import traceback
import time
from uuid import uuid4

keyspace = "badcql"

sysmanager = pycassa.system_manager.SystemManager("localhost")
sysmanager.create_keyspace(keyspace,
                           strategy_options={'replication_factor': '1'})
sysmanager.create_column_family(keyspace, "Users",
                                key_validation_class=pycassa.system_manager.LEXICAL_UUID_TYPE,
                                comparator_type=pycassa.system_manager.ASCII_TYPE,
                                default_validation_class=pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "username",
                        pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "email",
                        pycassa.system_manager.UTF8_TYPE)
sysmanager.alter_column(keyspace, "Users", "default_account_id",
                        pycassa.system_manager.LEXICAL_UUID_TYPE)
sysmanager.create_index(keyspace, "Users", "active",
                        pycassa.system_manager.INT_TYPE)
sysmanager.alter_column(keyspace, "Users", "date_created",
                        pycassa.system_manager.LONG_TYPE)

pool = pycassa.pool.ConnectionPool(keyspace, ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, "Users")

user_uuid = uuid4()

cf.insert(user_uuid, {'username': 'test_username', 'auth_method': 'ldap',
                      'email': 't...@example.com', 'active': 1,
                      'date_created': long(calendar.timegm(time.gmtime())),
                      'default_account_id': uuid4()})

from cassandra.cluster import Cluster
cassandra_cluster = Cluster(["localhost"])
cassandra_session = cassandra_cluster.connect(keyspace)
print "username", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'username',))
print "email", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'email',))
try:
    print "default_account_id", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'default_account_id',))
except Exception as e:
    print "Exception trying to get default_account_id", traceback.format_exc()
    cassandra_session = cassandra_cluster.connect(keyspace)

try:
    print "active", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'active',))
except Exception as e:
    print "Exception trying to get active", traceback.format_exc()
    cassandra_session = cassandra_cluster.connect(keyspace)

try:
    print "date_created", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'date_created',))
except Exception as e:
    print "Exception trying to get date_created", traceback.format_exc()
-- end of example --


Re: Working with legacy data via CQL

2014-11-19 Thread Erik Forsberg
On 2014-11-19 01:37, Robert Coli wrote:
> 
> Thanks, I can reproduce the issue with that, and I should be able to
> look into it tomorrow.  FWIW, I believe the issue is server-side,
> not in the driver.  I may be able to suggest a workaround once I
> figure out what's going on.
> 
> 
> Is there a JIRA tracking this issue? I like being aware of potential
> issues with "legacy tables" ... :D

I created one, just for you! :-)

https://issues.apache.org/jira/browse/CASSANDRA-8339

\EF


Anonymous user in permissions system?

2015-02-05 Thread Erik Forsberg
Hi!

Is there such a thing as the anonymous/unauthenticated user in the
cassandra permissions system?

What I would like to do is to grant select, i.e. provide read-only
access, to users which have not presented a username and password.

Then grant update/insert to other users which have presented a username
and (correct) password.

Doable?

Regards,
\EF


changes to metricsReporterConfigFile requires restart of cassandra?

2015-02-11 Thread Erik Forsberg
Hi!

I was pleased to find out that cassandra 2.0.x has added support for
pluggable metrics export, which even includes a graphite metrics sender.

Question: Will changes to the metricsReporterConfigFile require a
restart of cassandra to take effect?

I.e, if I want to add a new exported metric to that file, will I have to
restart my cluster?

Thanks,
\EF


Re: Anonymous user in permissions system?

2015-02-16 Thread Erik Forsberg
On 2015-02-05 12:39, Carlos Rolo wrote:
> Hello Erik,
> 
> It seems possible, refer to the following documentation to see if it
> fits your needs:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/security/secureInternalAuthenticationTOC.html
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/security/secureInternalAuthorizationTOC.html

I think you misunderstand what I want. I want to distinguish between
authenticated users, i.e. users which have provided a username/password,
and anonymous users, i.e. users which haven't.

Reading the source code, the standard PasswordAuthenticator does not
allow this, but by subclassing it I can come up with something that does
what I want, since AuthenticatedUser has some partial support for
the "anonymous" user.

Regards,
\EF


Re: Cluster status instability

2015-04-08 Thread Erik Forsberg
To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems
to stay that way for a very long time (hours). I'm not even sure it will
recover without a restart.
* I've tried to stop then start gossip with nodetool on the node that
thinks several other nodes is down. Did not help.
* nodetool gossipinfo when run on an affected node claims STATUS:NORMAL for
all nodes (including the ones marked as down in status output)
* It is quite possible that the problem starts at the time of day when we
have a lot of bulkloading going on. But why does it then stay for several
hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12
about a month ago, but I have no hard data to back that up.

Regarding region/snitch - this is not an AWS deployment; we run in our own
datacenter with GossipingPropertyFileSnitch.

Right now I have this situation with one node (04-05) thinking that there
are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all
nodes are up. Load on cluster right now is minimal, there's no GC going on.
Heap usage is approximately 3.5/6Gb.

root@cssa04-05:~# nodetool status|grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB    256  1.8%  114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB    256  1.8%  b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256  1.6%  4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB    256  1.8%  95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3

Excerpt from nodetool gossipinfo showing one node that status thinks is
down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made
04-05 mark it as up, with no success.

Please let me know what other debug information I can provide.

Regards,
\EF

On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle wrote:

> Do you happen to be using a tool like Nagios or Ganglia that are able to
> report utilization (CPU, load, disk I/O, network)? There are plugins for
> both that will also notify you (depending on whether you enabled the
> intermediate GC logging) about what is happening.
>
>
>
> On Thu, Apr 2, 2015 at 8:35 AM, Jan  wrote:
>
>> Marcin;
>>
>> are all your nodes within the same Region?
>> If not in the same region, what is the Snitch type that you are using?
>>
>> Jan/
>>
>>
>>
>>   On Thursday, April 2, 2015 3:28 AM, Michal Michalski <michal.michal...@boxever.com> wrote:
>>
>>
>> Hey Marcin,
>>
>> Are they actually going up and down repeatedly (flapping) or just down
>> and they never come back?
>> There might be different reasons for flapping nodes, but to list what I
>> have at the top of my head right now:
>>
>> 1. Network issues. I don't think it's your case, but you can read about
>> the issues some people are having when deploying C* on AWS EC2 (keyword to
>> look for: phi_convict_threshold)
>>
>> 2. Heavy load. Node is under heavy load because of massive number of
>> reads / writes / bulkloads or e.g. unthrottled compaction etc., which may
>> result in extensive GC.
>>
>> Could any of these be a problem in your case? I'd start from
>> investigating GC logs e.g. to see how long does the "stop the world" full
>> GC take (GC logs should be on by default from what I can see [1])
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-5319
>>
>> Michał
>>
>>
>> Kind regards,
>> Michał Michalski,
>> michal.michal...@boxever.com
>>
>> On 2 April 2015 at 11:05, Marcin Pietraszek wrote:
>>
>> Hi!
>>
>> We have 56 node cluster with C* 2.0.13 + CASSANDRA-9036 patch
>> installed. Assume we have nodes A, B, C, D, E. On some irregular basis
>> one of those nodes starts to report that subset of other nodes is in
>> DN state although C* deamon on all nodes is running:
>>
>> A$ nodetool status
>> UN B
>> DN C
>> DN D
>> UN E
>>
>> B$ nodetool status
>> UN A
>> UN C
>> UN D
>> UN E
>>
>> C$ nodetool status
>> DN A
>> UN B
>> UN D
>> UN E
>>
>> After restart of A node, C and D report that A it's in UN and also A
>> claims that whole cluster is in UN state. Right now I don't have any
>> clear steps to

One node misbehaving (lots of GC), ideas?

2015-04-15 Thread Erik Forsberg
Hi!

We having problems with one node (out of 56 in total) misbehaving.
Symptoms are:

* High number of full CMS old space collections during early morning
when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of Garbage Collections *seems* to correspond to doing
a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
small ones)
* Node losing track of what other nodes are up and keeping that state
until restart (this I think is a bug caused by the GC behaviour, with
the stop-the-world pauses making the node not accept gossip connections from
other nodes)

This is on 2.0.13 with vnodes (256 per node).

All other nodes show normal behaviour, with a few (2-3) full CMS old
space collections in the same 3h period in which the trouble node does some
30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the
problem was even worse (it seems; this is a bit hard to debug as it
happens *almost* every night).

nodetool status shows that although we have a certain imbalance in the
cluster, this node is neither the most nor the least loaded. I.e. we
have between 1.6% and 2.1% in the "Owns" column, and the troublesome
node reports 1.7%.

All nodes are under puppet control, so configuration is the same
everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings - we have slightly varying numbers of
nodes in the racks:

 15 cssa01
 15 cssa02
 13 cssa03
 13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some
kind of hotspot situation? Why would that show up as more GC work?

I'm quite puzzled here, so I'm looking for hints on how to identify what
is causing this.

Regards,
\EF






Re: CentOS - Could not setup cluster(snappy error)

2013-12-30 Thread Erik Forkalsrud


You can add something like this to cassandra-env.sh :

JVM_OPTS="$JVM_OPTS 
-Dorg.xerial.snappy.tempdir=/path/that/allows/executables"



- Erik -
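
A quick way to confirm the noexec condition Edward mentions below, as a sketch
(assumes Linux and that snappy unpacks its native library to /tmp):

-- sketch --
#!/usr/bin/env python
# Reports whether /tmp is mounted noexec, which stops snappy from mapping
# its unpacked native library.
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if mountpoint == "/tmp" and "noexec" in options.split(","):
            print("/tmp is mounted noexec; native libs can't be mapped there")
-- end of sketch --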


On 12/28/2013 08:36 AM, Edward Capriolo wrote:
Check your fstabs settings. On some systems /tmp has noexec set and 
unpacking a library into temp and trying to run it does not work.



On Fri, Dec 27, 2013 at 5:33 PM, Víctor Hugo Oliveira Molinar <vhmoli...@gmail.com> wrote:


Hi, I'm not able to start a multiple node cluster in a
CentOS environment due to a snappy loading error.

Here is my current setup for both machines (Nodes 1 and 2):
CentOS:
   CentOS release 6.5 (Final)

Java:
   java version "1.7.0_25"
   Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
   Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

Cassandra version:
   2.0.3

Also, I've already replaced the current snappy
jar (snappy-java-1.0.5.jar) with the older one (snappy-java-1.0.4.1.jar).
The following error is still happening when I try to
start the second node:


 INFO 20:25:51,879 Handshaking version with /200.219.219.51
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:312)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:359)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150)
Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.4.1-libsnappyjava.so: /tmp/snappy-1.0.4.1-libsnappyjava.so: failed to map segment from shared object: Operation not permitted
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843)
at java.lang.Runtime.load0(Runtime.java:795)
at java.lang.System.load(System.java:1061)
at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)
... 11 more
ERROR 20:25:52,201 Exception in thread Thread[WRITE-/200.219.219.51,5,main]
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:359)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150)
ERROR 20:26:22,924 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1160)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:416)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:608)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:576)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:475)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)




What else can I do to fix it?
Att,
Víctor Hugo Molinar






Re: Cassandra consuming too much memory in ubuntu as compared to within windows, same machine.

2014-01-06 Thread Erik Forkalsud

On 01/04/2014 08:04 AM, Ertio Lew wrote:

...  my dual boot 4GB(RAM) machine.

...  -Xms4G -Xmx4G -



You are allocating all your RAM to the Java heap.  Are you using the 
same JVM parameters on the Windows side?  You can try to lower the heap 
size or add RAM to your machine.




- Erik -





EOFException in bulkloader, then IllegalStateException

2014-01-27 Thread Erik Forsberg

Hi!

I'm bulkloading from Hadoop to Cassandra. Currently in the process of 
moving to new hardware for both Hadoop and Cassandra, and while 
testrunning bulkload, I see the following error:


Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.RuntimeException: java.io.EOFException
at com.google.common.base.Throwables.propagate(Throwables.java:155)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193)
at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
... 3 more


I see no exceptions related to this on the destination node 
(2001:4c28:1:413:0:1:1:12:1).


This makes the whole map task fail with:

2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:forsberg (auth:SIMPLE) cause:java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244)
at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

The failed task was on hadoop worker node hdp01-12-4.

However, hadoop later retries this map task on a different hadoop worker node 
(hdp01-10-2), and that retry succeeds.

So that's weird, but I could live with it. Now, however, comes the real trouble 
- the hadoop job does not finish due to one task running on hdp01-12-4 being 
stuck with this:

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.IllegalStateException: target reports current file is /opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db but is /opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_00_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154)
at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45)
at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199)
at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

This just sits there forever, or at least until the hadoop task timeout kicks 
in.

So two questions here:

1) Any clues on what might cause the first EOFException? It seems to appear for 
*some* of my bulkloads. Not all, but frequent enough to be a problem. Like, 
every 10th bulkload I do seems to have the problem.

2) The second problem I have a feeling could be related to 
https://issues.apache.org/jira/browse/CASSANDRA-4223, but with the extra quirk 
that with the bulkload case, we have *multiple java processes* creating 
streaming sessions on the same host, so streaming session IDs are not unique.

I'm thinking 2) happens because the EOFException made the streaming session in 
1) sit around on the target node without being closed.

This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid 
upgrading until I have made this migration from old to new hardware. Upgrading 
to 1.2.13 might be an option.

Re: EOFException in bulkloader, then IllegalStateException

2014-01-27 Thread Erik Forsberg

On 2014-01-27 12:56, Erik Forsberg wrote:
This is on Cassandra 1.2.1. I know that's pretty old, but I would like 
to avoid upgrading until I have made this migration from old to new 
hardware. Upgrading to 1.2.13 might be an option.


Update: Exactly the same behaviour on Cassandra 1.2.13.

Thanks,
\EF


Multiple large disks in server - setup considerations

2011-05-31 Thread Erik Forsberg
Hi!

I'm considering setting up a small (4-6 nodes) Cassandra cluster on
machines that each have 3x2TB disks. There's no hardware RAID in the
machine, and if there were, it could only stripe single disks
together, not parts of disks. 

I'm planning RF=2 (or higher).

I'm pondering what the best disk configuration is. Two alternatives:

1) Make small partition on first disk for Linux installation and commit
   log. Use Linux' software RAID0 to stripe the remaining space on disk1
   + the two remaining disks into one large XFS partition.

2) Make small partition on first disk for Linux installation and commit
   log. Mount rest of disk 1 as /var/cassandra1, then disk2
   as /var/cassandra2 and disk3 as /var/cassandra3.

Is it unwise to put the commit log on the same physical disk as some of
the data? I guess it could impact write performance, but maybe it's bad
from a data consistency point of view? 

How does Cassandra handle replacement of a bad disk in the two
alternatives? With option 1) I guess there's risk of files being
corrupt. With option 2) they will simply be missing after replacing the
disk with a new one.

With option 2) I guess I'm limiting the size of the total amount of
data in the largest CF at compaction to, hmm.. the free space on the
disk with most free space, correct? 

Comments welcome!

Thanks,
\EF
-- 
Erik Forsberg 
Developer, Opera Software - http://www.opera.com/ 


Re: Multiple large disks in server - setup considerations

2011-06-07 Thread Erik Forsberg
On Tue, 31 May 2011 13:23:36 -0500
Jonathan Ellis  wrote:

> Have you read http://wiki.apache.org/cassandra/CassandraHardware ?

I had, but it was a while ago so I guess I kind of deserved an RTFM! :-)

After re-reading it, I still want to know:

* If we disregard the performance hit caused by having the commitlog on
  the same physical device as parts of the data, are there any other
  grave effects on Cassandra's functionality with a setup like that?

* How does Cassandra handle a case where one of the disks in a striped
  RAID0 partition goes bad and is replaced? Is the only option to wipe
  everything from that node and reinit the node, or will it handle
  corrupt files? I.e, what's the recommended thing to do from an
  operations point of view when a disk dies on one of the nodes in a
  RAID0 Cassandra setup? What will cause the least risk for data loss?
  What will be the fastest way to get the node up to speed with the
  rest of the cluster?

Thanks,
\EF



> 
> On Tue, May 31, 2011 at 7:47 AM, Erik Forsberg 
> wrote:
> > Hi!
> >
> > I'm considering setting up a small (4-6 nodes) Cassandra cluster on
> > machines that each have 3x2TB disks. There's no hardware RAID in the
> > machine, and if there were, it could only stripe single disks
> > together, not parts of disks.
> >
> > I'm planning RF=2 (or higher).
> >
> > I'm pondering what the best disk configuration is. Two alternatives:
> >
> > 1) Make small partition on first disk for Linux installation and
> > commit log. Use Linux' software RAID0 to stripe the remaining space
> > on disk1
> >   + the two remaining disks into one large XFS partition.
> >
> > 2) Make small partition on first disk for Linux installation and
> > commit log. Mount rest of disk 1 as /var/cassandra1, then disk2
> >   as /var/cassandra2 and disk3 as /var/cassandra3.
> >
> > Is it unwise to put the commit log on the same physical disk as
> > some of the data? I guess it could impact write performance, but
> > maybe it's bad from a data consistency point of view?
> >
> > How does Cassandra handle replacement of a bad disk in the two
> > alternatives? With option 1) I guess there's risk of files being
> > corrupt. With option 2) they will simply be missing after replacing
> > the disk with a new one.
> >
> > With option 2) I guess I'm limiting the size of the total amount of
> > data in the largest CF at compaction to, hmm.. the free space on the
> > disk with most free space, correct?
> >
> > Comments welcome!
> >
> > Thanks,
> > \EF
> > --
> > Erik Forsberg 
> > Developer, Opera Software - http://www.opera.com/
> >
> 
> 
> 


-- 
Erik Forsberg 
Developer, Opera Software - http://www.opera.com/


Re: 0.7.9 RejectedExecutionException

2011-10-12 Thread Erik Forkalsrud

On 10/12/2011 11:33 AM, Ashley Martens wrote:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.10.2)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)


This may have been mentioned before, but is it an option to use the 
Sun/Oracle JDK?



- Erik -



Re: 0.7.9 RejectedExecutionException

2011-10-12 Thread Erik Forkalsrud


My suggestion would be to put a recent Sun JVM on the problematic node 
and see if that eliminates the crashes.


The Sun JVM appears to be the mainstream choice when running Cassandra, 
so that's a more well tested configuration. You can search the list 
archives for OpenJDK related bugs to see that they do exist.  For 
example:  
http://mail-archives.apache.org/mod_mbox/cassandra-user/201107.mbox/%3CCAB-=z42ihihr8svhdrbvpfyjjysspckes1zoxkvcje9axkd...@mail.gmail.com%3E



- Erik -


On 10/12/2011 12:41 PM, Ashley Martens wrote:
I guess it could be an option but I can't puppet the Oracle JDK 
install so I would rather not.


On Wed, Oct 12, 2011 at 12:35 PM, Erik Forkalsrud  wrote:


On 10/12/2011 11:33 AM, Ashley Martens wrote:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.9)
(6b20-1.9.9-0ubuntu1~10.10.2)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)


This may have been mentioned before, but is it an option to use
    the Sun/Oracle JDK?


- Erik -




--
gpg --keyserver pgpkeys.mit.edu --recv-keys 0x23e861255b0d6abb

Key fingerprint = 0E9E 0E22 3957 BB04 DD72  B093 23E8 6125 5B0D 6ABB





Re: Updates Question

2011-10-24 Thread Erik Forkalsrud

On 10/24/2011 05:23 AM, Sam Hodgson wrote:
I can write new columns/rows etc no problem however when updating I 
get no errors back and the update does not occur.  I've also tried 
updating manually using the Cassandra cluster admin gui and the same 
thing happens. I can write but updates do not work, the gui even 
reports a success but the data remains unchanged.


A basic thing to check is that the updates have a higher timestamp than 
the initial write.  I was scratching my head with a similar problem a 
while back, and it turned out that cassandra-cli used java's 
System.currentTimeMillis() * 1000 for timestamps while the application 
code used just System.currentTimeMillis() and those values would always 
be lower.
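
As a toy illustration of the mismatch (class and variable names are mine,
not from the original problem):

public class TimestampUnits {
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();  // what the application used
        long micros = millis * 1000L;              // what cassandra-cli uses
        // A write stamped with `millis` always loses to an existing column
        // stamped with `micros`, so the update silently has no effect even
        // though the server reports success.
        System.out.println("millis=" + millis + " micros=" + micros);
    }
}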



- Erik -



Re: Internal error processing get: Null pointer exception

2011-11-02 Thread Erik Forkalsrud

On 11/01/2011 09:02 PM, Jonathan Ellis wrote:

That doesn't make sense to me.  CS:147 is

 columnFamilyKeyMap.put(row.key, row.cf);

where cFKM is

 Map<DecoratedKey, ColumnFamily> columnFamilyKeyMap = new
HashMap<DecoratedKey, ColumnFamily>();

So cFKM can't be null, and HashMap accommodates both null key and null
value, so I'm not sure what there is to throw NPE.


it must be  row == null


Can I use BulkOutputFormat from 1.1 to load data to older Cassandra versions?

2012-01-08 Thread Erik Forsberg

Hi!

Can the new BulkOutputFormat 
(https://issues.apache.org/jira/browse/CASSANDRA-3045) be used to load 
data to servers running cassandra 0.8.7 and/or Cassandra 1.0.6?


I'm thinking of using jar files from the development version to load 
data onto a production cluster which I want to keep on a production 
version of Cassandra. Can I do that, or does BulkOutputFormat require an 
API level that is only in the development version of Cassandra?


Thanks,
\EF


Recommended configuration for good streaming performance?

2012-02-02 Thread Erik Forsberg

Hi!

We're experimenting with streaming from Hadoop to Cassandra using 
BulkoutputFormat, on cassandra-1.1 branch.


Are there any specific settings we should tune on the Cassandra servers 
in order to get the best streaming performance?


Our Cassandra hardware are 16 core (including HT cores) with 24GiB of 
RAM. They have two disks each. So far we've configured them with 
commitlog on one disk and sstables on the other, but with streaming not 
using commitlog (correct?) maybe it makes sense to have sstables on both 
disks, doubling available I/O?


Thoughts on number of parallel streaming clients?

Thanks,
\EF


Streaming sessions from BulkOutputFormat job being listed long after they were killed

2012-02-17 Thread Erik Forsberg

Hi!

If I run a hadoop job that uses BulkOutputFormat to write data to 
Cassandra, and that hadoop job is aborted, i.e. streaming sessions are 
not completed, it seems like the streaming sessions hang around for a 
very long time, I've observed at least 12-15h, in output from 'nodetool 
netstats'.


To me it seems like they go away only after a restart of Cassandra.

Is this a known behaviour? Does it cause any problems, f. ex. consuming 
memory, or should I just ignore it?


Regards,
\EF



Max TTL?

2012-02-20 Thread Erik Forsberg

Hi!

When setting ttl on columns, is there a maximum value (other than 
MAXINT, 2**31-1) that can be used?


I have a very odd behaviour here, where I try to set ttl to  9 622 973 
(~111 days) which works, but setting it to 11 824 305 (~137 days) does 
not - it seems columns are deleted instantly at insertion.


This is using the BulkOutputFormat. And it could be a problem with our 
code, i.e. the code using BulkOutputFormat. So, uhm, just asking to see 
if we're hitting something obvious.


Regards,
\EF


Re: Max TTL?

2012-02-21 Thread Erik Forsberg

On 2012-02-20 21:20, aaron morton wrote:

Nothing obvious.


Samarth (working on same project) found that his patch to CASSANDRA-3754 
was cleaned up a bit too much, which caused a negative ttl.


https://issues.apache.org/jira/browse/CASSANDRA-3754?focusedCommentId=13212395&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13212395

So problem found.
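
For intuition, a toy sketch of that failure mode (not the actual Cassandra
code; the localExpirationTime name is only borrowed loosely from the
expiring-column logic):

public class NegativeTtl {
    public static void main(String[] args) {
        int ttl = -42;  // what the over-cleaned patch effectively produced
        int now = (int) (System.currentTimeMillis() / 1000L);
        int localExpirationTime = now + ttl;  // already in the past
        // A column whose expiration time is at or behind "now" is treated
        // as expired the moment it is inserted.
        System.out.println("dead on arrival: " + (localExpirationTime <= now));
    }
}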

Regards,
\EF


On Bloom filters and Key Cache

2012-03-21 Thread Erik Forsberg

Hi!

We're currently testing Cassandra with a large number of row keys per 
node - nodetool cfstats approximated number of keys to something like 
700M per node. This seems to have caused a very large heap consumption.


After reading 
http://wiki.apache.org/cassandra/LargeDataSetConsiderations I think I've 
tracked this down to the bloom filter, and the sampled index entries.


Regarding bloom filters, have I understood correctly that they are 
stored on Heap, and that the "Bloom Filter Space Used" reported by 
'nodetool cfstats' is an approximation of the heap space used by bloom 
filters? It reports the on-disk size, but if I understand 
CASSANDRA-3497, the on-disk size is smaller than the on-Heap size?
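
For a rough sense of scale (standard Bloom filter arithmetic, not measured 
numbers): an optimal filter needs about m = -n*ln(p)/(ln 2)^2 bits, so with 
n = 700M keys and p = 0.01 that's roughly 700e6 * 9.6 ≈ 6.7e9 bits, i.e. on 
the order of 800MB of filter per node before any other heap use.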


I understand that increasing bloom_filter_fp_chance will decrease the 
bloom filter size, but at the cost of worse performance when asking for 
keys that don't exist. I do have a fair amount of queries for keys that 
don't exist.


How much will increasing the key cache help, i.e. decrease bloom filter 
size but increase key cache size? Will the key cache cache negative 
results, i.e. the fact that a key didn't exist?


Regards,
\EF


sstable size increase at compaction

2012-03-21 Thread Erik Forsberg

Hi!

We're using the bulkloader to load data to Cassandra. During and after 
bulkloading, the minor compaction process seems to result in larger 
sstables being created. An example:


 INFO [CompactionExecutor:105] 2012-03-21 15:18:46,608 
CompactionTask.java (line 115) Compacting 
[SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1755-Data.db'), 
(REMOVED A BUNCH OF OTHER SSTABLE PATHS), 
SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1749-Data.db'), 
SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1753-Data.db')]

 INFO [CompactionExecutor:105] 2012-03-21 15:30:04,188 
CompactionTask.java (line 226) Compacted to 
[/cassandra/OSP5/Data/OSP5-Data-hc-3270-Data.db,].  84,214,484 to 
105,498,673 (~125% of original) bytes for 2,132,056 keys at 
0.148486MB/s.  Time: 677,580ms.


The sstables are compressed (DeflateCompressor with chunk size 128) on 
the Hadoop cluster before being transferred to Cassandra, and the CF has 
the same compression settings:


[default@Keyspace1] describe Data;
ColumnFamily: Data (Super)
  Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Default column value validator: 
org.apache.cassandra.db.marshal.LongType
  Columns sorted by: 
org.apache.cassandra.db.marshal.LongType/org.apache.cassandra.db.marshal.UTF8Type

  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  DC Local Read repair chance: 0.0
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: 0.01
  Built indexes: []
  Compaction Strategy: 
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

  Compression Options:
chunk_length_kb: 128
sstable_compression: 
org.apache.cassandra.io.compress.DeflateCompressor


Any clues on this?

Regards,
\EF


Re: sstable size increase at compaction

2012-03-21 Thread Erik Forsberg

On 2012-03-21 16:36, Erik Forsberg wrote:

Hi!

We're using the bulkloader to load data to Cassandra. During and after
bulkloading, the minor compaction process seems to result in larger
sstables being created. An example:


This is on Cassandra 1.1, btw.

\EF


Graveyard compactions, when do they occur?

2012-03-27 Thread Erik Forsberg

Hi!

I was trying out the "truncate" command in cassandra-cli.

http://wiki.apache.org/cassandra/CassandraCli08 says "A snapshot of the 
data is created, which is deleted asyncronously during a 'graveyard' 
compaction."


When do "graveyard" compactions happen? Do I have to trigger them somehow?

Thanks,
\EF


Re: blob fields, bynary or hexa?

2012-04-18 Thread Erik Forkalsud

On 04/18/2012 03:02 AM, mdione@orange.com wrote:

   We're building a database to store the avatars for our users in three
sizes. We planned to use the blob field with a BytesType validator, but if
we try to insert the binary data as read from the image file, we get an
error.


Which client are you using?  With Hector or straight thrift, you should 
be able to store byte[] directly.
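
For example, a minimal Hector sketch (the cluster, keyspace, column family
and key names are made up for illustration):

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class AvatarStore {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("test", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("Avatars", cluster);
        byte[] image = new byte[] { (byte) 0x89, 0x50, 0x4e, 0x47 };  // raw bytes
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        // Store the bytes directly; no hex or base64 encoding needed.
        mutator.insert("user-42", "avatars_small",
                HFactory.createColumn("image", image,
                        StringSerializer.get(), BytesArraySerializer.get()));
    }
}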



- Erik -


Re: How often to run `nodetool repair`

2013-08-01 Thread Erik Forkalsud

On 08/01/2013 01:16 PM, Andrey Ilinykh wrote:


TTL is effectively DELETE; you need to run a repair once every
gc_grace_seconds. If you don't, data might un-delete itself.


How is it possible? Every replica has the TTL, so when it expires every 
replica has a tombstone. I don't see how you can get data with no 
tombstone. What am I missing?




The only way I can think of is this scenario:

   - value "A" for some key is written with ttl=30days, to all 
replicas   (i.e a long ttl or no ttl at all)
   - value "B" for the same key is written with ttl=1day, but doesn't 
reach all replicas

   - one day passes and the ttl=1day values turn into deletes
   - gc_grace passes and the tombstones are purged

at this point, the replica that didn't get the ttl=1day value will think 
the older value "A" is live.


I'm no expert on this so I may be mistaken, but in any case it's a 
corner case as overwriting columns with shorter ttls would be unusual.



- Erik -



Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5

2011-01-13 Thread Erik Onnen
but that doesn't really explain a two minute pause during garbage
collection. My guess then, is that one or more of the following are at play
in this scenario:

1) a core nehalem bug - the nehalem architecture made a lot of changes to
the way it manages TLBs for memory, largely as a virtualization optimization.
I doubt this is the case but assuming the guest isn't seeing a different
architecture, we did see this issue only on E5507 processors.
2) the physical servers in the new AZ are drastically overcommitted -  maybe
AMZN bought into the notion that Nehalems are better at virtualization and
is allowing more guests to run on physical servers running Nehalems. I've no
idea, just a hypothesis.
3) a hypervisor bug - I've deployed large JVMs to big physical Nehalem boxen
running Cassandra clusters under high load and never seen behavior like the
above. If I could see more of what the hypervisor was doing I'd have a
pretty good idea here, but such is life in the cloud.



I also should say that I don't think any issues we had were at all related
specifically to Cassandra. We were running fine in the first AZ, no problems
other than needing to grow capacity. Only when we saw the different
architecture in the new EC2 AZ did we experience problems and when we
shackled the new generation collector, the bad problems went away.

Sorry for the long tirade. This was originally going to be a blog post but I
though it would have more value in context here. I hope ultimately it helps
someone else.
-erik


On Thu, Jan 13, 2011 at 5:26 PM, Mike Malone  wrote:

> Hey folks,
>
> We've discovered an issue on Ubuntu/Lenny with libc6 2.11.1-0ubuntu7.5 (it
> may also affect versions between 2.11.1-0ubuntu7.1 and 2.11.1-0ubuntu7.4).
> The bug affects systems when a large number of threads (or processes) are
> created rapidly. Once triggered, the system will become completely
> unresponsive for ten to fifteen minutes. We've seen this issue on our
> production Cassandra clusters under high load. Cassandra seems particularly
> susceptible to this issue because of the large thread pools that it creates.
> In particular, we suspect the unbounded thread pool for connection
> management may be pushing some systems over the edge.
>
> We're still trying to narrow down what changed in libc that is causing this
> issue. We also haven't tested things outside of xen, or on non-x86
> architectures. But if you're seeing these symptoms, you may want to try
> upgrading libc6.
>
> I'll send out an update if we find anything else interesting. If anyone has
> any thoughts as to what the cause is, we're all ears!
>
> Hope this saves someone some heart-ache,
>
> Mike
>


Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5

2011-01-13 Thread Erik Onnen
Forgot one critical point: we don't use swap on any of these hosts.


Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5

2011-01-13 Thread Erik Onnen
Too similar to be a coincidence I'd say:

Good node (old AZ):  2.11.1-0ubuntu7.5
Bad node (new AZ): 2.11.1-0ubuntu7.6

You beat me to the punch with the test program. I was working on something
similar to test it out and got side tracked.

I'll try the test app tomorrow and verify the versions of the AMIs used for
provisioning.

On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone  wrote:

> Erik, the scenario you're describing is almost identical to what we've been
> experiencing. Sounds like you've been pulling your hair out too! You're also
> running the same distro and kernel as us. And we also run without swap.
> Which begs the question... what version of libc6 are you running!? Here's
> the output from one of our upgraded boxes:
>
> $ dpkg --list | grep libc6
> ii  libc62.11.1-0ubuntu7.7
> Embedded GNU C Library: Shared libraries
> ii  libc6-dev2.11.1-0ubuntu7.7
> Embedded GNU C Library: Development Librarie
>
> Before upgrading the version field showed 2.11.1-0ubuntu7.5. Wondering what
> yours is.
>
> We also found ourselves in a similar situation with different regions.
> We're using the canonical ubuntu ami as the base for our systems. But there
> appear to be small differences between the packages included in the amis
> from different regions. Seems libc6 is one of the things that changed. I
> discovered by diff'ing `dpkg --list` on a node that was good, and one that
> was bad.
>
> The architecture hypothesis is also very interesting. If we could reproduce
> the bug with the latest libc6 build I'd escalate it back up to Amazon. But I
> can't repro it, so nothing to escalate.
>
> For what it's worth, we were able to reproduce the lockup behavior that
> you're describing by running a tight loop that spawns threads. Here's a gist
> of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd be
> interested to know whether that locks things up on your system with a new
> libc6.
>
> Mike
>
> On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen  wrote:
>
>> May or may not be related but I thought I'd recount a similar experience
>> we had in EC2 in hopes it helps someone else.
>>
>> As background, we had been running several servers in a 0.6.8 ring with no
>> Cassandra issues (some EC2 issues, but none related to Cassandra) on
>> multiple EC2 XL instances in a single availability zone. We decided to add
>> several other nodes to a second AZ for reasons beyond the scope of this
>> email. As we reached steady operational state in the new AZ, we noticed that
>> the new nodes in the new AZ were repeatedly getting dropped from the ring.
>> At first we attributed the drops to phi and expected cross-AZ latency. As we
>> tried to pinpoint the issue, we found something very similar to what you
>> describe - the EC2 VMs in the new AZ would become completely unresponsive.
>> Not just the Java process hosting Cassandra, but the entire host. Shell
>> commands would not execute for existing sessions, we could not establish new
>> SSH sessions and tails we had on active files wouldn't show any progress. It
>> appeared as if the machines in the new AZ would seize for several minutes,
>> then come back to life with little rhyme or reason as to why. Tickets opened
>> with AMZN resulted in responses of "the physical server looks normal".
>>
>> After digging deeper, here's what we found. To confirm all nodes in both
>> AZs were identical at the following levels:
>> * Kernel (2.6.32-305-ec2 #9-Ubuntu SMP) distro (Ubuntu 10.04.1 LTS) and
>> glibc on x86_64
>> * All nodes were running identical Java distributions that we deployed
>> ourselves, sun 1.6.0_22-b04
>> * Same amount of virtualized RAM visible to the guest, same RAID stripe
>> configuration across the same size/number of ephemeral drives
>>
>> We noticed two things that were different across the VMs in the two AZs:
>> * The class of CPU exposed to the guest OSes across the two AZs (and
>> presumably the same physical server above that guest).
>> ** On hosts in the AZ not having issues, we see from the guest older
>> Harpertown class Intel CPUs: "model name  : Intel(R) Xeon(R) CPU
>>   E5430  @ 2.66GHz"
>> ** On hosts in the AZ having issues, we see from the guest newer Nehalem
>> class Intel CPUs: "model name  : Intel(R) Xeon(R) CPU   E5507
>>  @ 2.27GHz"
>> * Percent steal was consistently higher on the new nodes, on average 25%
>> whereas the older (stable) VMs were around 9% at peak load
>>
>> Consist

Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5

2011-01-17 Thread Erik Onnen
Unfortunately, the previous AMI we used to provision the 7.5 version is no
longer available. More unfortunately, the two test nodes we spun up in each
AZ did not get Nehalem architectures so the only things I can say for
certain after running Mike's test 10x on each test node are:

1) I could not repro using Mike's test in either AZ. We had previously only
seen failures in one AZ.
2) I could not repro using Mike's test against libc7.7, the only version of
libc we could provision with a nominal amount of effort.
3) 1&2 above were only tested on older Harpertown cores, an architecture
where we've never seen the issue.

I think the only way I'll be able to further test is to wait until we
shuffle around our production configuration and I can abuse the existing
guests without taking a node out of my ring.

The only silver lining so far is that we've realized we need to better lock
down how we source AMIs for our EC2 nodes. That will give us a more reliable
system, but apparently we have no control over the actual architecture that
AMZN lets us have.

-erik

On Fri, Jan 14, 2011 at 10:09 AM, Mike Malone  wrote:

> That's interesting. For us, the 7.5 version of libc was causing problems.
> Either way, I'm looking forward to hearing about anything you find.
>
> Mike
>
>
> On Thu, Jan 13, 2011 at 11:47 PM, Erik Onnen  wrote:
>
>> Too similar to be a coincidence I'd say:
>>
>> Good node (old AZ):  2.11.1-0ubuntu7.5
>> Bad node (new AZ): 2.11.1-0ubuntu7.6
>>
>> You beat me to the punch with the test program. I was working on something
>> similar to test it out and got side tracked.
>>
>> I'll try the test app tomorrow and verify the versions of the AMIs used
>> for provisioning.
>>
>> On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone  wrote:
>>
>>> Erik, the scenario you're describing is almost identical to what we've
>>> been experiencing. Sounds like you've been pulling your hair out too! You're
>>> also running the same distro and kernel as us. And we also run without swap.
>>> Which begs the question... what version of libc6 are you running!? Here's
>>> the output from one of our upgraded boxes:
>>>
>>> $ dpkg --list | grep libc6
>>> ii  libc62.11.1-0ubuntu7.7
>>>   Embedded GNU C Library: Shared libraries
>>> ii  libc6-dev2.11.1-0ubuntu7.7
>>>   Embedded GNU C Library: Development Librarie
>>>
>>> Before upgrading the version field showed 2.11.1-0ubuntu7.5. Wondering
>>> what yours is.
>>>
>>> We also found ourselves in a similar situation with different regions.
>>> We're using the canonical ubuntu ami as the base for our systems. But there
>>> appear to be small differences between the packages included in the amis
>>> from different regions. Seems libc6 is one of the things that changed. I
>>> discovered by diff'ing `dpkg --list` on a node that was good, and one that
>>> was bad.
>>>
>>> The architecture hypothesis is also very interesting. If we could
>>> reproduce the bug with the latest libc6 build I'd escalate it back up to
>>> Amazon. But I can't repro it, so nothing to escalate.
>>>
>>> For what it's worth, we were able to reproduce the lockup behavior that
>>> you're describing by running a tight loop that spawns threads. Here's a gist
>>> of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd
>>> be interested to know whether that locks things up on your system with a new
>>> libc6.
>>>
>>> Mike
>>>
>>> On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen  wrote:
>>>
>>>> May or may not be related but I thought I'd recount a similar experience
>>>> we had in EC2 in hopes it helps someone else.
>>>>
>>>> As background, we had been running several servers in a 0.6.8 ring with
>>>> no Cassandra issues (some EC2 issues, but none related to Cassandra) on
>>>> multiple EC2 XL instances in a single availability zone. We decided to add
>>>> several other nodes to a second AZ for reasons beyond the scope of this
>>>> email. As we reached steady operational state in the new AZ, we noticed 
>>>> that
>>>> the new nodes in the new AZ were repeatedly getting dropped from the ring.
>>>> At first we attributed the drops to phi and expected cross-AZ latency. As 
>>>> we
>>>> tried to pinpoint the issue, we found something very similar to wha

Re: Cassandra taking 100% cpu over multiple cores(0.7.0)

2011-01-23 Thread Erik Onnen
One of the developers will have to confirm but this looks like a bug
to me. MessagingService is a singleton and there's a Multimap used for
targets that isn't accessed in a thread safe manner.

The thread dump would seem to confirm this, when you hammer what is
ultimately a standard HashMap with multiple readers while a write
happens, there's a race that can be triggered where all subsequent
reads on that map will get stuck.

Your thread dump would seem to indicate that is happening with many
(>50) threads trying to read from what appears to be the same HashMap.

"pool-1-thread-6451" prio=10 tid=0x7fa5242c9000 nid=0x10f4
runnable [0x7fa52fde4000]
   java.lang.Thread.State: RUNNABLE
at java.util.HashMap.get(HashMap.java:303)
at com.google.common.collect.AbstractMultimap.getOrCreateCollection(AbstractMultimap.java:205)
at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:194)
at com.google.common.collect.AbstractListMultimap.put(AbstractListMultimap.java:72)
at com.google.common.collect.ArrayListMultimap.put(ArrayListMultimap.java:60)
at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:303)
at org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:353)
at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:229)
at org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98)
at org.apache.cassandra.thrift.CassandraServer.get(CassandraServer.java:289)
at org.apache.cassandra.thrift.Cassandra$Processor$get.process(Cassandra.java:2655)
at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

When two threads get to this state, they won't ever come back and any
other thread that tries to read that HashMap will also get wedged
(looks like that's happened in your case).
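
A minimal sketch (not Cassandra code) of that hazard: unsynchronized
HashMap mutation from multiple threads can corrupt a bucket chain during
a resize, after which get() may spin forever at 100% CPU.

import java.util.HashMap;
import java.util.Map;

public class HashMapRaceDemo {
    private static final Map<Integer, Integer> map = new HashMap<Integer, Integer>();

    public static void main(String[] args) {
        // Several unsynchronized writers racing on the same map; a racy
        // resize can leave a cyclic bucket chain behind.
        for (int t = 0; t < 8; t++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 1000000; i++) {
                        map.put(i, i);  // unsafe concurrent mutation
                        map.get(i);     // may never return once corrupted
                    }
                }
            }).start();
        }
    }
}

The usual remedy is a thread-safe structure, e.g. wrapping the Multimap
with Multimaps.synchronizedMultimap or backing it with a concurrent map.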

On Sun, Jan 23, 2011 at 10:11 AM, Thibaut Britz
 wrote:
>
> Hi,
>
> After a while some of my cassandra nodes (out of a pool of 20) get
> unresponsive. All cores (8) are being used to 100% and uptime load
> climbs to 80.
>
> I first suspected the garbage collector to force the node to 100% cpu
> usage, but the problem even occurs with the default garbage
> collector. A nodetool info also reveals that the heap isn't even
> completely used.
>
> The node has JNA enabled. There are no ERROR message in the logfile
> (except CASSANDRA-1959 which happened a few days before. But I
> restarted the node afterwards). Cassandra runs on non virtualized
> hardware. (e.g. no amazon ec2). The version I use is 0.7.0. I haven't
> applied any patches (e.g. CASSANDRA-1959).
>
> I attached the jstack of the process. (there are many many threads)
>
> Thanks,
> Thibaut
>
>
> ./nodetool -h localhost info
> ccc
> Load             : 1.66 GB
> Generation No    : 1295609201
> Uptime (seconds) : 196155
> Heap Memory (MB) : 1795.55 / 3912.06
>
>
> root      2870  730 57.4 6764500 4703712 ?     SLl  Jan21 23913:00
> /usr/bin/java -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
> -Xms3997M -Xmx3997M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseLargePages
> -XX:+UseCompressedOops -Xss128k -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true
> -Dcom.sun.management.jmxremote.port=8080
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dlog4j.configuration=log4j-server.properties -cp
> /software/cassandra/bin/../conf:/software/cassandra/bin/../build/classes:/software/cassandra/bin/../lib/antlr-3.1.3.jar:/software/cassandra/bin/../lib/apache-cassandra-0.7.0.jar:/software/cassandra/bin/../lib/avro-1.4.0-fixes.jar:/software/cassandra/bin/../lib/avro-1.4.0-sources-fixes.jar:/software/cassandra/bin/../lib/commons-cli-1.1.jar:/software/cassandra/bin/../lib/commons-codec-1.2.jar:/software/cassandra/bin/../lib/commons-collections-3.2.1.jar:/software/cassandra/bin/../lib/commons-lang-2.4.jar:/software/cassandra/bin/../lib/concurrentlinkedhashmap-lru-1.1.jar:/software/cassandra/bin/../lib/guava-r05.jar:/software/cassandra/bin/../lib/high-scale-lib.jar:/software/cassandra/bin/../lib/ivy-2.1.0.jar:/software/cassandra/bin/../lib/jackson-core-asl-1.4.0.jar:/software/cassandra/bin/../lib/jackson-mapper-asl-1.4.0.jar:/software/cassandra/bin/../lib/jetty-6.1.21.jar:/software/cassandra/bin/../lib/jetty-util-6.1.21.jar:/software/cassandra/bin/../lib/jline-0.9.94.jar:/software/cassandra/bin/../lib/jna.jar:/software/cassandra/bin/../lib/json-simple-1.1.jar:/software/cassand

Re: Cassandra taking 100% cpu over multiple cores(0.7.0)

2011-01-23 Thread Erik Onnen
Filed as https://issues.apache.org/jira/browse/CASSANDRA-2037

I can't see how the code would be correct as written but I'm usually
wrong about most things.

On Sun, Jan 23, 2011 at 12:14 PM, Erik Onnen  wrote:
> One of the developers will have to confirm but this looks like a bug
> to me. MessagingService is a singleton and there's a Multimap used for
> targets that isn't accessed in a thread safe manner.
>
> The thread dump would seem to confirm this, when you hammer what is
> ultimately a standard HashMap with multiple readers while a write
> happens, there's a race that can be triggered where all subsequent
> reads on that map will get stuck.
>
> Your thread dump would seem to indicate that is happening with many
> (>50) threads trying to read from what appears to be the same HashMap.
>
> "pool-1-thread-6451" prio=10 tid=0x7fa5242c9000 nid=0x10f4
> runnable [0x7fa52fde4000]
>   java.lang.Thread.State: RUNNABLE
>        at java.util.HashMap.get(HashMap.java:303)
>        at com.google.common.collect.AbstractMultimap.getOrCreateCollection(AbstractMultimap.java:205)
>        at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:194)
>        at com.google.common.collect.AbstractListMultimap.put(AbstractListMultimap.java:72)
>        at com.google.common.collect.ArrayListMultimap.put(ArrayListMultimap.java:60)
>        at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:303)
>        at org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:353)
>        at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:229)
>        at org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98)
>        at org.apache.cassandra.thrift.CassandraServer.get(CassandraServer.java:289)
>        at org.apache.cassandra.thrift.Cassandra$Processor$get.process(Cassandra.java:2655)
>        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>
> When two threads get to this state, they won't ever come back and any
> other thread that tries to read that HashMap will also get wedged
> (looks like that's happened in your case).
>
> On Sun, Jan 23, 2011 at 10:11 AM, Thibaut Britz
>  wrote:
>>
>> Hi,
>>
>> After a while some of my cassandra nodes (out of a pool of 20) get
>> unresponsive. All cores (8) are being used to 100% and uptime load
>> climbs to 80.
>>
>> I first suspected the garbage collector to force the node to 100% cpu
>> usage, but the problem even occurs with the default garbage
>> collector. A nodetool info also reveals that the heap isn't even
>> completely used.
>>
>> The node has JNA enabled. There are no ERROR message in the logfile
>> (except CASSANDRA-1959 which happened a few days before. But I
>> restarted the node afterwards). Cassandra runs on non virtualized
>> hardware. (e.g. no amazon ec2). The version I use is 0.7.0. I haven't
>> applied any patches (e.g. CASSANDRA-1959).
>>
>> I attached the jstack of the process. (there are many many threads)
>>
>> Thanks,
>> Thibaut
>>
>>
>> ./nodetool -h localhost info
>> ccc
>> Load             : 1.66 GB
>> Generation No    : 1295609201
>> Uptime (seconds) : 196155
>> Heap Memory (MB) : 1795.55 / 3912.06
>>
>>
>> root      2870  730 57.4 6764500 4703712 ?     SLl  Jan21 23913:00
>> /usr/bin/java -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
>> -Xms3997M -Xmx3997M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseLargePages
>> -XX:+UseCompressedOops -Xss128k -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true
>> -Dcom.sun.management.jmxremote.port=8080
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dlog4j.configuration=log4j-server.properties -cp
>> /software/cassandra/bin/../conf:/software/cassandra/bin/../build/classes:/software/cassandra/bin/../lib/antlr-3.1.3.jar:/software/cassandra/bin/../lib/apache-cassandra-0.7.0.jar:/software/cassandra/bin/../lib/avro-1.4.0-fixes.jar:/software/cassandra/bin/../lib/avro-1.4.0-

CompactionExecutor EOF During Bootstrap

2011-03-07 Thread Erik Onnen
Stuff/stuffid_reg_idx-f-4


The same behavior has happened on both attempts. Logs from the node
giving up tokens show activity by the StreamStage thread but after the
failure on the bootstrapping node not much else relative to the
stream.

Lastly, the behavior in both cases seems to involve the third data
file. Files f-1, f-2 and f-4 are present but f-3 is not.

Any help would be appreciated.

-erik


Re: CompactionExecutor EOF During Bootstrap

2011-03-07 Thread Erik Onnen
Thanks Jonathan.

Filed: https://issues.apache.org/jira/browse/CASSANDRA-2283

We'll start the scrub during our normal compaction cycle and update
this thread and the bug with the results.

-erik

On Mon, Mar 7, 2011 at 11:27 AM, Jonathan Ellis  wrote:
> It sounds like it doesn't realize the data it's streaming over is
> older-version data.  Can you create a ticket?
>
> In the meantime nodetool scrub (on the existing nodes) will rewrite
> the data in the new format which should workaround the problem.
>
> On Mon, Mar 7, 2011 at 1:23 PM, Erik Onnen  wrote:
>> During a recent upgrade of our cassandra ring from 0.6.8 to 0.7.3 and
>> prior to a drain on the 0.6.8 nodes, we lost a node for reasons
>> unrelated to cassandra. We decided to push forward with the drain on
>> the remaining healthy nodes. The upgrade completed successfully for
>> the remaining nodes and the ring was healthy. However, we're unable to
>> boostrap in a new node. The bootstrap process starts and we can see
>> streaming activity in the logs for the node giving up tokens, but the
>> bootstrapping node encounters the following:
>>
>>
>> INFO [main] 2011-03-07 10:37:32,671 StorageService.java (line 505)
>> Joining: sleeping 3 ms for pending range setup
>>  INFO [main] 2011-03-07 10:38:02,679 StorageService.java (line 505)
>> Bootstrapping
>>  INFO [HintedHandoff:1] 2011-03-07 10:38:02,899
>> HintedHandOffManager.java (line 304) Started hinted handoff for
>> endpoint /10.211.14.200
>>  INFO [HintedHandoff:1] 2011-03-07 10:38:02,900
>> HintedHandOffManager.java (line 360) Finished hinted handoff of 0 rows
>> to endpoint /10.211.14.200
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:04,924
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuff-f-1
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:05,390
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuff-f-2
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:05,768
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-1
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:06,389
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-2
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:06,581
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-3
>> ERROR [CompactionExecutor:1] 2011-03-07 10:38:07,056
>> AbstractCassandraDaemon.java (line 114) Fatal exception in thread
>> Thread[CompactionExecutor:1,1,main]
>> java.io.EOFException
>>        at org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65)
>>        at org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:303)
>>        at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:923)
>>        at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:916)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:662)
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:08,480
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-5
>>  INFO [CompactionExecutor:1] 2011-03-07 10:38:08,582
>> SSTableReader.java (line 154) Opening
>> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid_reg_idx-f-1
>> ERROR [CompactionExecutor:1] 2011-03-07 10:38:08,635
>> AbstractCassandraDaemon.java (line 114) Fatal exception in thread
>> Thread[CompactionExecutor:1,1,main]
>> java.io.EOFException
>>        at org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65)
>>        at org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:303)
>>        at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:923)
>>        at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:916)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>        at java.util.concurrent.Thread

Re: Aamzon EC2 & Cassandra to ebs or not..

2011-03-09 Thread Erik Onnen
I'd recommend not storing commit logs or data files on EBS volumes if
your machines are under any decent amount of load. I say that for
three reasons.

First, both EBS volumes contend directly for network throughput with
what appears to be a peer QoS policy to standard packets. In other
words, if you're saturating a network link, EBS throughput falls. The
same has not been true of ephemeral volumes in all of our testing;
ephemeral I/O speeds tend to only take a minor hit under network
pressure and are consistently faster in raw speed tests.

Second, at some point it's a given that you will encounter misbehaving
EBS volumes. They won't completely fail, worse they will just get
really, really slow. Often times this is worse than a total failure
because the system just back piles reads/writes but doesn't totally
fall over until the entire cluster becomes overwhelmed. We've never
had single volume ephemeral problems.

Lastly, I think people have a tendency to bolt a large number of EBS
volumes to a host and think that because they have disk capacity they
can serve more data from fewer hosts. If you push that too far, you'll
outstrip the ability of the system to keep effective buffer caches and
concurrently serve requests for all the data it is responsible for
managing. IME there is pretty good parity between an EC2 XL and the
ephemeral disks available relative to how Cassandra uses disk and RAM,
so adding more storage puts you right at the breaking point of
overcommitting your hardware.

If you want protection from AZ failure, split your ring across AZs
(Cassandra is quite good at this) or copy snapshots to EBS volumes.

-erik

There are a lot of benefits to EBS volumes; I/O throughput and
reliability are not among them.

On Wed, Mar 9, 2011 at 8:39 AM, William Oberman
 wrote:
> I thought nodetool snapshot writes the snapshot locally, requiring 2x of
> expensive storage allocation 24x7 (vs. the cheap storage allocation of an
> EBS snapshot).  By that I mean EBS allocation (GB allocated per month) costs
> at one rate, while EBS snapshots are delta-compressed copies to S3.
>
> Can you point the snapshot to an external filesystem?
>
> will
>
> On Wed, Mar 9, 2011 at 11:31 AM, Sasha Dolgy  wrote:
>>
>> Could you not nodetool snapshot the data into an mounted ebs/s3 bucket and
>> satisfy your development requirement?
>> -sd
>>
>> On Wed, Mar 9, 2011 at 5:23 PM, William Oberman 
>> wrote:
>>>
>>> For me, to transition production data into a development environment for
>>> real world testing.  Also, backups are never a bad idea, though I agree most
>>> all risk is mitigated due to cassandra's design.
>>>
>>> will
>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
>


Re: FW: Very slow batch insert using version 0.7.2

2011-03-10 Thread Erik Forkalsrud


I see the same behavior with smaller batch sizes.  It appears to happen 
when starting Cassandra with the defaults on relatively large systems.  
Attached is a script I created to reproduce the problem. (usage: 
mutate.sh  /path/to/apache-cassandra-0.7.3-bin.tar.gz)   It extracts a 
stock cassandra distribution to a temp dir and starts it up, (single 
node) then creates a keyspace with a column family and does a batch 
mutate to insert 5000 columns.


When I run it on my laptop (Fedora 14, 64-bit, 4 cores, 8GB RAM)  it 
flushes one Memtable with 5000 operations
When I run it on a server  (RHEL5, 64-bit, 16 cores, 96GB RAM) it 
flushes 100 Memtables with anywhere between 1 operation and 359 
operations (35 bytes and 12499 bytes)


I'm guessing I can override the JVM memory parameters to avoid the 
frequent flushing on the server, but I haven't experimented with that 
yet.  The only difference in the effective command line between the 
laptop and server is "-Xms3932M -Xmx3932M -Xmn400M" on the laptop and 
"-Xms48334M -Xmx48334M -Xmn1600M" on the server.




--
Erik Forkalsrud
Commission Junction


On 03/10/2011 09:18 AM, Ryan King wrote:

Why use such a large batch size?

-ryan

On Thu, Mar 10, 2011 at 6:31 AM, Desimpel, Ignace
  wrote:


Hello,

I had a demo application with embedded cassandra version 0.6.x, inserting
about 120 K  row mutations in one call.

In version 0.6.x that usually took about 5 seconds, and I could repeat this
step adding each time the same amount of data.

Running on a single CPU computer, single hard disk, XP 32 bit OS, 1G memory

I tested this again on CentOS 64 bit OS, 6G memory, different settings of
memtable_throughput_in_mb and memtable_operations_in_millions.

Also tried version 0.7.3. Also the same behavior.



Now with version 0.7.2 the call returns with a timeout exception even using
a timeout of 120000 ms (2 minutes). I see the CPU time going to 100%, a lot of
disk writing ( giga bytes), a lot of log messages  about compacting,
flushing, commitlog, …



Below you can find some information using the nodetool at start of the batch
mutation and also after 14 minutes. The MutationStage is clearly showing how
slow the system handles the row mutations.



Attached : Cassandra.yaml with at end the description of my database
structure using yaml

Attached : log file with cassandra output.



Any idea what I could be doing wrong?



Regards,



Ignace Desimpel



ignace.desim...@nuance.com



At start of the insert (after inserting 124360 row mutations) I get the
following info from the nodetool :



C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com info

Starting NodeTool

34035877798200531112672274220979640561

Gossip active: true

Load : 5.49 MB

Generation No: 1299502115

Uptime (seconds) : 1152

Heap Memory (MB) : 179,84 / 1196,81



C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com tpstats

Starting NodeTool

Pool Name               Active   Pending   Completed
ReadStage                    0         0       40637
RequestResponseStage         0         0          30
MutationStage               32    121679       72149
GossipStage                  0         0           0
AntiEntropyStage             0         0           0
MigrationStage               0         0           1
MemtablePostFlusher          0         0           6
StreamStage                  0         0           0
FlushWriter                  0         0           5
MiscStage                    0         0           0
FlushSorter                  0         0           0
InternalResponseStage        0         0           0
HintedHandoff                0         0           0



After 14 minutes (timeout exception after 2 minutes : see log file) I get :



C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com info

Starting NodeTool

34035877798200531112672274220979640561

Gossip active: true

Load : 10.31 MB

Generation No: 1299502115

Uptime (seconds) : 2172

Heap Memory (MB) : 733,82 / 1196,81



C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com tpstats

Starting NodeTool

Pool Name               Active   Pending   Completed
ReadStage                    0         0       40646
RequestResponseStage         0         0          30
MutationStage               32    103310       90526
GossipStage                  0         0           0
AntiEntropyStage             0         0           0
MigrationStage               0         0           1
MemtablePostFlusher          0         0          69
StreamStage                  0         0           0
FlushWriter                  0         0          68
FILEUTILS-DELETE-POOL        0         0          42

MiscStage  

Re: FW: Very slow batch insert using version 0.7.2

2011-03-11 Thread Erik Forkalsrud

On 03/11/2011 04:56 AM, Zhu Han wrote:


When I run it on my laptop (Fedora 14, 64-bit, 4 cores, 8GB RAM)
 it flushes one Memtable with 5000 operations
When I run it on a server  (RHEL5, 64-bit, 16 cores, 96GB RAM) it
flushes 100 Memtables with anywhere between 1 operation and 359
operations (35 bytes and 12499 bytes)


What's the settings of commit log flush, periodic or in batch?



It's whatever the default setting is, (in the cassandra.yaml that is 
packaged in the apache-cassandra-0.7.3-bin.tar.gz download) specifically:


   commitlog_rotation_threshold_in_mb: 128
   commitlog_sync: periodic
   commitlog_sync_period_in_ms: 10000
   flush_largest_memtables_at: 0.75

If I describe keyspace I get:

   [default@unknown] describe keyspace Events;
   Keyspace: Events:
 Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Replication Factor: 1
 Column Families:
   ColumnFamily: Event
 Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
 Row cache size / save period: 0.0/0
 Key cache size / save period: 20.0/14400
 Memtable thresholds: 14.109375/3010/1440
 GC grace seconds: 864000
 Compaction min/max thresholds: 4/32
 Read repair chance: 1.0
 Built indexes: []


It turns out my suspicion was right. When I tried overriding the jvm 
memory parameters calculated in conf/cassandra-env.sh to use the values 
calculated on my 8GB laptop like this:


   MAX_HEAP_SIZE=3932m HEAP_NEWSIZE=400m ./mutate.sh

That made the server behave much nicer.  This time it kept all 5000 
operations in a single Memtable.  Also, when running with these memory 
settings the Memtable thresholds changed to "1.1390625/243/1440"   (from 
"14.109375/3010/1440")  (all the other output from "describe 
keyspace" remains the same)


So it looks like something goes wrong when cassandra gets too much memory.


--
Erik Forkalsrud
Commission Junction



Re: FW: Very slow batch insert using version 0.7.2

2011-03-11 Thread Erik Forkalsrud

On 03/11/2011 12:13 PM, Jonathan Ellis wrote:

https://issues.apache.org/jira/browse/CASSANDRA-2158, fixed in 0.7.3

you could have saved a lot of time just by upgrading first. :)



Hmm, I'm testing with 0.7.3 ...
but now I know at least which knob to turn.


- Erik -



Re: FW: Very slow batch insert using version 0.7.2

2011-03-11 Thread Erik Forkalsrud

On 03/11/2011 12:13 PM, Jonathan Ellis wrote:

https://issues.apache.org/jira/browse/CASSANDRA-2158, fixed in 0.7.3

you could have saved a lot of time just by upgrading first. :)



It looks like the fix isn't entirely correct.  The bug is still in 
0.7.3.   In Memtable.java, the line:


   THRESHOLD = cfs.getMemtableThroughputInMB() * 1024 * 1024;

should be changed to:

   THRESHOLD = cfs.getMemtableThroughputInMB() * 1024L * 1024L;


Here's some code that illustrates the difference:

public void testMultiplication() {
    int memtableThroughputInMB = 2300;
    // int arithmetic overflows: 2300 * 1024 * 1024 = 2411724800,
    // which is larger than Integer.MAX_VALUE (2147483647)
    long thresholdA = memtableThroughputInMB * 1024 * 1024;
    // promoting to long before multiplying yields the intended value
    long thresholdB = memtableThroughputInMB * 1024L * 1024L;
    // prints: a=-1883242496 b=2411724800
    System.out.println("a=" + thresholdA + " b=" + thresholdB);
}



- Erik -



On Fri, Mar 11, 2011 at 2:02 PM, Erik Forkalsrud  wrote:

On 03/11/2011 04:56 AM, Zhu Han wrote:

When I run it on my laptop (Fedora 14, 64-bit, 4 cores, 8GB RAM)  it
flushes one Memtable with 5000 operations
When I run it on a server  (RHEL5, 64-bit, 16 cores, 96GB RAM) it flushes
100 Memtables with anywhere between 1 operation and 359 operations (35 bytes
and 12499 bytes)

What's the settings of commit log flush, periodic or in batch?


It's whatever the default setting is, (in the cassandra.yaml that is
packaged in the apache-cassandra-0.7.3-bin.tar.gz download) specifically:

commitlog_rotation_threshold_in_mb: 128
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
flush_largest_memtables_at: 0.75

If I describe keyspace I get:

[default@unknown] describe keyspace Events;
Keyspace: Events:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Replication Factor: 1
  Column Families:
ColumnFamily: Event
  Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/14400
  Memtable thresholds: 14.109375/3010/1440
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: []


It turns out my suspicion was right. When I tried overriding the jvm memory
parameters calculated in conf/cassandra-env.sh to use the values calculated
on my 8GB laptop like this:

MAX_HEAP_SIZE=3932m HEAP_NEWSIZE=400m ./mutate.sh

That made the server behave much nicer.  This time it kept all 5000
operations in a single Memtable.  Also, when running with these memory
settings the Memtable thresholds changed to "1.1390625/243/1440"   (from
"14.109375/3010/1440")  (all the other output from "describe keyspace"
remains the same)

So it looks like something goes wrong when cassandra gets too much memory.


--
Erik Forkalsrud
Commission Junction









SSTable Corruption

2011-03-23 Thread Erik Onnen
After an upgrade from 0.7.3 to 0.7.4, we're seeing the following on
several data files:

ERROR [main] 2011-03-23 18:58:33,137 ColumnFamilyStore.java (line 235)
Corrupt sstable
/mnt/services/cassandra/var/data/0.7.4/data/Helium/dp_idx-f-4844=[Index.db,
Statistics.db, Data.db, Filter.db]; skipped
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:207)
at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:166)
at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:226)
at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:472)
at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:453)
at org.apache.cassandra.db.Table.initCf(Table.java:317)
at org.apache.cassandra.db.Table.<init>(Table.java:254)
at org.apache.cassandra.db.Table.open(Table.java:110)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:160)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)

Prior to the upgrade, there weren't problems opening these data files
with the exception that we were not able to bootstrap new nodes into
the ring due to what we believe was caused by CASSANDRA-2283. In
trying to fix 2283, we did run a scrub on this node.

With an RF of 3 in our ring, what are the implications of this? Specifically,

1) If the node is a natural endpoint for rows in the corrupt sstable,
will it simply not respond to requests for that row or will there be
errors?
2) Should a repair fix the broken sstable? This was my expectation
based on similar threads on the mailing list.
3) If #2 is correct and the repair did not fix the corrupt sstables
(which it did not), how should we proceed to repair the tables?

Thanks,
-erik


Re: SSTable Corruption

2011-03-23 Thread Erik Onnen
Thanks, so is it the "[Index.db, Statistics.db, Data.db, Filter.db];
skipped" that indicates it's in Statistics? Basically I need a way to
know if the same is true of all the other tables showing this issue.

-erik


Re: ParNew (promotion failed)

2011-03-23 Thread Erik Onnen
It's been about 7 months now but at the time G1 would regularly
segfault for me under load on Linux x64. I'd advise extra precautions
in testing and make sure you test with representative load.


Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-07 Thread Erik Onnen
I'll capture what we're seeing here for anyone else who may look
into this in more detail later.

Our standard heap growth is ~300K in between collections with regular
ParNew collections happening on average about every 4 seconds. All
very healthy.

The memtable flush (where we see almost all our CMS activity) seems to
have a balloon effect: despite a 64MB memtable size, it causes over
512MB of heap to be consumed in half a second. In addition to the
hefty amount of garbage it creates, the MaxTenuringThreshold=1
setting means most of that garbage spills immediately into the
tenured generation, which quickly fills and triggers a CMS. The rate
of garbage overflowing to tenured seems to outstrip the concurrent
mark worker, which is almost always interrupted, failing the
concurrent collection over to a stop-the-world collection. However,
the tenured collection is usually hugely effective, recovering over
half the total heap.

Two questions for the group then:

1) Does this seem like a sane amount of garbage (512MB) to generate
when flushing a 64MB table to disk?
2) Is this possibly a case of the MaxTenuringThreshold=1 working
against cassandra? The flush seems to create a lot of garbage very
quickly such that normal CMS isn't even possible. I'm sure there was a
reason to introduce this setting but I'm not sure it's universally
beneficial. Is there any history on the decision to opt for immediate
promotion rather than using an adaptable number of survivor
generations?
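
As a concrete (hypothetical, untested here) experiment for question 2,
the relevant knobs in cassandra-env.sh would be something like:

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"      # give flush garbage a few survivor cycles to die young
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"             # larger survivor spaces to absorb the flush spike
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"  # log object ages to verify before committing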


CL.ONE reads and SimpleSnitch unnecessary timeouts

2011-04-13 Thread Erik Onnen
Sorry for the complex setup, took a while to identify the behavior and
I'm still not sure I'm reading the code correctly.

Scenario:

Six node ring w/ SimpleSnitch and RF3. For the sake of discussion
assume the token space looks like:

node-0 1-10
node-1 11-20
node-2 21-30
node-3 31-40
node-4 41-50
node-5 51-60

In this scenario we want key 35 where nodes 3,4 and 5 are natural
endpoints. Client is connected to node-0, node-1 or node-2. node-3
goes into a full GC lasting 12 seconds.

What I think we're seeing is that as long as we read with CL.ONE *and*
are connected to 0,1 or 2, we'll never get a response for the
requested key until the failure detector kicks in and convicts 3
resulting in reads spilling over to the other endpoints.

We've tested this by switching to CL.QUORUM and since haven't seen
read timeouts during big GCs.

Assuming the above, is this behavior really correct? We have copies of
the data on two other nodes but because this snitch config always
picks node-3, we always timeout until conviction which can take up to
8 seconds sometimes. Shouldn't the read attempt to pick a different
endpoint in the case of the first timeout rather than repeatedly
trying a node that isn't responding?

Thanks,
-erik


Re: CL.ONE reads and SimpleSnitch unnecessary timeouts

2011-04-13 Thread Erik Onnen
So we're not currently using a dynamic snitch, only the SimpleSnitch
is at play (lots of history as to why, I won't go into it). If this
would solve our problems I'm fine changing it.

Understood re: client contract. I guess in this case my issue is that
the server we're connected to never tries more than the one failing
server until failure detector has kicked in - it keeps flogging the
bad server so subsequent requests never produce a different result
until conviction.

Regarding clients retrying, in this configuration the situation
doesn't improve and it still times out because our client libraries
don't try another host. They still have a valid connection to a
working host, it's just that given our configuration that one node
keeps proxying to a bad server and never routes around it. It sounds
like switching to the dynamic snitch would adjust for the first
timeout on subsequent attempts, so maybe that's the most advisable
thing in this case.
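
(If we do go that route, my understanding is it's just a cassandra.yaml
change, roughly:

dynamic_snitch: true
dynamic_snitch_badness_threshold: 0.1

exact option names and a sane threshold still to be confirmed against our
0.7 config.)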

On Wed, Apr 13, 2011 at 10:58 AM, Jonathan Ellis  wrote:
> First, our contract with the client says "we'll give you the answer or
> a timeout after rpc_timeout." Once we start trying to cheat on that
> the client has no guarantee anymore when it should expect a response
> by. So that feels iffy to me.
>
> Second, retrying to a different node isn't expected to give
> substantially better results than the client issuing a retry itself if
> that's what it wants, since by the time we timeout once then FD and/or
> dynamic snitch should route the request to another node for the retry
> without adding additional complexity to StorageProxy.  (If that's not
> what you see in practice, then we probably have a dynamic snitch bug.)
>
> On Wed, Apr 13, 2011 at 12:32 PM, Erik Onnen  wrote:
>> Sorry for the complex setup, took a while to identify the behavior and
>> I'm still not sure I'm reading the code correctly.
>>
>> Scenario:
>>
>> Six node ring w/ SimpleSnitch and RF3. For the sake of discussion
>> assume the token space looks like:
>>
>> node-0 1-10
>> node-1 11-20
>> node-2 21-30
>> node-3 31-40
>> node-4 41-50
>> node-5 51-60
>>
>> In this scenario we want key 35 where nodes 3,4 and 5 are natural
>> endpoints. Client is connected to node-0, node-1 or node-2. node-3
>> goes into a full GC lasting 12 seconds.
>>
>> What I think we're seeing is that as long as we read with CL.ONE *and*
>> are connected to 0,1 or 2, we'll never get a response for the
>> requested key until the failure detector kicks in and convicts 3
>> resulting in reads spilling over to the other endpoints.
>>
>> We've tested this by switching to CL.QUORUM and since haven't seen
>> read timeouts during big GCs.
>>
>> Assuming the above, is this behavior really correct? We have copies of
>> the data on two other nodes but because this snitch config always
>> picks node-3, we always timeout until conviction which can take up to
>> 8 seconds sometimes. Shouldn't the read attempt to pick a different
>> endpoint in the case of the first timeout rather than repeatedly
>> trying a node that isn't responding?
>>
>> Thanks,
>> -erik
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Error in TBaseHelper compareTo(byte [] a , byte [] b)

2010-05-03 Thread Erik Holstad
Hey!
We are currently using Cassandra 0.5.1 and I'm getting a StackOverflowError
when comparing two ColumnOrSuperColumn objects. It turns out that the
compareTo function for byte [] has an infinite loop in libthrift-r820831.jar.

We are planning to upgrade to 0.6.1 but aren't ready to do it today, so I
just wanted to check if it is possible to get a jar that works with 0.5
where that bug has been fixed, so we can just replace it?
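
For reference, the behavior we expect from that compareTo is a plain
lexicographic byte comparison, something like this sketch (the signed-byte
comparison is my reading of the intent, not a verified patch of thrift's
actual code):

public static int compareTo(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1; // first differing byte decides
        }
    }
    return a.length - b.length; // common prefix: shorter array sorts first
}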

-- 
Regards Erik


Re: Error in TBaseHelper compareTo(byte [] a , byte [] b)

2010-05-04 Thread Erik Holstad
Thanks Jonathan!
Yeah, I will just wait until we are ready for upgrade and hold of on that
project for now.

Erik


Re: Batch mutate doesn't work

2010-05-06 Thread Erik Bogghed

Try this. You need to use the dict module. 

MutationMap = dict:store(Key,
                  dict:store("KeyValue",
                      [#mutation{column_or_supercolumn =
                           #columnOrSuperColumn{column =
                               #column{name = "property",
                                       value = "value",
                                       timestamp = 2}}}],
                      dict:new()),
                  dict:new()),

Erik


On 30 apr 2010, at 15.58, Zubair Quraishi wrote:

> I have the following code in Erlang to set a value and then add a
> property. The first set works but the mutate fails. Can anyone
> enlighten me?
> Thanks
> 
> {ok, C} = thrift_client:start_link("127.0.0.1",9160, cassandra_thrift),
> 
> Key = "Key1",
> 
> %
> % set first property
> %
> thrift_client:call( C,
>  'insert',
>  [ "Keyspace1",
>Key,
>#columnPath{column_family="KeyValue", column="value"},
>"value1",
>1,
>1
>] ),
> 
> %
> % set second property ( fails! - why? )
> %
> MutationMap =
> {
>   Key,
>   {
> <<"KeyValue">>,
> [
>   #mutation{
> column_or_supercolumn = #column{ name = "property" , value =
> "value" , timestamp = 2 }
>   }
> ]
>   }
> },
> thrift_client:call( C,
>   'batch_mutate',
>   [ "Keyspace1",
>  MutationMap,
>  1
>   ] )
> 
> : The error returned is :
> 
> ** exception exit: {bad_return_value,{error,{function_clause,[{dict,size,
>   [{"Key1",
> 
> {<<"KeyValue">>,
> 
> [{mutation,{column,"property","value",2},undefined}]}}]},
> 
> {thrift_protocol,write,2},
> 
> {thrift_protocol,struct_write_loop,3},
> 
> {thrift_protocol,write,2},
> 
> {thrift_client,send_function_call,3},
> 
> {thrift_client,'-handle_call/3-fun-0-',3},
> 
> {thrift_client,catch_function_exceptions,2},
> 
> {thrift_client,handle_call,3}]}}}
> 7>
> =ERROR REPORT 30-Apr-2010::15:13:42 ===
> ** Generic server <0.55.0> terminating
> ** Last message in was {call,batch_mutate,
>  ["Keyspace1",
>   {"Key1",
>{<<"KeyValue">>,
> [{mutation,
>  {column,"property","value",2},
>  undefined}]}},
>   1]}
> ** When Server state == {state,cassandra_thrift,
>{protocol,thrift_binary_protocol,
> {binary_protocol,
>  {transport,thrift_buffered_transport,<0.58.0>},
>  true,true}},
>0}
> ** Reason for termination ==
> ** {bad_return_value,
>  {error,
>  {function_clause,
>  [{dict,size,
>   [{"Key1",
> {<<"KeyValue">>,
>  [{mutation,
>   {column,"property","value",2},
>   undefined}]}}]},
>   {thrift_protocol,write,2},
>   {thrift_protocol,struct_write_loop,3},
>   {thrift_protocol,write,2},
>   {thrift_client,send_function_call,3},
>   {thrift_client,'-handle_call/3-fun-0-',3},
>   {thrift_client,catch_function_exceptions,2},
>   {thrift_client,handle_call,3}]}}}



Re: non blocking Cassandra with Tornado

2010-08-06 Thread Erik Onnen
On Thu, Jul 29, 2010 at 9:57 PM, Ryan Daum  wrote:

>
> Barring this we (place where I work, Chango) will probably eventually fork
> Cassandra to have a RESTful interface and use the Jetty async HTTP client to
> connect to it. It's just ridiculous for us to have threads and associated
> resources tied up on I/O-blocked operations.
>

We've done exactly this but with Netty rather than Jetty. Helps too because
we can easily have testers look at what we put into our CFs. Took some
cheating to marshall the raw binaries into JSON but I'm pretty happy with
what it's bought us so far.


Cassandra suitability for model?

2010-11-02 Thread Erik Bunn

...or perhaps vice versa: how would I tweak a model to suit Cassandra?

I have in mind data that could be _almost_ shoehorned into the (S)CF
structure, and I'd love to hammer this nail with something hadoopy, but I
have a niggling suspicion I'm setting myself up for frustration.

I have
* a relatively small set of primary tags (dozens or hundreds per cluster)
* under each primary tag a large number (on the scale of 1E6) of arbitrary
  length hierarchical paths ("foo/bar/xyzzy", typically consisting of
  descriptive labels, usually totaling 20-40 chars)
* under each path an arbitrary number (usually a few or a few dozen, but in
  some systematic cases ~1000) of leaf tags (typically descriptive labels, say
  4-16 chars in length)
* under each leaf tag a value (arbitrary; string, number, perhaps binary)

On the surface, it would seem that the primary tag would correspond well with
Supercolumn keys, the intermediate path with ColumnFamily names, and the final
key-value-pairs with Columns. Any warning bells here? (Seems like I could also
use the primary tag as a Keyspace name, but I seem to recall some warnings
about using excessive keyspaces.)
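
In sketch form, the mapping I'm contemplating is roughly:

ColumnFamily:  "foo/bar/xyzzy"                  (one CF per path?)
SuperColumn:   primary_tag
Columns:       leaf_tag_1 = value_1, leaf_tag_2 = value_2, ...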

The gist is that each and every leaf tag, across the whole data set, receives
a value every few seconds, indefinitely, and history must be preserved. In
practice, all Columns in the ColumnFamily receive a value at the same time.
100k ColumnFamily updates a second would be routine. Nodes would be added
whenever storage or per-node I/O became an issue.
A query is a much more rare occurrence, and would nearly always involve
retrieving the full contents of a ColumnFamily over some time range
(usually thousands of snapshots, not at all rarely millions).

Just by browsing online documentation I can't resolve whether this
timestamping could work in conjunction with Cassandra's internal native
timestamping as is. Is it possible - out of the box, or with minor coding -
to retain history, and to incorporate time ranges into queries? If so, does
the distributed storage cope with the accumulating data of a single
ColumnFamily flowing over to new nodes?

Or: should I twist the whole thing around and incorporate my timestamp
into the ColumnFamily identifier to enjoy automatic scaling? Would the
sheer number of resulting identifiers become a performance issue?

Thanks for your comments;
//e


0.6 to 0.7 upgrade assumptions

2010-11-02 Thread Erik Onnen
Hello,

We're planning an upgrade from 0.6.7 (after it's released) to 0.7.0 (after
it's released) and I wanted to validate my assumptions about what can be
expected. Obviously I'll need to test my own assumptions but I was hoping
for some guidance to make sure my understanding is correct.

My core assumptions are:

1) After taking the upstream services offline and inducing a flush on 0.6
nodes so that all data is persisted to sstables, I should be able to start
0.7 binaries on top of the 0.6 data files with no issue.

2) When upgrading, I can change from RackUnawareStrategy to a
PropertyFileSnitch configuration. My understanding is that read repair will
kick in for the missing facility replicas at a rate roughly equivalent to
the read repair chance.

3) While read repair from #2 is occurring, the now incorrect replica nodes
will continue to be capable of servicing requests for the data that has not
been migrated to the correct replicas.
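
For reference, the topology file we'd deploy for #2 would look roughly like
this (cassandra-topology.properties; addresses and names made up):

# ip=datacenter:rack
192.168.1.10=DC1:RAC1
192.168.2.10=DC2:RAC1
default=DC1:RAC1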

Thanks!
-erik


Re: Model to store biggest score

2010-03-18 Thread Erik Holstad
Another approach you can take is to add the userid to the score, like
=> (column=140_uid2, value=[], timestamp=1268841641979)
and if you need the scores time-sorted you can add
=> (column=140_268841641979_uid2, value=[], timestamp=1268841641979)

But I do think that in any case you need to remove the old entry so that
you don't get duplicates, unless I'm missing something here.
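Roughly, per score update, in pseudo-operations:

remove(key, "130_uid5")    # delete the old score column
insert(key, "140_uid5")    # write the new one

ideally both in the same batch_mutate so readers never see duplicates.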


On Wed, Mar 17, 2010 at 9:52 AM, Brandon Williams  wrote:

> On Wed, Mar 17, 2010 at 11:48 AM, Richard Grossman wrote:
>
>> But in the case of a simple column family I've the same problem: when I
>> update the score of 1 user then I need to remove his old score too. For
>> example, here the user uid5 was at 130 and now he is at 140; because I add
>> the random number, cassandra will keep all the score evolution.
>>
>
> You can maintain another index mapping users to the values.  Depending on
> your use case though, if this is time-based, you can name the rows by the
> date and just create new rows as time goes on.
>
> -Brandon
>



-- 
Regards Erik


Re: Timestamp for versioning?

2010-03-23 Thread Erik Holstad
If I'm not totally mistaken, the timestamp is used for conflict resolution
on the server.

Have a look at:
http://wiki.apache.org/cassandra/DataModel for more info
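
For example: write v1 with timestamp 100, then write v2 with timestamp 90
to the same column; a read returns v1, since the highest timestamp wins
regardless of arrival order.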



2010/3/23 Waaij, B.D. (Bram) van der 

> Can you explain then to me what the purpose is of the timestamp?
>
>
> > -Original Message-
> > From: Jonathan Ellis [mailto:jbel...@gmail.com]
> > Sent: dinsdag 23 maart 2010 15:35
> > To: user@cassandra.apache.org
> > Subject: Re: Timestamp for versioning?
> >
> > No, we're not planning to add support for retrieving old versions.
> >
> > 2010/3/23 Waaij, B.D. (Bram) van der :
> > > Are there any plans to add versioning in the near future to
> > cassandra?
> > >
> > > What is the purpose of the timestamp in the current implementation?
> > >
> > > Best regards,
> > > Bram
> > >
> > > 
> > > From: 张宇 [mailto:zyxt...@gmail.com]
> > > Sent: dinsdag 23 maart 2010 14:56
> > > To: user@cassandra.apache.org
> > > Cc: Waaij, B.D. (Bram) van der
> > > Subject: Re: Timestamp for versioning?
> > >
> > > HBase supports the version through the timestamp,but Cassandra not.
> > >
> > > 2010/3/23 Waaij, B.D. (Bram) van der 
> > >>
> > >> Hi there,
> > >>
> > >> Is there already support for doing something with the timestamp? I
> > >> would like to use it as a versioning mechanism of the
> > value belonging
> > >> to a certain key. When i store a column with the same key,
> > different
> > >> value and different timestamp, would i still be able to
> > retrieve both versions?
> > >>
> > >> Looking at the API I can not find such method, but perhaps I am
> > >> missing something as I am new to cassandra.
> > >>
> > >> Best regards,
> > >> Bram
> > >>
> > >> This e-mail and its contents are subject to the DISCLAIMER at
> > >> http://www.tno.nl/disclaimer/email.html
> > >
> > > This e-mail and its contents are subject to the DISCLAIMER at
> > > http://www.tno.nl/disclaimer/email.html
> > >
> >
> This e-mail and its contents are subject to the DISCLAIMER at
> http://www.tno.nl/disclaimer/email.html
>



-- 
Regards Erik


Re: Auto Increament

2010-03-24 Thread Erik Holstad
On Wed, Mar 24, 2010 at 11:00 AM, Jesus Ibanez wrote:

> You can generate UUIDs based on time with http://jug.safehaus.org/ if you
> use Java. And it's easy to use; you just have to insert one line:
> UUID uuid = UUIDGenerator.getInstance().generateTimeBasedUUID();
>
> Maybe a solution to your cuestion:
>
> To "replace" the autoincrement of MySQL, you can create a column family of
> type Super ordering super columns by decreasing order and naming the super
> columns with a numeric value. So then, if you want to insert a new value,
> first read the last inserted column name and add 1 to the returned value.
> Then insert a new super column whose name will be the new value.
>
I'm a little bit confused about this. Where do you get the last inserted
column from? And why is this better than having a fixed counter row where
you increment the column value? Neither of them is threadsafe.


> Be careful with super columns, becouse I think that if you want to read
> just a column of a super column, all columns will be deserialized.
>
> So you will have this:
>
> Super_Columns_Family_Name = { // this is a ColumnFamily name of type Super
> one_key: {// this is the key to this row inside the Super CF
>
> n: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
>
> n-1: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
>
> ...
>
> 2: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
> 1: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
> },
>
>  another_key: {// this is the key to this row inside the Super CF
>
>
> n: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
>
> n-1: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
>
> ...
>
> 2: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
> 1: {column1: "value 1", column2: "value 2", ... , columnN: "value n"},
> },
>
> }
>
> The super columns that represent the autoincrement value are the ones named
> as: n, n-1, ... , 2, 1.
>
> Hope it helps!
>
> Jesus.
>
>
>
> 2010/3/24 Sylvain Lebresne 
>
> > How can I replace the "auto increament" attribute in MySQL
>> > with Cassandra?
>>
>> You can't. Not easily at least.
>>
>> > If I can't, how can I generate an ID which is globally
>> > unique for each of columns?
>>
>> Check UUIDs:
>> http://en.wikipedia.org/wiki/Universally_Unique_Identifier
>>
>> >
>> > Thanks,
>> >
>> > Sent from my iPhone
>> >
>>
>
>


-- 
Regards Erik


Replicating data over the wan?

2010-03-30 Thread Erik Holstad
Is anyone using datacenter aware replication where the replication takes
place over the wan
and not over super fast optical cable between the centers?

Tried to look at all posts related to the topic but haven't really found too
much, only some things
about not doing that if using ZooKeeper and some other small comments.

What are the limitations for this kind of replication, what happens when the
writes are coming
in too fast for the replication to be done, assuming some kind of write
buffer for the replication?

Not really worried too much about the inconsistent state of the cluster,
just how well it would
work is this kind of environment.

-- 
Regards Erik


Re: Replicating data over the wan?

2010-03-30 Thread Erik Holstad
Thanks David and Jonathan for the info.

Those two links were pretty much the only thing that I did find about this
issue, but I wasn't sure that just because it works for different zones it
would also work for different regions.


-- 
Regards Erik


PropertyFileEndPointSnitch

2010-04-19 Thread Erik Holstad
When building the PropertyFileEndPointSnitch into the jar
cassandra-propsnitch.jar, the files in the jar end up under
src/java/org/apache/cassandra/locator/PropertyFileEndPointSnitch.class
instead of org/apache/cassandra/locator/PropertyFileEndPointSnitch.class.
Am I doing something wrong, is this intended behavior, or is it a bug?
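
In case it's useful context, I'd expect entries to be rooted at org/ when
the jar is built with something like

jar cf cassandra-propsnitch.jar -C build/classes org

where -C changes into the classes directory before adding entries
(assuming build/classes is where the compiled classes land).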

-- 
Regards Erik


Re: PropertyFileEndPointSnitch

2010-04-19 Thread Erik Holstad
Thanks Jonathan!


Re: Best way to store millisecond-accurate data

2010-04-23 Thread Erik Holstad
On Fri, Apr 23, 2010 at 5:54 PM, Miguel Verde wrote:

> TimeUUID's time component is measured in 100-nanosecond intervals. The
> library you use might calculate it with poorer accuracy or precision, but
> from a storage/comparison standpoint in Cassandra millisecond data is easily
> captured by it.
>
> One typical way of dealing with the data explosion of sampled time series
> data is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you
> put an upper bound on the row length.
>
>
> On Apr 23, 2010, at 7:01 PM, Andrew Nguyen <
> andrew-lists-cassan...@ucsfcti.org> wrote:
>
>  Hello,
>>
>> I am looking to store patient physiologic data in Cassandra - it's being
>> collected at rates of 1 to 125 Hz.  I'm thinking of storing the timestamps
>> as the column names and the patient/parameter combo as the row key.  For
>> example, Bob is in the ICU and is currently having his blood pressure,
>> intracranial pressure, and heart rate monitored.  I'd like to collect this
>> with the following row keys:
>>
>> Bob-bloodpressure
>> Bob-intracranialpressure
>> Bob-heartrate
>>
>> The column names would be timestamps but that's where my questions start:
>>
>> I'm not sure what the best data type and CompareWith would be.  From my
>> searching, it sounds like the TimeUUID may be suitable but isn't really
>> designed for millisecond accuracy.  My other thought is just to store them
>> as strings (2010-04-23 10:23:45.016).  While space isn't the foremost
>> concern, we will be collecting this data 24/7 so we'll be creating many
>> columns over the long-term.
>>
> You could just get an 8 byte millisecond timestamp and store that as a part
of the key
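
A sketch of what I mean (this assumes a bytes-comparing column comparator,
so big-endian longs sort chronologically):

import java.nio.ByteBuffer;

long ts = System.currentTimeMillis();  // millisecond timestamp
byte[] columnName = ByteBuffer.allocate(8).putLong(ts).array();  // 8 bytes, big-endian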

>
>> I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states
>> that the entire row must fit in memory.  Does this include the values as
>> well as the column names?
>>
Yes. The option is to store one insert per row; you are not going to be
able to do backward slices this way without an extra index, but you can
scale much better.

>
>> In considering the limits of cassandra and the best way to model this, we
>> would be adding 3.9 billion rows per year (assuming 125 Hz @ 24/7).
>>  However, I can't really think of a better way to model this...  So, am I
>> thinking about this all wrong or am I on the right track?
>>
>> Thanks,
>> Andrew
>>
>


-- 
Regards Erik


Re: The Difference Between Cassandra and HBase

2010-04-24 Thread Erik Holstad
I would say that HBase is a little bit more focused on reads and Cassandra
on writes.
HBase has better scans and Cassandra better multi datacenter functionality.

Erik


Re: Performance drop of current Java drivers

2020-05-05 Thread Erik Merkle
Matthias,

Thanks for sharing your findings and test code. We were able to track this
to a regression in the underlying Netty library and already have a similar
issue reported here:
https://datastax-oss.atlassian.net/browse/JAVA-2676

The regression seems to be with the upgrade to Netty version 4.1.45, which
affects driver versions 4.5.0, 4.5.1 and 4.6.0. You can work around this
issue by explicitly downgrading Netty to version 4.1.43. If you are using
Maven, you can do this by explicitly declaring the version of Netty in your
dependencies as follows:


<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-handler</artifactId>
  <version>4.1.43.Final</version>
</dependency>


We will be addressing this issue and will likely release a fixed version
very soon.

Many thanks again,
Erik

On Mon, May 4, 2020 at 6:58 AM Matthias Pfau 
wrote:

> Hi Chris and Adam,
> thanks for looking into this!
>
> You can find my tests for old/new client here:
> https://gist.github.com/mpfau/7905cea3b73d235033e4f3319e219d15
> https://gist.github.com/mpfau/a62cce01b83b56afde0dbb588470bc18
>
>
> May 1, 2020, 16:22 by adam.holmb...@datastax.com:
>
> > Also, if you can share your schema and benchmark code, that would be a
> good start.
> >
> > On Fri, May 1, 2020 at 7:09 AM Chris Splinter <chris.splinter...@gmail.com> wrote:
> >
> >> Hi Matthias,
> >>
> >> I have forwarded this to the developers that work on the Java driver
> and they will be looking into this first thing next week.
> >>
> >> Will circle back here with findings,
> >>
> >> Chris
> >>
> >> On Fri, May 1, 2020 at 12:28 AM Erick Ramirez <erick.rami...@datastax.com> wrote:
> >>
> >>> Matthias, I don't have an answer to your question but I just wanted to
> note that I don't believe the driver contributors actively watch this
> mailing list (I'm happy to be corrected 🙂 ) so I'd recommend you
> cross-post in the Java driver channels as well. Cheers!
> >>>
> >
> >
> > --
> > Adam Holmberg
> > e. adam.holmb...@datastax.com
> > w. www.datastax.com
> >
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>

-- 
Erik Merkle
e. erik.mer...@datastax.com
w. www.datastax.com


Re: Integration tests - is anything wrong with Cassandra beta4 docker??

2021-03-10 Thread Erik Merkle
I opened a ticket when beta4 was released, but no Docker image was
available. It turned out the reason the image wasn't published is that the
tests DockerHub uses to verify the builds were timing out, so it may be a
similar issue you are running into. The ticket is here:

https://github.com/docker-library/cassandra/issues/221#issuecomment-761187461

DockerHub ended up altering some of the default Cassandra settings for
their verification tests only, to allow the tests to pass. For your
integration tests, you may want to do something similar.
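
If it turns out to be the same class of problem, overriding a couple of
defaults on the test container may be enough, e.g. (MAX_HEAP_SIZE and
HEAP_NEWSIZE are the knobs cassandra-env.sh reads from the environment; the
values here are guesses, tune to your test hardware):

docker run -e MAX_HEAP_SIZE=512M -e HEAP_NEWSIZE=128M cassandra:4.0-beta4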

On Wed, Mar 10, 2021 at 5:08 AM Attila Wind  wrote:

> Hi Guys,
>
> We are using dockerized Cassandra to run our integration tests.
> So far we were using the 4.0-beta2 docker image (
> https://hub.docker.com/layers/cassandra/library/cassandra/4.0-beta2/images/sha256-77aa30c8e82f0e761d1825ef7eb3adc34d063b009a634714af978268b71225a4?context=explore
> )
> Recently I tried to switch to the 4.0-beta4 docker but noticed a few
> problems...
>
>    - image starts much, much slower for me (4x more time to come up and
>    accept connections)
>    - unlike the beta2 version, beta4 barely survives all of our test
>    cases... typically it gets stuck and fails with timeouts at around 60-80%
>    of completed test cases
>
> Anyone similar / related experiences maybe?
>
> thanks
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
>

-- 
Erik Merkle
e. erik.mer...@datastax.com
w. www.datastax.com