As a side note, be aware that running with DEBUG logging enabled can make
your cluster run a full order of magnitude slower.
> Ah, must be the status check that I set up. Thanks!
It
With a very small amount of memory, the Cassandra process may be getting
killed by the Linux OOM killer, which should result in a log message to the
kernel logs. See
locate the error if it exists.
What version of Cassandra were you testing with?
> The CLI sometimes gets only 100 results (even though there are more) - and
> sometimes gets all the results, even when there are more than 100!
Raj: the super column indexing is a longstanding issue that we've been
considering recently, and would like to fix. See
> Not that I'm aware of. There are several other decent alternatives to
> And I notice in 0.7 roadmap there is a feature called "vector clock
The original plan was to implement vector clocks for Cassandra, but
Cassandra's data model actually provides an alternative solution that we'd
like to start recommending. If you know that you will be experiencing
As mentioned in the article you linked, index creation happens
asynchronously: when you perform the schema update call to add an index, the
index starts building in the background, and will not be completely valid
until it finishes building. I believe there is a JMX call to check the
> I think tuning of Cassandra is overly complex, and even with a single
Also note that an improved and compressible file format has been in the
works for a while now.
I am endlessly optimistic that it will make it into the 'next' version; in
particular, the current hope is 0.8
When the destination node fails to open the streamed SSTable, we assume it
was corrupted during transfer, and retry the stream. Independent of the
exception posted above, it is a problem that the failed transfers were not
cleaned up.
How many of the data files are marked as -tmp-?
Our intention was that if you wanted to add another permission like "update"
(a subset of "write") then you would return it from the method as part of
the EnumSet for that resource. I would see how much trouble it
would be to add a new Permission value for "update".
Not only does the type need to make sense, but it also needs to sort in
exactly the same order as the previous type did... in which case there would
be no reason to change it?
We should probably just say "no, you cannot do this", and explicitly prevent
The expired columns were converted into tombstones, which will live for the
GC timeout. The "empty" row will be cleaned up when those tombstones are
Returning the empty row is unfortunate... we'd love to find a more
appropriate solution that might not involve endless scanning.
The secondary indexes in 0.7.0 (type KEYS) are stored internally in a column
family, and are kept synchronized with the base data via locking on a local
node, meaning they are always consistent on the local node. Eventual
> Does it also mean that the whole row will be deserialized when a query
> just for one column?
No, it does not mean that: at most column_index_size_in_kb will be read to
read a single column, independent of where that column is in the row.
_But_, vote for if
you'd like to be able to perform this type of query easily*. Binned bitmap
indexes can perform compound range queries extremely quickly.
* Assuming that your data isn't extremely volatile, in which case those
I would like to continue to support super columns, but to slowly convert
them into "compound column names", since that is all they really are.
> I've found super column families quite useful when using
But, the reason that it isn't safe to say that we are a strongly consistent
store is that if 2 of your 3 replicas were to die and come back with no
data, QUORUM might return the wrong result.
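A toy model of that failure case, as a sketch only: replicas are plain dicts, and the helper names (`quorum`, `write`, `read`) are made up for illustration and are not Cassandra APIs.

```python
# Toy model: each replica is a dict of key -> (timestamp, value).
# The names here are hypothetical helpers, not Cassandra code.

def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1

def write(replicas, targets, key, value, timestamp):
    """Store the value on every replica named in `targets`."""
    for name in targets:
        replicas[name][key] = (timestamp, value)

def read(replicas, targets, key):
    """Return the newest value seen on the replicas in `targets`, or None."""
    found = [replicas[name][key] for name in targets if key in replicas[name]]
    return max(found)[1] if found else None

replicas = {"a": {}, "b": {}, "c": {}}
write(replicas, ["a", "b", "c"], "k", "v1", timestamp=1)

# Replicas a and b die and come back with no data.
replicas["a"] = {}
replicas["b"] = {}

# A QUORUM read (2 of 3) that happens to hit a and b misses the write,
# while one that includes c still sees it.
print(read(replicas, ["a", "b"], "k"))
print(read(replicas, ["b", "c"], "k"))
```

The first read returns nothing even though the write was acknowledged by every replica, which is the wrong result described above.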
A requirement of a strongly consistent store is that replicas cannot begin
I expect that this problem was due to : I'll make noise to
try and get it released soon as 0.7.3
> Thanks, Shimi. I'll keep you posted if we make progress. Riptano is working
In practice, local secondary indexes scale to {RF * the limit of a single
machine} for -low cardinality- values (ex: users living in a certain state)
since the first node is likely to be able to answer your question. This also
means they are good for performing filtering for analytics.
The comment in the example config file next to that setting explains it more
fully, but something like 16 * number of drives is a reasonable setting for
readers. Writers should be a multiple of the number of cores.
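As a rough worked example of that rule of thumb: the function name and the writers-per-core multiplier of 8 below are assumptions for illustration, so check the comment in your own config file for the recommended multiple.

```python
def suggested_concurrency(num_drives, num_cores, writers_per_core=8):
    """Apply the rule of thumb: readers = 16 * drives, writers = k * cores.

    `writers_per_core` is an assumed multiplier, not a documented default.
    """
    return {
        "readers": 16 * num_drives,
        "writers": writers_per_core * num_cores,
    }

# A node with 4 data drives and 8 cores.
print(suggested_concurrency(num_drives=4, num_cores=8))
```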
The storage proxy latencies are the primary metric: in particular, the
latency histograms show the distribution of query times.
> What are the key things to monitor while running a stress test? There is
If an SSTable contains an update for a row (row, not just column), we need
to read from it. See #1608 for some of the ideas that have been floated on
how to improve this situation: the core ones are 1. partitioning local data
so that the number of files involved in a read is smaller, 2. adding
Sorry, I meant to say #2319:
> If an SSTable contains an update for a row (row, not just column), we need
> to read from it. See #1608 for some of the ideas that have been floated on
constant updates to rows will lower your performance until a solution to
#1608 is available.
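To make the cost concrete, here is a simplified sketch of a read that has to merge every SSTable holding part of a row. SSTables are modelled as dicts, and `read_row` is a hypothetical helper, not Cassandra's read path.

```python
def read_row(sstables, row_key):
    """Merge one row across SSTables; the newest timestamp wins per column.

    Each SSTable is modelled as {row_key: {column_name: (timestamp, value)}}.
    Returns the merged columns and the number of SSTables that were read.
    """
    merged = {}
    sstables_read = 0
    for sstable in sstables:
        columns = sstable.get(row_key)
        if columns is None:
            continue  # this file has no update for the row, so skip it
        sstables_read += 1
        for name, (timestamp, value) in columns.items():
            if name not in merged or timestamp > merged[name][0]:
                merged[name] = (timestamp, value)
    values = {name: value for name, (_, value) in merged.items()}
    return values, sstables_read

sstables = [
    {"row1": {"a": (1, "old")}},
    {"row2": {"a": (1, "other")}},
    {"row1": {"a": (2, "new"), "b": (1, "x")}},
]
print(read_row(sstables, "row1"))
```

A row that is updated constantly ends up spread across many files, so `sstables_read` grows until compaction merges them.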
> Sorry, I meant to say #2319:
The row index is an index of the columns stored in a particular row: it is
only written when a row gets larger than column_index_size_in_kb (see your
config file). The sstable index is currently an index of the keys stored in
an sstable, but #2319 proposes to merge the sstable and row indexes.
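A minimal sketch of the idea behind the row index, assuming blocks are sized by column count instead of by kilobytes, and using made-up helper names:

```python
import bisect

def build_row_index(columns, block_size):
    """Split sorted (name, value) columns into blocks and index each block
    by its first column name. `block_size` is a column count here, standing
    in for column_index_size_in_kb."""
    blocks = [columns[i:i + block_size]
              for i in range(0, len(columns), block_size)]
    index = [block[0][0] for block in blocks]
    return index, blocks

def read_column(index, blocks, name):
    """Find one column by scanning only the block that could contain it."""
    pos = bisect.bisect_right(index, name) - 1
    if pos < 0:
        return None  # name sorts before the first column in the row
    for column_name, value in blocks[pos]:
        if column_name == name:
            return value
    return None

columns = [("c%03d" % i, i) for i in range(10)]
index, blocks = build_row_index(columns, block_size=4)
print(read_column(index, blocks, "c007"))
```

However wide the row is, a single-column read touches at most one block, which is why the whole row does not need to be deserialized.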
Readonly Compactions are used to hash column families for . Roger's link refers to anticompaction specifically.
Hey Ed,
I've been working on a similar approach for arbitrarily nested/compound column
names in #998. See:
The goal is to provide native support and potentially (in the very long term),
API support for
Ian: I think that as get_range_slice gets faster, the approach that Mark was
heading toward may be considerably more efficient than reading the old value in
the OutputFormat.
Mark: Reading all of the data you want to update out of Cassandra using the
InputFormat, merging it with (tagged) new da
Your IPartitioner implementation decides how the row keys are sorted: see . You need to
be using one of the OrderPreservingPartitioners if you'd like a reasonable
order for the keys.
You're right, it should be private. But... I don't think it is worth opening a
ticket for.
I think that it is 100% ideal: it's what I've been working on implementing in
#674, #847 and #998. I'm hoping to post a large patchset and docs this week,
and I'm aiming to get it committed for 0.8.
The work I've been doing doesn't touch the user interface: it only deals with
The Hadoop integration (as demonstrated by contrib/word_count) is locality
aware: it begins by querying Cassandra to generate locality aware splits, and
when the hostnames match up between the Hadoop and Cassandra clusters, the data
can be mapped locally.
A Cassandra OutputFormat was recently contributed, but I haven't had a chance
to review it. Any feedback you can give would be awesome:
Also, when you are testing trunk, please remember to read NEWS.txt, as things
change frequently.
50% of 0 will be rounded up to 1.
See for some background
here: I was just about to start working on this one, but it won't make it in
until 0.7.
Did you watch in the logs to confirm that repair had actually finished? The
`nodetool repair` call is not blocking before 0.6.3 (unreleased): see
> Did you watch in the logs to confirm that repair had actually finished? The
A "major" compaction is any compaction that sees all of the sstables for a
column family. In the context of the method you edited, that means that all of
the SSTables fall into a single bucket, and can be compacted together.
Hey Dave,
This won't work out of the box, but it should be relatively easy to fix.
You could implement a TextColumnFamilyInputFormat that wraps
ColumnFamilyInputFormat and converts the data structures it outputs to
JSON/TSV/CSV.
If you have time to work on this, there is an open ticket:
read nodes based o
Could we conditionally use an MD5 request only if a node was in a different
zone/datacenter according to the replication strategy? Presumably the bandwidth
usage within a datacenter isn't a concern as much as latency.
Cassandra has a very high constant per-row overhead at the moment of around 40
bytes. Additionally, there is around 12 bytes of overhead per column. Finally,
column names are repeated for each row.
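A back-of-the-envelope estimate using those approximate figures. The function is illustrative only, and it ignores the row key and any index or bloom filter space.

```python
ROW_OVERHEAD_BYTES = 40      # approximate per-row overhead quoted above
COLUMN_OVERHEAD_BYTES = 12   # approximate per-column overhead quoted above

def estimated_row_bytes(column_names, value_bytes_per_column):
    """Rough on-disk size of one row: fixed overheads plus the column
    names (stored again in every row) plus the values themselves."""
    per_column = sum(
        COLUMN_OVERHEAD_BYTES + len(name.encode("utf-8")) + value_bytes_per_column
        for name in column_names
    )
    return ROW_OVERHEAD_BYTES + per_column

# Three columns holding 8-byte values: 24 bytes of actual data.
print(estimated_row_bytes(["id", "name", "email"], value_bytes_per_column=8))
```

In this example the row comes to 111 bytes for 24 bytes of values, so short column names matter when rows are small.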
CASSANDRA-674 and CASSANDRA-1207 will help with these overheads, but they will
not be fixed until
The Thrift server is embedded in Cassandra, and starts by default. Look for
references to Thrift on:
Hello out there,
If you are running Cassandra 0.6.*, and are using Cassandra's authentication
(IAuthenticator/SimpleAuthenticator), I'd love to hear about it!
Hey Oren,
The Cloud Servers REST API returns a "hostId" for each server that indicates
which physical host you are on: I'm not sure if you can see it from the control
panel, but a quick curl session should get you the answer.
> How many physical client machines are running
One with 50 threads; it is remote from the cluster but within the same
DC in both cases. I also run the test with multiple clients and saw
similar results when summing the reqs/sec.
If you put 25 processes on each of the 2 machines, all you are testing is how
fast 50 processes can hit Cassandra... the point of using more machines is that
you can use more processes.
Presumably, for a single machine, there is some limit (K) to the number of
processes that will give you addit
Did you copy the data directories from one node to the others?
Can you determine approximately what revisions you were running before and
That error is coming from the frontend: the jars must also be on the local
classpath. Take a look at how contrib/pig/bin/pig_cassandra sets up
Needing to manually copy the jars to all of the nodes would mean that you
aren't applying the Pig 'register ;' command properly.
Hey Aaron,
We are thinking a lot about multi-tenancy, but features to support multiple
tenants on a cluster are only beginning to make their way into Cassandra. See for a short listing of features
that are being considered (including a mention of mem
See , or the Upgrading
section in NEWS.txt.
JNA is _not_ necessary to use Cassandra, but the server can perform some
operations more efficiently if JNA is in place.
Not sure what is causing the error you are seeing in the CLI though: those
statements appear to be valid.
Cassandra supports the recommended approach from:
For large numbers of items, skip + limit is extremely inefficient.
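The alternative to skip + limit is usually to page by range: remember the last key you saw and start the next page just after it. The sketch below assumes that is the approach meant here, and models the sorted keys as a plain list with hypothetical helper names.

```python
import bisect

def page_after(sorted_keys, last_key, limit):
    """Return up to `limit` keys that sort strictly after `last_key`.
    Pass last_key=None for the first page."""
    start = 0 if last_key is None else bisect.bisect_right(sorted_keys, last_key)
    return sorted_keys[start:start + limit]

def iterate_pages(sorted_keys, limit):
    """Yield consecutive pages, resuming from the last key of each page."""
    last_key = None
    while True:
        page = page_after(sorted_keys, last_key, limit)
        if not page:
            return
        yield page
        last_key = page[-1]

for page in iterate_pages(["a", "b", "c", "d", "e"], limit=2):
    print(page)
```

Each page costs a seek plus `limit` items, instead of re-reading and discarding everything that was skipped.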
Minor compactions will often be able to perform this garbage collection as well
in 0.6.6 and 0.7.0 due to a great optimization implemented by Sylvain:
Take a look at your particular implementation of
org.apache.cassandra.dht.IPartitioner: each partitioner creates tokens in a
different way, but all of them are straightforward.
Take a look at the get_indexed_slices method in the 0.7.0-beta Thrift interface.
Coool. Would you mind opening an Avro issue for that, or should I?
Hey JT,
I believe this issue should be fixed by CASSANDRA-1571... if you're able to
test that patch, it would be very helpful.
> Specifically I'm wondering if I could create a byte representation of the Long
> that would also be lexicographically ordered.
This is probably what you want to do, combined with the ByteOrderedPartitioner
in 0.7
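One common way to build such a representation, sketched here as an assumption about what would suit your keys: shift the signed 64-bit value into the unsigned range (equivalent to flipping the sign bit) and write it big-endian, so that byte order matches numeric order.

```python
import struct

SIGN_OFFSET = 1 << 63

def encode_long(value):
    """Encode a signed 64-bit integer as 8 bytes whose lexicographic order
    matches numeric order. Raises struct.error if value is out of range."""
    return struct.pack(">Q", value + SIGN_OFFSET)

def decode_long(data):
    """Inverse of encode_long."""
    return struct.unpack(">Q", data)[0] - SIGN_OFFSET

values = [256, -1, 0, 2 ** 40, -5, 255, 1]
by_bytes = sorted(values, key=encode_long)
print(by_bytes == sorted(values))
```

Without the offset, negative values would sort after positive ones, because their two's-complement encodings start with a set high bit.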
While the "adding virtual tokens/nodes to Cassandra" discussion is a good one,
there are a few factors that might delay (or remove?) the necessity of adding
that complexity:
* In Cassandra 0.7, removing load from a node is fairly cheap: a bounded number
of reads are used to determine which port
What column comparator/type are you using? Remember that if you are using
BytesType/UTF8Type, columns will be sorted lexicographically.
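A quick illustration of what lexicographic ordering does to numeric-looking column names, with zero-padding as one possible workaround (the helper names are made up for this example):

```python
def sort_as_bytes(names):
    """Sort names the way a byte-wise comparator would."""
    return sorted(name.encode("utf-8") for name in names)

def sort_zero_padded(names, width):
    """Zero-pad names to a fixed width first, so byte order matches
    numeric order."""
    return sorted(name.zfill(width).encode("utf-8") for name in names)

numeric_names = ["1", "2", "10", "9", "100"]
print(sort_as_bytes(numeric_names))        # [b'1', b'10', b'100', b'2', b'9']
print(sort_zero_padded(numeric_names, 3))  # [b'001', b'002', b'009', b'010', b'100']
```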
The actual MerkleTree itself is at org.apache.cassandra.utils.MerkleTree: it
has a reasonable number of tests in the MerkleTreeTest class, and Cassandra
uses a tree to store the hashes of a ColumnFamily in o.a.c.d.CompactionManager
via a o.a.c.s.AntiEntropyService "Validator".
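For anyone new to the technique, here is a generic Merkle tree sketch. It is not Cassandra's implementation: it uses SHA-256 and list positions where Cassandra hashes token ranges, and the function names are invented for the example.

```python
import hashlib

def build_tree(leaves):
    """Return the tree as a list of levels, from leaf hashes up to the root.
    `leaves` is a non-empty list of byte strings, one per data range."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [
            hashlib.sha256(b"".join(level[i:i + 2])).digest()
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels

def differing_leaves(tree_a, tree_b):
    """Walk down from the root, following only mismatched nodes, and return
    the indexes of leaves that differ. Both trees must have the same shape."""
    suspects = [0]
    for depth in range(len(tree_a) - 1, -1, -1):
        mismatched = [i for i in suspects
                      if tree_a[depth][i] != tree_b[depth][i]]
        if depth == 0:
            return mismatched
        child_count = len(tree_a[depth - 1])
        suspects = [c for i in mismatched
                    for c in (2 * i, 2 * i + 1) if c < child_count]
    return []

replica_a = [b"r0", b"r1", b"r2", b"r3", b"r4"]
replica_b = [b"r0", b"r1", b"r2", b"changed", b"r4"]
print(differing_leaves(build_tree(replica_a), build_tree(replica_b)))
```

Two replicas only need to exchange the hashes along mismatched branches to find which ranges are out of sync.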
Hey Aditya,
Would you mind attaching the last few hundred lines from before the exception
from the server log to this ticket: ?
At first glance, this appeared to be a very egregious bug, but the effect is
actually minimal: since the size of the buffer is deterministic based on the
size of the data, you will have equal amounts of excess/junk data for equal
rows. Combined with the fact that 0.6 doesn't reuse these buffers,
Is the server logging anything during the failed authentication?
On Fri, Nov 12, 2010 at 8:07 PM, Alaa Zubaidi wrote:
> using SimpleAuthenticator is not working with me in beta 3
> I am doing the following:
> · In Cassandra.yaml Set
> authenticator: org.apache.cassandra.auth.Simple
All write patterns should provide the same performance with Cassandra, since
all writes to disk occur sequentially. The only variance might be in the data
structure used for the Memtable (a concurrent skip list), but I expect that it
is quite stable.
If you have debug logs from the run, would you mind opening a JIRA describing
the problem?
It is much more likely that you always increase your cluster in size by a
certain large percentage. With a 10 node cluster, you are likely to add 5 nodes
at a time, and with a 100 node cluster you'll probably add 25 to 50 per batch.
replication factor == 1 means that there is only one copy of the data. And you
deleted it. Repair depends on the replication factor being greater than 1.
Ack... very sorry. I read the original message too quickly.
The fact that neither read-repair nor anti-entropy is working is suspicious
though. Do you think you could paste your config somewhere?
Eventually the new file format will make it in with #674, and we'll be able to
implement an option to skip corrupted data:
We're not ignoring this issue.
I also tried version 0.6, but the above error still exists.
Perhaps I will try the approach David Timothy suggested.
@Stu Hood: Have you implemented code for issue 808?
Thanks a lot for the support.
Please read the README in the contrib/word_count directory.
Code that uses Hadoop will look for mapred-site.xml, core-site.xml,
hdfs-site.xml etc on your CLASSPATH. If you add your Hadoop config directory to
CLASSPATH before running the script, Hadoop will use that configuration to
connect to your cluster.
ColumnFamilyInputFormat no longer uses the fat client API, and instead uses
Thrift. There are still some significant problems with the fat client, so it
shouldn't be used without a good understanding of those problems.
If you still want to use it, check out contrib/bmt_example, but I'd recommend
Subject: Re: Help with MapReduce
Where is the ColumnFamilyInputFormat that uses Thrift? I don't actually
have a preference about client, I just want to be consistent with
On Sun, Apr 18, 2010 at 5:37 PM, Stu Hood wrote:
> ColumnFa
If you used that snippet of code, all connections would go through the same
seed: the input code does additional work to determine which nodes are holding
particular key ranges, and then connects directly.
For outputting from Hadoop to Cassandra, you may want to consider using a Java
Were all of those super column writes going to the same row?
It isn't very well documented apparently, but if you are using 0.6, you can
look at the 'Authenticator' property in the default config for an explanation
of how to authenticate users.
With the SimpleAuthenticator implementation, there are properties files that
define your users and passwords, a
Your keys cannot be encoded as binary for OPP, since Cassandra will attempt
to decode them as UTF-8, meaning that they may not come back in the same format.
0.7 supports byte keys using the ByteOrderedPartitioner, and tokens are
specified using hex.
The indexes within rows are _not_ implemented with Lucene: there is a custom
index structure that allows for random access within a row. But, you should
probably read to
understand the current limitations of the file format, some of which are
