deletions, it's safe to
>> only use CL.ONE and disable the read repair if we're never deleting
>> counters. (And, of course, if we did start deleting counters, we'd need to
>> revert those client and column family changes.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
nel settings
(typically trading page cache pollution against the number of I/Os).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
cas to even try to write to.
(Note though: Reads are a bit of a different story and if you want to
test behavior when nodes go down I suggest including that. See
CASSANDRA-2540 and CASSANDRA-3927.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
FWIW, we're using openjdk7 on most of our clusters. For those where we
are still on openjdk6, it's not because of an issue - just haven't
gotten to rolling out the upgrade yet.
We haven't had any issues that I recall with upgrading the JDK.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
On Oct 22, 2012 11:54 AM, "B. Todd Burruss" wrote:
>
> does "nodetool cleanup" perform a major compaction in the process of
> removing unwanted data?
No.
o build my own
> version of cassandra?
It's in the 1.1 branch; I don't remember if it went into a release
yet. If not, it'll be in the next 1.1.x release.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
- because you're creating a single sstable bigger
than would normally be created, and it takes more total disk space
before it becomes part of a compaction again.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
on
it's not, is likely that it depends on the situation. Further, even if
you do play the lottery and win - if you don't know *why* it worked,
how can you extrapolate the behavior of the system under slightly
changed workloads? It's very hard to black-box-test GC settings, which
is probably why GC tuning can be perceived as a useless game of
whack-a-mole.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
and cannot be safely retried. For this reason, Cassandra counters are
generally not useful if *strict* correctness is desired.
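To illustrate the ambiguity (a sketch only; "client.add" is a stand-in for
whatever counter increment call your driver exposes, not a specific API):

def increment_with_retry(client, key, delta, retries=3):
    for _ in range(retries):
        try:
            # Counter increment; it may time out after the cluster has
            # already applied it.
            client.add(key, column="hits", value=delta)
            return True
        except TimeoutError:
            # The increment may or may not have landed. Retrying risks
            # counting it twice; giving up risks losing it. Neither
            # choice preserves strict correctness.
            continue
    return False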
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
In either case, as far as I can tell (off the
top of my head), *some* counter increment is lost. The only way I can
see (again, off the top of my head) the resulting value being correct
is if the later increment (N2 in this case) somehow includes N1 as
well (e.g., because it was generated by first reading the current
counter value).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
t that
I know of, if you're on HotSpot, is to have the application behave in
such a way that it avoids the causes of unpredictable GC behavior by
being careful about its memory allocation and *retention* profile. For
the specific case of avoiding *ever* seeing a full GC, it gets even
more complex.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
't see how it would
since each sstable will effectively cover almost the entire range
(you're essentially spraying random tokens at it, unless clients
are writing data in md5 order).
(Maybe it's different for ordered partitioning though.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> Our full GCs are typically not very frequent. A few days or even weeks
> in between, depending on the cluster.
*PER NODE*, that is. On a cluster of hundreds of nodes, that's pretty
often (and all it takes is a single node).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
like to see how it is
> in action.
FWIW, J9's "balanced" collector is very similar to G1 in its design.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
c analysis). The only
question is how often. But given the lack of handling of such failure
modes, the effect on clients is huge. I recommend data reads by default
to mitigate this and a slew of other sources of problems (and for
counter increments, we're rolling out least-active-request routing
remembered set scanning costs
(driven by inter-region pointers). If you can avoid that, one might
hope to avoid full GCs altogether.
The jury is still out on my side; but like I said, I've seen promising
indications.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> Has anyone tried running 1.1.1 on Java 7?
Have been running jdk 1.7 on several clusters on 1.1 for a while now.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
This problem is not new to 1.1.
On Sep 6, 2012 5:51 AM, "Radim Kolar" wrote:
> i would migrate to 1.0 because 1.1 is highly unstable.
>
-obnoxiously iterate over all rows:
for row_id, row in your_column_family.get_range():
https://github.com/pycassa/pycassa
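A minimal, runnable sketch of the above (keyspace, column family and host
names are placeholders):

import pycassa

# Placeholder names; adjust to your cluster.
pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

# get_range() pages through the token range lazily rather than pulling
# every row into memory at once.
for row_key, columns in cf.get_range():
    print(row_key, columns)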
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> I think that was clear from your post. I don't see a problem with your
> process. Setting gc grace to 0 and forcing compaction should indeed
> return you to the smallest possible on-disk size.
(But may be unsafe as documented; can cause deleted data to pop back up, etc.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
sing compression) the Cassandra on-disk format is
not as compact as PostgreSQL's. For example, column names are duplicated
in each row, and the row key is stored twice (once in the index, once
in the data file).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
le requesting large amounts of data? Large or many columns (or
both), etc. Essentially all "working" data that your request touches
is allocated on the heap and contributes to allocation rate and ParNew
frequency.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
I can highly recommend Jolokia for providing an HTTP/JSON interface to
JMX (it can be run trivially in agent mode just by altering the JVM args):
http://www.jolokia.org/
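For example (a sketch; the port is Jolokia's default and the MBean is a
standard JVM one, used purely for illustration), once the agent is attached
you can read JMX attributes over plain HTTP:

import requests

# Jolokia's default agent port is 8778; adjust host/port to your deployment.
url = "http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage"
print(requests.get(url).json()["value"])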
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
re able to disable gossip (nodetool disablegossip) and make other nodes
stop sending requests to it. Disabling thrift would also be advised, or
even firewalling it prior to the restart.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
You're almost certainly using a client that doesn't set TCP_NODELAY on
the Thrift TCP socket. The Nagle algorithm is then enabled, leading to
roughly 200 ms of latency per request, and thus ~5 requests/second.
http://en.wikipedia.org/wiki/Nagle's_algorithm
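A sketch of the fix on the client side (a raw socket is shown for
illustration; where exactly you set the option depends on your Thrift
client library):

import socket

# Illustrative host/port; 9160 is the default Thrift port.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("cassandra-host", 9160))
# Disable Nagle so small Thrift frames are sent immediately instead of
# being delayed while waiting for an ACK.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)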
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
t failure domains (for reasons outlined
in 3810 above).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
CASSANDRA-3820
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
n 1.1, so the root cause remains
unknown as far as I can tell (I had previously hoped the root cause was
thread-unsafe shard merging, or one of the other counter-related
issues fixed during the 0.8 run).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> multithreaded_compaction: false
Set to true.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ng, but it
certainly should significantly decrease memory use.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ad, want to adjust target bloom filter false positive rates:
https://issues.apache.org/jira/browse/CASSANDRA-3497
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
sage because of long-running repairs retaining sstables and delaying
their unload/removal (index sampling/bloom filters filling your heap).
If it really only happens for leveled/snappy however, I don't know
what that might be caused by.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
lots of
writes, index sampling will be insignificant.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ould not have knowledge about topology and relative
latency other than what is driven by traffic, and I could imagine this
happening if read repair were turned completely off.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
0 +? Nowadays there is code to actively make the caches
smaller if Cassandra detects that you seem to be running low on heap.
Watch cassandra.log for messages to that effect (I don't remember the
exact message right now).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
tage pending backing up constantly. If on the other hand
these are batch jobs where throughput is the concern, it's not
relevant.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
gine unless you have sub-second
resolution, but it would still exhibit unevenness and have an effect on
latency.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
What is your total data size (nodetool info/nodetool ring) per node,
your heap size, and the amount of memory on the system?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
close
to maximum at all times, and pending racking up consistently. If
you're just close, you'll likely see spikes sometimes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
cache only caches the index positions in the data
file, and not the actual data. The key cache will only ever eliminate
the I/O that would have been required to look up the index entry; it
doesn't help eliminate the seek to get the data (but, as usual, the data
may still be in the operating system page cache).
in query latency is high.
That said, if you're seeing consistently bad latencies for a while,
where at other times you see consistently good latencies, that sounds
different but would hopefully be observable somehow.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
e node while it is being slow, and
observe. Figure out what the bottleneck is. iostat, top, nodetool
tpstats, nodetool netstats, nodetool compactionstats.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
'own' 0% (actually, 1/(2^128) :) ), and
> depending on your replication factor might have no data (if replication were
> 1).
It's also incorrect for rack awareness if your topology is such that
the rack awareness changes ownership (see
https://issues.apache.org/jira/brows
Again, the *schema* gets propagated and the keyspace will exist
everywhere. You should just have exactly zero data for the
keyspace in the DC without replicas.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
sstables on disk with data for the keyspace?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
It is expected that the schema is replicated everywhere, but *data*
won't be in the DC with 0 replicas.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ontinuously
happening. A good rule of thumb is that an individual node should be
able to handle your traffic while doing compaction; you don't want to
be in the position where you're only barely keeping up with the traffic,
and a node doing compaction then can't handle it.
rashed with data loss prior to
the data making it elsewhere.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
column wins. This
accomplishes the reads-see-writes invariant.
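As a toy illustration of the idea (not Cassandra's actual code; the
(value, timestamp) tuples are purely illustrative):

def reconcile(versions):
    # Each version is a (value, timestamp) pair; the highest timestamp wins.
    return max(versions, key=lambda v: v[1])

print(reconcile([("stale", 1000), ("fresh", 2000)]))  # -> ('fresh', 2000)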
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
. Have a good amount of margin. Less
so with leveled compaction than size tiered compaction, but still
important.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
y reach a steady state of disk usage. It only becomes a
problem if you're almost *entirely* full and are trying to delete data
in a panic.
How far away are you from entirely full? Are you just worried about
the future or are you about to run out of disk space right now?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ecially
in relation to memory size), it's not necessarily the best trade-off.
Consider the time it takes to do repairs, streaming, node start-up,
etc.
If it's only about CPU resources then bigger nodes probably make more
sense if the h/w is cost effective.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
LOCAL_QUORUM if it's only within
a DC).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
lessen promotion into old-gen).
Experiment on a single node, making sure you're not causing too much
disk I/O by stealing memory otherwise used by page cache. Once you
have something that works you might try slowly going back down.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
correspondingly
big single cluster.
It is probably more useful to try to select hardware such that you
have a greater number of smaller nodes, than it is to focus on node
count (although once you start reaching the "few hundreds" level
you're entering territory of less actual
icularly detailed tuning of GC issues is pretty
useless on 0.7 given the significant changes in 1.0. Don't even bother
spending time on this until you're on 1.0, unless this is about a
production cluster that you cannot upgrade for some reason.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to me there are hints on the node, for other
nodes, that contain writes to a deleted column family.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
o reproduce them. But
> offhand, I don't see any to throttle back the load created by the
> stress test.
I'm not aware of one built-in. It would be a useful patch IMO, to
allow setting a target rate.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
og4j-server.properties at the top) and the strategy (assuming you are
using NetworkTopologyStrategy) will log the selected endpoints, so you
can confirm that it is indeed picking the endpoints you think it should,
based on getendpoints.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
But
> this is not happening.
1 node among the ones in the replica set of your row has to be up.
> Will the read repair happen automatically even if I read and write using the
> consistency level ONE?
Yes, assuming it's turned on.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> If I change endpoint_snitch from SimpleSnitch to PropertyFileSnitch,
> does it require restart of cassandra on that node ?
Yes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
d
aren't latency critical, that's probably fine though.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
r 99% of all benchmarks ever published on the internet...
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
sing reasons.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> Compaction should delete empty rows once gc_grace_seconds is passed, right?
Yes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
rly documented in
> http://wiki.apache.org/cassandra/LargeDataSetConsiderations - that bloom
> filters + index sampling will be responsible for most memory used by node.
> Caching itself has minimal use on large data set used for OLAP.
I added some information at the end.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
r to heap capacity than regular compaction.
Also, consider tweaking compaction throughput settings to control the
rate of allocation generated during a compaction, even if you don't
need it for disk I/O purposes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
XXX - try XXX = number of CPU cores for
example in this case). Alternatively, a larger young gen to avoid so
much getting promoted during compaction.
But really, in short: The easiest fix is probably to increase the heap
size. I know this e-mail doesn't begin to explain details but it's
such a long story.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ng
compaction wouldn't really help anything other than avoiding a
fallback to full GC in the short term.
I suggest you describe exactly what the problem is that you have, and
why you think stopping compaction/repair is the appropriate solution.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
nable bloom filters will be committed.
That is a good reason for both to be configurable IMO.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
this an option now), but a 1% false
positive hit rate will be completely unacceptable in some
circumstances. In others, perfectly acceptable due to the decrease in
memory use and few reads.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
s to
sstables will be higher than the number of reads to the CF (unless you
happen to have exactly one sstable or no rows ever span sstables).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
that
you're seeing without GC correlation? Off the top of my head, that seems
very unexpected (assuming a non-saturated system) and would definitely
invite investigation, IMO.
If you're willing to start iterating with the source code, I'd start
bisecting down the call stack and see where it's happening.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> One other thing to consider is are you creating a few very large rows ? You
> can check the min, max and average row size using nodetool cfstats.
Normally I agree, but assuming the two-node cluster has RF 2 it would
actually not matter ;)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
n happens to be in on the given node (I am assuming you're
not using leveled compaction). That is in addition to any imbalance
that might result from your population of data in the cluster.
Running repair can affect the live size, but *lack* of repair won't
cause a live size divergence
l, and then see whether or not you can sustain
that in terms of old-gen.
Start with this in any case: Run Cassandra with -XX:+PrintGC
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
"occupancy") of CMS to a
lower percentage, making the concurrent mark phase start earlier.
* Increase heap size significantly (probably not necessary based on
your graph, but for good measure).
If it then goes away, report back and we can perhaps figure out
details. There are other things
GC log around the time of the pause).
Your graph looks very unusual for CMS. It's possible that
everything is otherwise as it should be and CMS is simply kicking in too
late, but I am kind of skeptical of that given the extremely smooth look
of your graph.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
s best, but it's polling. Unfortunately the JDK
provides no way to properly monitor for GC events within the Java
application. The GC inspector can miss a GC.
Also, the GC inspector only tells you time + type of GC; a GC log will
provide all sorts of details.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
until suddenly snapping back to 0 again once compactions catch up.
Whether or not non-zero is a problem depends on the Cassandra version,
how many concurrent compactors you are running, and your column
families/data sizes/flushing speeds etc. (Sorry, kind of a long story)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ractively "wait for it", I suggest something as simple as
fireing up an top + iostat for each host and have them on the screen
at the same time, and look for what happens when you see this again.
If the problem is fallback to full GC for example, the affected nodes
should be
nd the negative effects increase as you have
higher demands for low latency on other traffic to the cluster.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
eneral, you will see a
magnification by a factor of RF on the local statistics (in aggregate)
relative to the StorageProxy stats.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ransactions
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ses RHEL 6.1 specifically? I mean I can say that I've
run Cassandra on Debian Squeeze in production, but that doesn't really
help you ;)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
quential I/O
asynchronously.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ggestions (apart from spreading the load on more nodes).
>
> Cluster is 5 node, BOP, RF=3, AMD opteron 4174 CPU (6 x 2.3 Ghz cores),
> Gigabit ethernet, RAID-0 SATA2 disks
For starters, what *is* the throughput? How many counter mutations are
you submitting per second?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
minating
them entirely may not be possible).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
NLY doing these queries, that's not a problem per se. But if you are
also expecting other requests to have low latency, then you want to
avoid it.
In general, batching is good - but don't overdo it, especially for
reads, and especially if you're going to disk for the workload.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ld all just sit there and work
without intervention.
It's a pretty big ticket though and not something I'm gonna be working
on in my spare time, so I don't know whether or when I would actually
work on that ticket (depends on priorities). I have the ideas but I
can't promise to fix it :)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
ks - and figure out what the
most cost-effective solution is.
Note that if you're bottlenecking on disk I/O, it's not surprising at
all that repairing ~ 100 gigs of data takes more than 24 hours.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
owse/CASSANDRA-3483 is done. The
patch attached to that ticket should work for 0.8.6 I suspect (but no
guarantees). This also assumes you have no reads running against the
cluster.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
er at all, outside
> the
> maintenance like the repair.
OK. So what I'm getting at, then, is that there may be real, legitimate
connectivity problems that you aren't noticing in any other way, since
you don't have active traffic to the cluster.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
allow you to
rule that out (or not).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Filed https://issues.apache.org/jira/browse/CASSANDRA-3569 to fix it
so that streams don't die due to conviction.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
e exception you're seeing should be
indicative that it really was considered Down by the node. You might
grep the log for references to the node in question (UP or DOWN) to
confirm. The question is why, though. I would check if the node has
maybe automatically restarted, or went into a full GC, etc.
the two "old" nodes affected by decommissioning node N.
(Unless I'm tripping myself up somewhere now...)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
completely disk bound, and that'll show up as a
huge amount of pending ReadStage. "iostat -x -k 1" should confirm it.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> No it's not just the cli tool, our app has the same issue coming back with
> read issues.
You are not supposed to be able to read it. But you should be getting
a proper error, not an empty result.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)