Looks like no major table version changes since 3.0, and a couple of minor
changes in 3.0.7/3.7 and 3.0.8/3.8:
https://github.com/apache/cassandra/blob/48a539142e9e318f9177ad8cec47819d1adc3df9/doc/source/architecture/storage_engine.rst
So, I suppose whether a revert is safe or not depends on whet
JVM GC tuning can be pretty complex, but the simplest solution to OOM is
probably switching to G1GC and feeding it a rather large heap.
Theoretically a smaller heap and carefully-tuned CMS collector is more
efficient, but CMS is kind of fragile and tuning it is more of a black art,
where you can ge
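As a rough sketch (assuming the 3.x layout where GC flags live in conf/jvm.options, and a box with RAM to spare), the G1 route looks something like:

    # comment out the CMS settings in conf/jvm.options, then add:
    -Xms16G
    -Xmx16G
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500

Exact heap size depends on the machine; the usual rule of thumb is to leave at least half of physical RAM for the page cache.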
I've run across this problem before - it seems like GNU tar interprets
changes in the link count as changes to the file, so if the file gets
compacted mid-backup it freaks out even if the file contents are
unchanged. I worked around it by just using bsdtar instead.
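Something along these lines (hypothetical paths; bsdtar ships with the libarchive package on most distros):

    bsdtar -czf /backups/my_ks-snapshot.tar.gz \
        /var/lib/cassandra/data/my_ks/*/snapshots/my_snapshot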
On Thu, May 24, 2018 at 6:08 AM
ut it's the best I could come up with. :-/
>
> Thanks Jeff & others for your responses.
>
> - Max
>
> On May 25, 2018, at 5:05pm, Elliott Sims wrote:
>
> I've run across this problem before - it seems like GNU tar interprets
> changes in the link count as changes t
I'd say for a large write-heavy workload like this, Cassandra is a pretty clear
winner over MongoDB. I agree with the commenters about understanding your
query patterns a bit better before choosing, though. Cassandra's queries
are a bit limited, and if you're loading all new data every day and
discard
Are you seeing significant issues in terms of performance? Increased
garbage collection, long pauses, or even OutOfMemory? Which garbage
collector are you using and with what settings/thresholds? Since the JVM's
garbage-collected, a bigger heap can mean a problem or it can just mean
"hasn't gott
It's possible that it's something more subtle, but keep in mind that
sstables don't include the schema. If you've made schema changes, you need
to apply/revert those first or C* probably doesn't know what to do with
those columns in the sstable.
On Sun, Jun 10, 2018 at 11:38 PM, wrote:
> Dear C
If this is data that expires after a certain amount of time, you probably
want to look into using TWCS and TTLs to minimize the number of tombstones.
Decreasing gc_grace_seconds then compacting will reduce the number of
tombstones, but at the cost of potentially resurrecting deleted data if the
ta
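As a sketch (table name and window size are made up; pick a window so you end up with a few dozen sstables across the full TTL):

    ALTER TABLE my_ks.events
      WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                         'compaction_window_unit': 'DAYS',
                         'compaction_window_size': 1}
      AND default_time_to_live = 604800;   -- 7 days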
Do you have an actual performance issue anywhere at the application level?
If not, I wouldn't spend too much time on it - load avg is a sort of odd
indirect metric that may or may not mean anything depending on the
situation.
On Fri, Jun 15, 2018 at 6:49 AM, Igor Leão wrote:
> Hi there,
>
> I ha
It depends a bit on which collector you're using, but that's fairly normal. Heap
grows for a while, then the JVM decides via a variety of metrics that it's
time to run a collection. G1GC is usually a bit steadier and less sawtooth
than the Concurrent Mark Sweep (CMS) collector, but if your heap's a lot bigger than neede
not quite yet to the
> heap size based on our partition sizes.
> All queries use cluster key, so I'm not accidentally reading a whole
> partition.
> The last place I'm looking - which maybe should be the first - is
> tombstones.
>
> sorry for the afternoon rant! thanks for
Among the hosts in a cluster? It depends on how much data you're trying to
read and write. In general, you're going to want a lot more bandwidth
among hosts in the cluster than you have external-facing. Otherwise things
like repairs and bootstrapping new nodes can get slow/difficult. To put it
You might have more luck trying to analyze at the Java level, either via a
(Java) stack dump and the "ttop" tool from Swiss Java Knife, or Cassandra
tools like "nodetool tpstats"
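Roughly (paths and pid are placeholders; sjk here is the standalone Swiss Java Knife jar):

    # per-thread CPU usage inside the JVM
    java -jar sjk.jar ttop -p <cassandra_pid> -o CPU -n 20
    # plain thread dump, to map hot threads to stacks
    jstack <cassandra_pid> > /tmp/cassandra-threads.txt
    # Cassandra's own thread-pool counters
    nodetool tpstats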
On Wed, Aug 1, 2018 at 2:08 AM, nokia ceph wrote:
> Hi,
>
> i'm having a 5 node cluster with cassandra 3.0.13.
>
> i
Deflate instead of LZ4 will probably give you somewhat better compression
at the cost of a lot of CPU. Larger chunk length might also help, but in
most cases you probably won't see much benefit above 64K (and it will
increase I/O load).
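For example (hypothetical table; existing sstables only pick up the new settings as they're rewritten, e.g. by compaction or upgradesstables):

    ALTER TABLE my_ks.my_table
      WITH compression = {'class': 'DeflateCompressor', 'chunk_length_in_kb': 64};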
On Wed, Aug 8, 2018 at 11:18 PM, Eunsu Kim wrote:
> Hi all
Since it's at a consistent time, maybe just look at it with iftop to see
where the traffic's going and what port it's coming from?
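e.g. something like (interface name is a guess):

    iftop -i eth0 -n -P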
On Fri, Aug 10, 2018 at 1:48 AM, Behnam B.Marandi <
behnam.b.mara...@gmail.com> wrote:
> I don't have any external process or planed repair in that time period.
> In
Might be a silly question, but did you run "nodetool upgradesstables" and
convert to the 3.0 format? Also, which 3.0? Newest, or an earlier 3.0.x?
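For reference, the rewrite is just (optionally scoped to a keyspace/table; my_ks/my_table are placeholders):

    nodetool upgradesstables
    nodetool upgradesstables my_ks my_table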
On Fri, Aug 10, 2018 at 3:05 PM, kooljava2
wrote:
> Hello,
>
> We recently upgraded C* from 2.1 to 3.0. After the upgrade we are seeing
> increase i
Step one is always to measure your bottlenecks. Are you spending a lot of
time compacting? Garbage collecting? Are you saturating CPU? Or just a
few cores? Or I/O? Are repairs using all your I/O? Are you just running
out of write threads?
On Wed, Aug 15, 2018 at 5:48 AM, Abdul Patel wrote:
Assuming this isn't an existing cluster, the easiest method is probably to
use logical "racks" to explicitly control which hosts have a full replica
of the data. With RF=3 and 3 "racks", each "rack" has one complete replica.
If you're not using the logical racks, I think the replicas are spread
ran
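Roughly, assuming GossipingPropertyFileSnitch (names are placeholders):

    # cassandra-rackdc.properties, per node
    dc=dc1
    rack=rack1    # rack1 / rack2 / rack3, matching your real failure domains

With NetworkTopologyStrategy and RF=3 in that DC, each logical rack then ends up holding one full copy of the data.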
>> Is this a one-time or occasional load or more frequently?
>>
>> Is the data located in the same physical data center as the cluster? (any
>> network latency?)
>>
>>
>>
>> On the client side, prepared statements and ExecuteAsync can really speed
>>
>> On Thursday, August 16, 2018, 12:02:55 AM PDT, Behnam B.Marandi <
>> behnam.b.mara...@gmail.com> wrote:
>>
>>
>> Actually I did. It seems this is a cross node traffic from one node to
>> port 7000 (storage_port) of the other node.
>>
>> On Sun, Au
At the time that Facebook chose HBase, Cassandra was drastically less
mature than it is now and I think the original creators had already left.
There were already various Hadoop variants running for data analytics etc.,
so there was plenty of operational and engineering experience around it. So,
prob
It's interesting and a bit surprising that 256 write threads isn't enough.
Even with a lot of cores, I'd expect you to be able to saturate CPU with
that many threads. I'd make sure you don't have other bottlenecks, like
GC, IOPs, network, or "microbursts" where your load is actually fluctuating
be
A few reasons I can think of offhand why your test setup might not see
problems from large readahead:
- Your sstables are <4MB, or your reads are typically <4MB from the end of the file
- Your queries tend to use the 4MB of data anyways
- Your dataset is small enough that most of it fits in the VM cache,
I'll second that - we had some weird inconsistent reads for a long time
that we finally tracked to a small number of clients with significant clock
skew. Make very sure all your client (not just C*) machines have
tightly-synced clocks.
On Fri, Oct 12, 2018 at 7:40 PM maitrayee shah
wrote:
> We
As far as I know, it's not possible to change it live. You have to create
a new "datacenter" with new hosts using the new num_tokens value, then
switch everything to use the new DC and tear down the old.
On Thu, Nov 1, 2018 at 6:16 PM Goutham reddy
wrote:
> Hi team,
> Can someone help me out I
I think you answered your own question, sort of.
When you expand a cluster, it copies the appropriate rows to the new
node(s) but doesn't automatically remove them from the old nodes. When you
ran cleanup on datacenter1, it cleared out those old extra copies. I would
suggest running a repair fir
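Roughly, per node (keyspace name is a placeholder):

    nodetool repair -full my_ks     # make sure all replicas are consistent first
    nodetool cleanup my_ks          # then drop the rows the node no longer owns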
8, 2018, 8:05 PM Eunsu Kim wrote:
> Thank you for your response.
>
> I will run repair from datacenter2 with your advice. Do I have to run
> repair on every node in datacenter2?
>
> There is no snapshot when checked with nodetool listsnapshots.
>
> Thank you.
>
> On 29 Nov 2018, a
I would strongly suggest you consider an upgrade to 3.11.x. I found it
decreased space needed by about 30% in addition to significantly lowering
GC.
As a first step, though, why not just revert to CMS for now if that was
working ok for you? Then you can convert one host for diagnosis/tuning so
t
When a snapshot is taken, it includes a "schema.cql" file. That should be
sufficient to restore whatever you need to restore. I'd argue that neither
automatically resurrecting a dropped table nor silently failing to restore
it is a good behavior, so it's not unreasonable to have the user re-creat
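In practice that's just re-running the file from the snapshot directory before copying the data files back (path is illustrative; the table directory name includes its internal id):

    cqlsh -f /var/lib/cassandra/data/my_ks/my_table-<table_id>/snapshots/my_snapshot/schema.cql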
I use G1, and I think it's actually the default now for newer Cassandra
versions. For G1, I've done very little custom config/tuning. I increased
heap to 16GB (out of 64GB physical), but most of the rest is at or near
default. For the most part, it's been "feed it more RAM, and it works"
compare
1. Do the same people where you work operate the cluster and write
the code to develop the application?
Mostly. Ops vs dev, although there's some overlap
2. Do you have a metrics stack that allows you to see graphs of
various metrics with all the nodes displayed together?
Yes, Prom
It's not really something that can be easily calculated based on write
rate, but more something you have to find empirically and adjust
periodically.
Generally speaking, I'd start by running "nodetool gcstats" or similar and
just see what the GC pause stats look like. If it's not pausing much or
f
It may also be worth upgrading to Cassandra 3.11.4. There are some changes
in 3.6+ that significantly reduce heap pressure from very large partitions.
On Mon, Aug 12, 2019 at 9:13 AM Gabriel Giussi
wrote:
> I've found a huge partition (~9GB) in my Cassandra cluster because I'm
> losing 3 nodes rec
A container of some sort gives you better isolation and less risk of a
mistake that could cause the instances to conflict in some way. Might be
better for balancing resources between them as well, though using cgroups
directly can also accomplish that.
On Fri, Sep 20, 2019, 8:27 AM Nitan Kainth
Datastax might be a better resource for this. This mailing list is pretty
good about questions that apply to DSE and Apache Cassandra, but the SOLR
integration is pretty specific to DSE.
On Wed, Sep 25, 2019 at 7:15 PM kumar bharath
wrote:
> Hi All,
>
> We are having a 6 node cluster with two d
There's a concurrent_compactors parameter in cassandra.yaml that does
exactly what the name says. You may also find
compaction_throughput_mb_per_sec useful.
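For example (numbers are arbitrary; the yaml change needs a restart, while setcompactionthroughput takes effect immediately):

    # cassandra.yaml
    concurrent_compactors: 4
    compaction_throughput_mb_per_sec: 64

    # or live, until the next restart:
    nodetool setcompactionthroughput 64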
On Tue, Oct 1, 2019 at 8:16 AM Matthias Pfau
wrote:
> Hi there,
> we recently upgraded from 2.2 to 3.11.4.
>
> Unfortunately, we are runnin
The tar error is because tar also looks for metadata changes. In this
case, it's the refcount that's changing and causing the error. I just
switched to using bsdtar instead as a workaround.
On Tue, Oct 1, 2019, 5:37 PM James A. Robinson
wrote:
> Hi folks,
>
>
> I took a nodetool snapshot of a
Based on my experiences, if you have a new enough kernel I'd strongly
suggest switching the TCP congestion control algorithm to BBR. I've found the rest
tend to be extremely sensitive to even small amounts of packet loss among
cluster members where BBR holds up well.
High ulimits for basically everything
h higher latency).
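For reference, switching to BBR is just a couple of sysctls on a 4.9+ kernel (plus the fq qdisc it's usually paired with):

    sysctl -w net.core.default_qdisc=fq
    sysctl -w net.ipv4.tcp_congestion_control=bbr
    # confirm:
    sysctl net.ipv4.tcp_congestion_control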
On Mon, Oct 21, 2019 at 1:53 PM Sergio wrote:
> Thanks Elliott!
>
> How do you know if there is too much RAM used for those settings?
>
> Which metrics do you keep track of?
>
> What would you recommend instead?
>
> Best,
>
> Sergio
>
>
On the systems side of things, I've found that using the new BBR TCP
congestion algorithm results in far better behavior in cases of low to
moderate packet loss compared to any of the older strategies. It can't fix
totally broken, but it takes good advantage of "usable but lossy". 0.5-2%
loss wou
In addition to extra space, queries can potentially be more expensive
because more dead rows and tombstones will need to be scanned. How much of
a difference this makes will depend drastically on the schema and access
pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.
On
Async-profiler (https://github.com/jvm-profiling-tools/async-profiler)
flamegraphs can also be a really good tool to figure out the exact
callgraph that's leading to the futex_wait, both in and out of the JVM.
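Something like the following, assuming a 2.x async-profiler build unpacked under /opt/async-profiler (pid is a placeholder; wall-clock profiling is often more useful than cpu when chasing futex_wait):

    /opt/async-profiler/profiler.sh -d 60 -e wall -f /tmp/cassandra-flame.html <cassandra_pid>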
I definitely saw a noticeable decrease in GC activity somewhere between
3.11.0 and 3.11.4. I'm not sure which change did it, but I can't think of
any good reason to use 3.11.0 vs 3.11.6.
I would enable and look through GC logs (or just the slow-GC entries in the
default log) to see if the problem
If you're upgrading the whole cluster, I'd recommend going ahead and
upgrading all the way to 3.11.6 if possible. In my experience it's been
noticeably faster, more reliable, and easier to manage compared to 3.0.x.
On Thu, Apr 16, 2020 at 6:37 PM Ashika Umagiliya
wrote:
> Thank you for the clar
The short answer is that CQL isn't SQL. It looks a bit like it, but the
structure of the data is totally different. Essentially (ignoring
secondary indexes, which have some issues in practice and I think are
generally not recommended) the only way to look the data up is by the
partition key. Any
There's also a slightly older mailing list discussion on this subject that
goes into detail on this sort of strategy:
https://www.mail-archive.com/user@cassandra.apache.org/msg60006.html
I've been approximately following it, repeating steps 3-6 for the first
host in each "rack" (replica, since I hav
The Cassandra documentation doesn't require IPs to be unique among members
of a cluster, because it's not a Cassandra limitation. Hosts that want to
communicate amongst themselves over the network need non-conflicting IPs,
regardless of application.
On Wed, Jun 24, 2020 at 5:09 AM manish khandelw
I've found there to be some behavior differences in practice as well going
from 2.2 to 3.11 with a high token count, but all differences for the
better. 3.x seems noticeably less likely to crater or GC-thrash during
repairs compared to 2.x, probably due to the sum of small changes rather
than any
You want to look for full or long GCs in the logs, as well as how much
total time it's spending on GCing as a percentage. Probably more the
latter, since you're not seeing long pauses with one core pegged and the
rest idle. G1 handles oversized heaps well, so it's worth bumping to
20-27GB just to
Tracing fully on rather than sampling will definitely add substantial load,
even with shorter TTLs. That's a lot of extra writes.
If it's just on for specific sessions, or is enabled but with low sampling,
that's not bad in terms of load.
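For the sampled version, it's just (0.001 here is an arbitrary example rate):

    nodetool settraceprobability 0.001
    # and to turn it back off entirely:
    nodetool settraceprobability 0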
On Mon, Nov 16, 2020 at 6:25 AM Shalom Sagges
wrote:
>
Is the heap larger on the M5.4x instance?
Are you sure it's Cassandra generating the read traffic vs just evicting
files read by other systems?
In general, I'd call "more RAM means fewer drive reads" a very expected
result regardless of the details, especially when it's the difference
between fitt
Are you running with RF=3 and QUORUM on both read and write?
If so, I think as long as your fill job reports errors and retries you can
probably get away without repairing.
You can also hedge your bets by doing the data load with ALL, though of
course that has an availability tradeoff.
Personally,
At least by default, Cassandra has pretty short timeouts. I don't know of
a way to kill an in-flight query, but by the time you did it would have
timed out anyways. I don't know of any way to stop it from repeating other
than tracking down the source and stopping it.
On Wed, Jan 6, 2021, 5:41 PM
1% packet loss can definitely lead to drops. At higher speeds, that's
enough to limit TCP throughput to the point that cross-node communication
can't keep up. TCP_BBR will do better than other strategies at maintaining
high throughput despite single-digit packet loss, but you'll also want to
trac
The main downside I see is that you're hitting a less-tested codepath. I
think very few installations have compression disabled today.
On Mon, Jan 25, 2021 at 7:06 AM Lapo Luchini wrote:
> Hi,
> I'm using a fairly standard install of Cassandra 3.11 on FreeBSD
> 12, by default filesystem is
To start with, maybe update to beta4. There's an absolute massive list of
fixes since alpha4. I don't think the alphas are expected to be in a
usable/low-bug state necessarily, where beta4 is approaching RC status.
On Tue, Jan 26, 2021, 10:44 PM Attila Wind wrote:
> Hey All,
>
> I'm coming bac
To start, I'd try to figure out what your slowdown is. Surely GCP has far,
far more than 17Mbps available.
You don't want to cut it close on this, because for stuff like repairs,
rebuilds, interruptions, etc you'll want to be able to catch up and not
just keep up.
Generally speaking, Cassandra def
I'm not too familiar with the details on what's happened more recently, but
I do remember that while Rocksandra was very favorably compared to
Cassandra 2.x, the improvements looked fairly similar in nature and
magnitude to what Cassandra got from the move to the 3.x sstable format and
increased us
I'm a big fan of this one about LWTs:
https://www.youtube.com/watch?v=wcxQM3ZN20c
Not only if you want to understand LWTs, but also to get a better
understanding of the sometimes-unintuitive consistency promises made and
not made for non-LWT queries.
On Tue, Mar 16, 2021 at 11:53 PM wrote:
> I k
I'm not sure I'd suggest building a single DIY Backblaze pod. The SATA
port multipliers are a pain both from a supply chain and systems management
perspective. Can be worth it when you're amortizing that across a lot of
servers and can exert some leverage over wholesale suppliers, but less so
for
As more general advice, I'd strongly encourage you to update to 3.11.x from
2.2.8. My personal experience is that it's significantly faster and more
space-efficient, and the garbage collection behavior under pressure is
drastically better. There's also improved tooling for diagnosing
performance
Your partition key determines your partition size. Reducing retention
sounds like it would help some in your case, but really you'd have to split
it up somehow. If it fits your query pattern, you could potentially have a
compound key of userid+datetime, or some other time-based split. You could
Shouldn't cause GCs.
You can usually think of heap memory separately from the rest. It's
already allocated as far as the OS is concerned, and it doesn't know
anything about GC going on inside of that allocation. You can set
"-XX:+AlwaysPreTouch" to make sure it's physically allocated on startup.
Depends on your availability requirements, but in general I'd say if you're
going with N replicas, you'd want N failure domains (where one blade
chassis is a failure domain).
On Tue, Aug 10, 2021 at 11:16 PM Erick Ramirez
wrote:
> That's 430TB of eggs in the one 4U basket so consider that agains
Won't option 2 in that list potentially cause some pretty severe load
imbalance in most cases? The last node with 256 tokens will end up with
16x as much data on it as the 16 token nodes, right?
You'd have to mitigate it either by adding 16 new nodes for every one you
replace except the last one,
Ansible here as well with a similar setup. A play at the end of the
playbook that waits until all nodes in the cluster are "UN" before moving
on to the next node to change.
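A rough sketch of that wait task (module and parameter names as in current Ansible; the awk is just counting node status lines that aren't UN):

    - name: Wait until every node reports Up/Normal
      ansible.builtin.shell: nodetool status | awk '/^[UD][NLJM] / && $1 != "UN"' | wc -l
      register: not_un
      changed_when: false
      until: not_un.stdout | int == 0
      retries: 60
      delay: 10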
On Mon, Oct 18, 2021 at 10:01 AM vytenis silgalis
wrote:
> Yep, also use Ansible with configs living in git here.
>
> On F
CMS has a higher risk of a long stop-the-world full GC that will cause a
burst of timeouts, but if you're not getting that or don't mind if it
happens now and then CMS is probably the way to go. It's generally
lower-overhead than G1. If you really don't care about latency it might
even be worth
More tokens: better data distribution, more expensive repairs, higher
probability of a multi-host outage taking some data offline and affecting
availability.
I think with >100 nodes the repair times and availability improvements make
a strong case for 16 tokens even though it means you'll need mo
I think this has a much simpler answer: GNU tar interprets inode changes
as "changes" as well as block contents. This includes the hardlink count.
I actually ended up working around it by using bsdtar, which doesn't
interpret hardlink count changes as a change to be concerned about.
On Tue, Mar
In terms of turning it into Ansible, it's going to depend a lot on how you
manage the physical layer as well as replication/consistency. Currently, I
just use groups per "rack". If you have an API-accessible CMDB you could
probably pull the physical location from there and translate that to
rack/
If you set a different num_tokens value for new hosts (the value should
never be changed on an existing host), the amount of data moved to that
host will be proportional to the num_tokens value. So, if the new hosts
are set to 32 when they're added to the cluster, those hosts will get twice
as muc
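In cassandra.yaml terms, that's just setting it on the new hosts before they bootstrap, e.g.:

    num_tokens: 32
    # never change this on a host that has already joined the ring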
fect
> efficiency if the token figure were the same across all nodes?
>
>
>
> *From:* Elliott Sims
> *Sent:* Thursday, June 16, 2022 12:24 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Configuration for new(expanding) cluster and new admins.
>
am just
> concerned that the doc says it is *not recommended for clusters over 50
> nodes*.
>
> 16
>
> Best for heavily elastic clusters which expand and shrink regularly, but
> may have availability issues with larger clusters. Not recommended for
> clusters over 50 nodes.
>
If multiple things are dying under load, you'll want to check "dmesg" and
see if the oom-killer is getting triggered. Something like "atop" can be
good for figuring out what was using all of the memory when it was
triggered if the kernel logs don't have enough info.
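For example (atop log path and date vary by distro and setup):

    dmesg -T | grep -iE 'out of memory|oom-killer'
    # replay system-wide memory usage around the incident:
    atop -r /var/log/atop/atop_20221215 -b 12:30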
On Thu, Dec 15, 2022 at 12:41
Consistently 200ms, during the back-and-forth negotiation rather than the
handshake? That sounds suspiciously like Nagle interacting with Delayed
ACK.
On Wed, Jan 11, 2023 at 8:41 AM MyWorld wrote:
> Hi all,
> We are facing a connection latency of 200ms between API server and db
> server during
For dealing with allocate_tokens_for_keyspace in datacenter migrations,
I've just created a dummy keyspace in the new DC with the desired topology,
then removed it once everything's done.
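Concretely, something like (keyspace and DC names are placeholders):

    CREATE KEYSPACE token_alloc_dummy
      WITH replication = {'class': 'NetworkTopologyStrategy', 'new_dc': 3};
    -- point allocate_tokens_for_keyspace at token_alloc_dummy in cassandra.yaml
    -- on the new-DC nodes, bootstrap them, then:
    DROP KEYSPACE token_alloc_dummy;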
On Mon, Jan 30, 2023 at 3:36 PM Doug Whitfield
wrote:
> Hi folks,
>
> In our 3.11 deployments we are using t
A quick search shows SLES 15 provides Java 11 (java-11-openjdk), which is
just fine for Cassandra 4.x.
On Wed, Mar 8, 2023 at 2:56 PM Eric Ferrenbach <
eric.ferrenb...@milliporesigma.com> wrote:
> We are running Cassandra 4.0.7.
>
> We are preparing to migrate our nodes from Centos to SUSE Linux.
A few weeks ago, we rolled out TLS among hosts in our clusters (running
4.0.7). More recently we also rolled out TLS between Cassandra clients and
the cluster. Today, we started seeing a lot of dropped actions in one
cluster that correlate with warnings like this:
WARN [epollEventLoopGroup-5-31
r.
On Wed, Apr 12, 2023 at 11:36 AM Elliott Sims wrote:
> A few weeks ago, we rolled out TLS among hosts in our clusters (running
> 4.0.7). More recently we also rolled out TLS between Cassandra clients and
> the cluster. Today, we started seeing a lot of dropped actions in one
>
1. Check for Nagle/delayed-ack, but probably nodelay is getting set by the
driver so it shouldn't be a problem.
2. Check for network latency (just regular old ping among hosts, during
traffic)
3. Check your GC metrics and see if garbage collections line up with
outliers. Some tuning can help th