RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-23 Thread Michalis Kotsiouros (EXT) via user
Hello Sebastien,
Yes, your approach is really interesting. I will test this in my system as
well. I think it reduces some risks involved in the procedure that was
discussed in the previous emails.
Just for the record, availability is a top priority for my use cases, which is
why I have switched the default consistency level for
authentication/authorization to LOCAL_ONE, as was used in previous C*
versions.
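For reference, a sketch of how that looks in cassandra.yaml (assuming the 4.1
setting names; please double-check against your version):

auth_read_consistency_level: LOCAL_ONE
auth_write_consistency_level: LOCAL_ONE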

BR
MK
-----Original Message-----
From: Sebastian Marsching  
Sent: April 22, 2024 21:58
To: Michalis Kotsiouros (EXT) via user 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

Recently, I successfully used the following procedure when decommissioning a
datacenter:

1. Reduced the replication factor for this DC to zero for all keyspaces
except the system_auth keyspace. For that keyspace, I reduced the RF to one.
2. Decommissioned all nodes except one in the DC using the regular procedure
(no --force needed).
3. Decommissioned the last node using --force.
4. Set the RF for the system_auth keyspace to 0.
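
For illustration, assuming the DC being removed is named dc2 and the remaining
one dc1 (names and replication factors here are placeholders), the steps could
look like:

-- Step 1: drop dc2 from all keyspaces (RF 0), but keep RF 1 for system_auth
ALTER KEYSPACE my_keyspace WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3};
ALTER KEYSPACE system_auth WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 1};

# Steps 2-3: run on each dc2 node, the last one with --force
nodetool decommission
nodetool decommission --force   # last node only

-- Step 4: remove dc2 from system_auth entirely
ALTER KEYSPACE system_auth WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3};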

This procedure has two benefits:

1. Authentication on the nodes in the DC being decommissioned will work
until the last node has been decommissioned. This is important when
authentication is enabled for JMX. Otherwise, you cannot proceed when there
are too few nodes left to get a LOCAL_QUORUM on system_auth.
2. One does not have to use --force except when removing the last node.

It would be nice if the RF for the system_auth keyspace could be reduced to
zero before decommissioning the nodes. However, I think that implementing
this correctly may be hard. If there are no local replicas, queries with a
consistency level of LOCAL_QUORUM will probably fail, and this is the
consistency level used for all authentication and authorization related
queries. So, setting the RF to zero might break authentication and
authorization, which in turn might make it impossible to decommission the
nodes (without disabling authentication for that DC).

So, I guess that the code dealing with authentication and authorization
would have to be changed to use a CL of QUORUM instead of LOCAL_QUORUM when
system_auth is not replicated in the local DC.





RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-23 Thread Michalis Kotsiouros (EXT) via user
Hello Alain,
Thanks a lot for the confirmation.
Yes, this procedure seems like a workaround. But for my use case, where
system_auth contains a small amount of data and the consistency level for
authentication/authorization is switched to LOCAL_ONE, I think it is good
enough.
I completely get that this could be improved since there might be requirements 
from other users that cannot be covered with the proposed procedure.

BR
MK
From: Alain Rodriguez 
Sent: April 22, 2024 18:27
To: user@cassandra.apache.org
Cc: Michalis Kotsiouros (EXT) 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

Hi Michalis,

It's been a while since I last removed a DC, but I see there is now a
protection to avoid accidentally leaving a DC without auth capability.

This was introduced in C* 4.1 through CASSANDRA-17478 
(https://issues.apache.org/jira/browse/CASSANDRA-17478).

The process of dropping a data center might have been overlooked while doing 
this work.

"It's never correct for an operator to remove a DC from system_auth replication
settings while there are currently nodes up in that DC."

I believe this assertion is not correct. As Jon and Jeff mentioned, when
removing an entire DC we usually remove the replication before decommissioning
any node, for the reasons Jeff explained. The existing documentation is also
clear about this:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
and https://thelastpickle.com/blog/2019/02/26/data-center-switch.html.

Michalis, the solution you suggest seems to be the (good/only?) way to go, even
though it looks like a workaround, not really "clean", and something we need to
improve. It was also mentioned here:
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890.
It should work quickly, but only because this keyspace holds a fairly small
amount of data; it will still not be as fast as it should be (it should be a
near no-op, as Jeff explained above). It also obliges you to use the "--force"
option, which could lead you to remove one of your nodes in another DC by
mistake (and in a loaded cluster, or a 3-node cluster with RF = 3, this could
hurt...). Having to operate using "nodetool decommission --force" cannot be
standard, but for now I can't think of anything better for you. Maybe wait for
someone else's confirmation; it's been a while since I operated Cassandra :).

I think it would make sense to fix this somehow in Cassandra. Maybe we should
ensure that no other keyspace has an RF > 0 for this data center instead of
looking at active nodes, or that no client is connected to the nodes, or add a
manual flag somewhere, or something else? Even though I understand the
motivation to protect users against a wrongly distributed system_auth keyspace,
I think this protection should not be kept with this implementation. If that
makes sense, I can create a ticket for this problem.

C*heers,

Alain Rodriguez

casterix.fr




On Mon, Apr 8, 2024 at 16:26, Michalis Kotsiouros (EXT) via user
<user@cassandra.apache.org> wrote:
Hello Jon and Jeff,
Thanks a lot for your replies.
I completely get your points.
Some more clarification about my issue.
When trying to update the replication before the decommission, I get the
following error message when I remove the replication for the system_auth keyspace.
ConfigurationException: Following datacenters have active nodes and must be 
present in replication options for keyspace system_auth: [datacenter1]

This error message does not appear in the rest of the application keyspaces.
So, may I change the procedure to:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Change all keyspaces except the system_auth keyspace so they no longer
reference the datacenter being removed.
  4.  Run nodetool decommission using the --force option on every node in the
datacenter being removed.
  5.  Change the system_auth keyspace so it no longer references the datacenter
being removed.
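
In CQL/nodetool terms, steps 3-5 would look something like this sketch
(assuming datacenter1 is being removed; the keyspace and remaining DC names
are placeholders):

-- Step 3: remove datacenter1 from every application keyspace
ALTER KEYSPACE my_app_keyspace WITH replication =
  {'class': 'NetworkTopologyStrategy', 'datacenter2': 3};

# Step 4: on every node in datacenter1
nodetool decommission --force

-- Step 5: finally remove datacenter1 from system_auth
ALTER KEYSPACE system_auth WITH replication =
  {'class': 'NetworkTopologyStrategy', 'datacenter2': 3};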
BR
MK



From: Jeff Jirsa <jji...@gmail.com>
Sent: April 08, 2024 17:19
To: cassandra <user@cassandra.apache.org>
Cc: Michalis Kotsiouros (EXT) <michalis.kotsiouros@ericsson.com>
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

To Jon’s point, if you remove the DC from replication after step 1 or step 2
(probably step 2 if your goal is to be strictly correct), the nodetool
decommission phase becomes almost a no-op.

If you use th

Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
It works fine with periodic or batch commitlog_sync options, but I'm having
tons of issues when I change it to "group". I have
"commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while (true) {
    sem.acquire();
    // issue the write asynchronously and release the permit once it completes
    session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
           .whenComplete((result, error) -> sem.release());
}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with many inserts at
a time, and that fails with timeouts when the batch size exceeds 20.

Increasing the write request timeout in cassandra.yaml makes it time out at
slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and with batch I'm
able to get about 14k / second.

Any tips on what I should be doing to get group commitlog_sync to work
properly? I didn't expect to have to do anything other than change the
config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
Why would you want to set commitlog_sync_batch_window to 1 second
when commitlog_sync is set to batch mode? The documentation
on this says:


   "This window should be kept short because the writer threads will be
   unable to do extra work while waiting. You may need to increase
   concurrent_writes for the same reason"

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default is 2
milliseconds.



On 23/04/2024 18:32, Nathan Marz wrote:
I'm doing some benchmarking of Cassandra on a single m6gd.large 
instance. It works fine with periodic or batch commitlog_sync options, 
but I'm having tons of issues when I change it to "group". I have 
"commitlog_sync_group_window" set to 1000ms.


My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with many 
inserts at a time, and that fails with timeout when the batch size 
gets more than 20.


Increasing the write request timeout in cassandra.yaml makes it time 
out at slightly higher numbers of concurrent requests.


With periodic I'm able to get about 38k writes / second, and with 
batch I'm able to get about 14k / second.


Any tips on what I should be doing to get group commitlog_sync to work 
properly? I didn't expect to have to do anything other than change the 
config.

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
"batch" mode works fine. I'm having trouble with "group" mode. The only
config for that is "commitlog_sync_group_window", and I have that set to
the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Why would you want to set commitlog_sync_batch_window to 1 second long
> when commitlog_sync is set to batch mode? The documentation
> 
> on this says:
>
> *This window should be kept short because the writer threads will be
> unable to do extra work while waiting. You may need to increase
> concurrent_writes for the same reason*
>
> If you want to use batch mode, at least ensure commitlog_sync_batch_window
> is reasonably short. The default is 2 millisecond.
>
>
> On 23/04/2024 18:32, Nathan Marz wrote:
>
> I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
> It works fine with periodic or batch commitlog_sync options, but I'm having
> tons of issues when I change it to "group". I have
> "commitlog_sync_group_window" set to 1000ms.
>
> My client is doing writes like this (pseudocode):
>
> Semaphore sem = new Semaphore(numTickets);
> while(true) {
>
> sem.acquire();
> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr())
> .whenComplete((t, u) -> sem.release())
>
> }
>
> If I set numTickets higher than 20, I get tons of timeout errors.
>
> I've also tried doing single commands with BatchStatement with many
> inserts at a time, and that fails with timeout when the batch size gets
> more than 20.
>
> Increasing the write request timeout in cassandra.yaml makes it time out
> at slightly higher numbers of concurrent requests.
>
> With periodic I'm able to get about 38k writes / second, and with batch
> I'm able to get about 14k / second.
>
> Any tips on what I should be doing to get group commitlog_sync to work
> properly? I didn't expect to have to do anything other than change the
> config.
>
>


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
The default commitlog_sync_group_window is very long for SSDs. Try 
reducing it if you are using SSD-backed storage for the commit log. 10-15 
ms is a good starting point. You may also want to increase the value of 
concurrent_writes; consider at least doubling or quadrupling it from the 
default. You'll need even higher write concurrency for a longer 
commitlog_sync_group_window.
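
Concretely, that tuning might look like this in cassandra.yaml (illustrative
values; the concurrent_writes default is 32):

commitlog_sync_group_window: 15ms
concurrent_writes: 128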



On 23/04/2024 19:26, Nathan Marz wrote:
"batch" mode works fine. I'm having trouble with "group" mode. The 
only config for that is "commitlog_sync_group_window", and I have that 
set to the default 1000ms.


On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user 
 wrote:


Why would you want to set commitlog_sync_batch_window to 1 second
long when commitlog_sync is set to batch mode? The documentation


on this says:

/This window should be kept short because the writer threads
will be unable to do extra work while waiting. You may need to
increase concurrent_writes for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default is 2
millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single m6gd.large
instance. It works fine with periodic or batch commitlog_sync
options, but I'm having tons of issues when I change it to
"group". I have "commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with
many inserts at a time, and that fails with timeout when the
batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes it
time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and with
batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group commitlog_sync to
work properly? I didn't expect to have to do anything other than
change the config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single execute of a
BatchStatement containing 100 inserts to succeed. However, the throughput
I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements at a time using the
semaphore + loop approach I showed in my first message. As requests complete,
more are sent out, so there are 10 in-flight at a time. Each BatchStatement has
100 individual inserts. I'm seeing only 730 inserts / second. Again, with
periodic mode I see 38k / second and with batch I see 14k / second. My
expectation was that group commit mode throughput would be somewhere between
those two.

If I set commitlog_sync_group_window to 100ms, the throughput drops to 14 /
second.

If I set commitlog_sync_group_window to 10ms, the throughput increases to
1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput increases to
3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput increases to
13k / second, which is slightly less than batch commit mode.

Is group commit mode supposed to have better performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> The default commitlog_sync_group_window is very long for SSDs. Try reduce
> it if you are using SSD-backed storage for the commit log. 10-15 ms is a
> good starting point. You may also want to increase the value of
> concurrent_writes, consider at least double or quadruple it from the
> default. You'll need even higher write concurrency for longer
> commitlog_sync_group_window.
>
> On 23/04/2024 19:26, Nathan Marz wrote:
>
> "batch" mode works fine. I'm having trouble with "group" mode. The only
> config for that is "commitlog_sync_group_window", and I have that set to
> the default 1000ms.
>
> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Why would you want to set commitlog_sync_batch_window to 1 second long
>> when commitlog_sync is set to batch mode? The documentation
>> 
>> on this says:
>>
>> *This window should be kept short because the writer threads will be
>> unable to do extra work while waiting. You may need to increase
>> concurrent_writes for the same reason*
>>
>> If you want to use batch mode, at least ensure
>> commitlog_sync_batch_window is reasonably short. The default is 2
>> millisecond.
>>
>>
>> On 23/04/2024 18:32, Nathan Marz wrote:
>>
>> I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
>> It works fine with periodic or batch commitlog_sync options, but I'm having
>> tons of issues when I change it to "group". I have
>> "commitlog_sync_group_window" set to 1000ms.
>>
>> My client is doing writes like this (pseudocode):
>>
>> Semaphore sem = new Semaphore(numTickets);
>> while(true) {
>>
>> sem.acquire();
>> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr())
>> .whenComplete((t, u) -> sem.release())
>>
>> }
>>
>> If I set numTickets higher than 20, I get tons of timeout errors.
>>
>> I've also tried doing single commands with BatchStatement with many
>> inserts at a time, and that fails with timeout when the batch size gets
>> more than 20.
>>
>> Increasing the write request timeout in cassandra.yaml makes it time out
>> at slightly higher numbers of concurrent requests.
>>
>> With periodic I'm able to get about 38k writes / second, and with batch
>> I'm able to get about 14k / second.
>>
>> Any tips on what I should be doing to get group commitlog_sync to work
>> properly? I didn't expect to have to do anything other than change the
>> config.
>>
>>


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
I suspect you are abusing batch statements. Batch statements should only 
be used where atomicity or isolation is needed. Using batch statements 
won't make inserting into multiple partitions faster. In fact, it often 
makes that slower.
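
For example (a sketch against the Java driver 4.x API; the statement and
variable names are made up), a batch fits writes that must apply atomically to
one partition, while unrelated rows are better sent as individual async inserts:

// Appropriate: atomic writes to the same partition
BatchStatement batch = BatchStatement.builder(DefaultBatchType.LOGGED)
    .addStatement(insert.bind(userId, "a", "b"))
    .addStatement(insert.bind(userId, "c", "d"))
    .build();
session.execute(batch);

// Usually faster for unrelated partitions: individual async inserts
for (int i = 0; i < 100; i++) {
    session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()));
}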


Also, the linear relationship between commitlog_sync_group_window and 
write throughput is expected. That's because the max number of 
uncompleted writes is limited by the write concurrency, and a write is 
not considered "complete" before it is synced to disk when commitlog 
sync is in group or batch mode. That means within each interval, only a 
limited number of writes can be done. The ways to increase that 
include: adding more nodes, syncing the commitlog at shorter intervals, 
and allowing more concurrent writes.
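
As a rough upper bound (a simplification, not an exact formula):

    max completed writes per second ≈ max in-flight writes / sync window

For example, with 10 requests in flight and a 20 ms window, at most
10 / 0.02 = 500 requests can complete per second, no matter how fast the disk is.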



On 23/04/2024 20:43, Nathan Marz wrote:
Thanks. I raised concurrent_writes to 128 and 
set commitlog_sync_group_window to 20ms. This causes a single execute 
of a BatchStatement containing 100 inserts to succeed. However, the 
throughput I'm seeing is atrocious.


With these settings, I'm executing 10 BatchStatement concurrently at a 
time using the semaphore + loop approach I showed in my first message. 
So as requests complete, more are sent out such that there are 10 
in-flight at a time. Each BatchStatement has 100 individual inserts. 
I'm seeing only 730 inserts / second. Again, with periodic mode I see 
38k / second and with batch I see 14k / second. My expectation was 
that group commit mode throughput would be somewhere between those two.


If I set commitlog_sync_group_window to 100ms, the throughput drops to 
14 / second.


If I set commitlog_sync_group_window to 10ms, the throughput increases 
to 1587 / second.


If I set commitlog_sync_group_window to 5ms, the throughput increases 
to 3200 / second.


If I set commitlog_sync_group_window to 1ms, the throughput increases 
to 13k / second, which is slightly less than batch commit mode.


Is group commit mode supposed to have better performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user 
 wrote:


The default commitlog_sync_group_window is very long for SSDs. Try
reduce it if you are using SSD-backed storage for the commit log.
10-15 ms is a good starting point. You may also want to increase
the value of concurrent_writes, consider at least double or
quadruple it from the default. You'll need even higher write
concurrency for longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group" mode.
The only config for that is "commitlog_sync_group_window", and I
have that set to the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to 1
second long when commitlog_sync is set to batch mode? The
documentation


on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while waiting.
You may need to increase concurrent_writes for the same
reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default
is 2 millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or batch
commitlog_sync options, but I'm having tons of issues when I
change it to "group". I have "commitlog_sync_group_window"
set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

If I set numTickets higher than 20, I get tons of timeout
errors.

I've also tried doing single commands with BatchStatement
with many inserts at a time, and that fails with timeout
when the batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes
it time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and
with batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group
commitlog_sync to work properly? I didn't expect to have to
do anything other than change the config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at
512, and doing 1000 individual inserts at a time with the same loop +
semaphore approach. This only nets 9k / second.

I got much higher throughput in the other modes with BatchStatements of 100
inserts than with 100x as many individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> I suspect you are abusing batch statements. Batch statements should only
> be used where atomicity or isolation is needed. Using batch statements
> won't make inserting multiple partitions faster. In fact, it often will
> make that slower.
>
> Also, the liner relationship between commitlog_sync_group_window and
> write throughput is expected. That's because the max number of uncompleted
> writes is limited by the write concurrency, and a write is not considered
> "complete" before it is synced to disk when commitlog sync is in group or
> batch mode. That means within each interval, only limited number of writes
> can be done. The ways to increase that including: add more nodes, sync
> commitlog at shorter intervals and allow more concurrent writes.
>
>
> On 23/04/2024 20:43, Nathan Marz wrote:
>
> Thanks. I raised concurrent_writes to 128 and
> set commitlog_sync_group_window to 20ms. This causes a single execute of a
> BatchStatement containing 100 inserts to succeed. However, the throughput
> I'm seeing is atrocious.
>
> With these settings, I'm executing 10 BatchStatement concurrently at a
> time using the semaphore + loop approach I showed in my first message. So
> as requests complete, more are sent out such that there are 10 in-flight at
> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
> inserts / second. Again, with periodic mode I see 38k / second and with
> batch I see 14k / second. My expectation was that group commit mode
> throughput would be somewhere between those two.
>
> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14
> / second.
>
> If I set commitlog_sync_group_window to 10ms, the throughput increases to
> 1587 / second.
>
> If I set commitlog_sync_group_window to 5ms, the throughput increases to
> 3200 / second.
>
> If I set commitlog_sync_group_window to 1ms, the throughput increases to
> 13k / second, which is slightly less than batch commit mode.
>
> Is group commit mode supposed to have better performance than batch mode?
>
>
> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The default commitlog_sync_group_window is very long for SSDs. Try
>> reduce it if you are using SSD-backed storage for the commit log. 10-15 ms
>> is a good starting point. You may also want to increase the value of
>> concurrent_writes, consider at least double or quadruple it from the
>> default. You'll need even higher write concurrency for longer
>> commitlog_sync_group_window.
>>
>> On 23/04/2024 19:26, Nathan Marz wrote:
>>
>> "batch" mode works fine. I'm having trouble with "group" mode. The only
>> config for that is "commitlog_sync_group_window", and I have that set to
>> the default 1000ms.
>>
>> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Why would you want to set commitlog_sync_batch_window to 1 second long
>>> when commitlog_sync is set to batch mode? The documentation
>>> 
>>> on this says:
>>>
>>> *This window should be kept short because the writer threads will be
>>> unable to do extra work while waiting. You may need to increase
>>> concurrent_writes for the same reason*
>>>
>>> If you want to use batch mode, at least ensure
>>> commitlog_sync_batch_window is reasonably short. The default is 2
>>> millisecond.
>>>
>>>
>>> On 23/04/2024 18:32, Nathan Marz wrote:
>>>
>>> I'm doing some benchmarking of Cassandra on a single m6gd.large
>>> instance. It works fine with periodic or batch commitlog_sync options, but
>>> I'm having tons of issues when I change it to "group". I have
>>> "commitlog_sync_group_window" set to 1000ms.
>>>
>>> My client is doing writes like this (pseudocode):
>>>
>>> Semaphore sem = new Semaphore(numTickets);
>>> while(true) {
>>>
>>> sem.acquire();
>>> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
>>> genUUIDStr())
>>> .whenComplete((t, u) -> sem.release())
>>>
>>> }
>>>
>>> If I set numTickets higher than 20, I get tons of timeout errors.
>>>
>>> I've also tried doing single commands with BatchStatement with many
>>> inserts at a time, and that fails with timeout when the batch size gets
>>> more than 20.
>>>
>>> Increasing the write request timeout in cassandra.yaml makes it time out
>>> at slightly higher numbers of concurrent requests.
>>>
>>> With periodic I'm able to get about 38k writes / second, and with batch
>>> I'm able to get about 14k / second.
>>>
>>> Any tips on 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
Have you checked the thread CPU utilisation of the client side? You 
likely will need more than one thread to do insertion in a loop to 
achieve tens of thousands of inserts per second.
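
For example, on Linux something like the following shows per-thread CPU
(exact tooling depends on your OS; <client-pid> is your client process):

# per-thread CPU of the client process
top -H -p <client-pid>
pidstat -t 1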



On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, 
concurrent_writes at 512, and doing 1000 individual inserts at a time 
with the same loop + semaphore approach. This only nets 9k / second.


I got much higher throughput for the other modes with BatchStatement 
of 100 inserts rather than 100x more individual inserts.


On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user 
 wrote:


I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed. Using
batch statements won't make inserting multiple partitions faster.
In fact, it often will make that slower.

Also, the liner relationship between commitlog_sync_group_window
and write throughput is expected. That's because the max number of
uncompleted writes is limited by the write concurrency, and a
write is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That means within
each interval, only limited number of writes can be done. The ways
to increase that including: add more nodes, sync commitlog at
shorter intervals and allow more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single
execute of a BatchStatement containing 100 inserts to succeed.
However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatement concurrently
at a time using the semaphore + loop approach I showed in my
first message. So as requests complete, more are sent out such
that there are 10 in-flight at a time. Each BatchStatement has
100 individual inserts. I'm seeing only 730 inserts / second.
Again, with periodic mode I see 38k / second and with batch I see
14k / second. My expectation was that group commit mode
throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput
drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance than
batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reduce it if you are using SSD-backed storage for
the commit log. 10-15 ms is a good starting point. You may
also want to increase the value of concurrent_writes,
consider at least double or quadruple it from the default.
You'll need even higher write concurrency for longer
commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group"
mode. The only config for that is
"commitlog_sync_group_window", and I have that set to the
default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to
1 second long when commitlog_sync is set to batch mode?
The documentation


on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while
waiting. You may need to increase concurrent_writes
for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The
default is 2 millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or
batch commitlog_sync options, but I'm having tons of
issues when I change it to "group". I have
"commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
It's using the async API, so why would it need multiple threads? Using the
exact same approach I'm able to get 38k / second with periodic
commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Have you checked the thread CPU utilisation of the client side? You likely
> will need more than one thread to do insertion in a loop to achieve tens of
> thousands of inserts per second.
>
>
> On 23/04/2024 21:55, Nathan Marz wrote:
>
> Thanks for the explanation.
>
> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
> at 512, and doing 1000 individual inserts at a time with the same loop +
> semaphore approach. This only nets 9k / second.
>
> I got much higher throughput for the other modes with BatchStatement of
> 100 inserts rather than 100x more individual inserts.
>
> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> I suspect you are abusing batch statements. Batch statements should only
>> be used where atomicity or isolation is needed. Using batch statements
>> won't make inserting multiple partitions faster. In fact, it often will
>> make that slower.
>>
>> Also, the liner relationship between commitlog_sync_group_window and
>> write throughput is expected. That's because the max number of uncompleted
>> writes is limited by the write concurrency, and a write is not considered
>> "complete" before it is synced to disk when commitlog sync is in group or
>> batch mode. That means within each interval, only limited number of writes
>> can be done. The ways to increase that including: add more nodes, sync
>> commitlog at shorter intervals and allow more concurrent writes.
>>
>>
>> On 23/04/2024 20:43, Nathan Marz wrote:
>>
>> Thanks. I raised concurrent_writes to 128 and
>> set commitlog_sync_group_window to 20ms. This causes a single execute of a
>> BatchStatement containing 100 inserts to succeed. However, the throughput
>> I'm seeing is atrocious.
>>
>> With these settings, I'm executing 10 BatchStatement concurrently at a
>> time using the semaphore + loop approach I showed in my first message. So
>> as requests complete, more are sent out such that there are 10 in-flight at
>> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
>> inserts / second. Again, with periodic mode I see 38k / second and with
>> batch I see 14k / second. My expectation was that group commit mode
>> throughput would be somewhere between those two.
>>
>> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14
>> / second.
>>
>> If I set commitlog_sync_group_window to 10ms, the throughput increases to
>> 1587 / second.
>>
>> If I set commitlog_sync_group_window to 5ms, the throughput increases to
>> 3200 / second.
>>
>> If I set commitlog_sync_group_window to 1ms, the throughput increases to
>> 13k / second, which is slightly less than batch commit mode.
>>
>> Is group commit mode supposed to have better performance than batch mode?
>>
>>
>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> The default commitlog_sync_group_window is very long for SSDs. Try
>>> reduce it if you are using SSD-backed storage for the commit log. 10-15 ms
>>> is a good starting point. You may also want to increase the value of
>>> concurrent_writes, consider at least double or quadruple it from the
>>> default. You'll need even higher write concurrency for longer
>>> commitlog_sync_group_window.
>>>
>>> On 23/04/2024 19:26, Nathan Marz wrote:
>>>
>>> "batch" mode works fine. I'm having trouble with "group" mode. The only
>>> config for that is "commitlog_sync_group_window", and I have that set to
>>> the default 1000ms.
>>>
>>> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 Why would you want to set commitlog_sync_batch_window to 1 second long
 when commitlog_sync is set to batch mode? The documentation
 
 on this says:

 *This window should be kept short because the writer threads will be
 unable to do extra work while waiting. You may need to increase
 concurrent_writes for the same reason*

 If you want to use batch mode, at least ensure
 commitlog_sync_batch_window is reasonably short. The default is 2
 millisecond.


 On 23/04/2024 18:32, Nathan Marz wrote:

 I'm doing some benchmarking of Cassandra on a single m6gd.large
 instance. It works fine with periodic or batch commitlog_sync options, but
 I'm having tons of issues when I change it to "group". I have
 "commitlog_sync_group_window" set to 1000ms.

 My client is doing writes like this (pseudocode):

 Semaphore sem = new Semaphore(numTickets);
 while(true) {

 sem

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
To achieve 10k loop iterations per second, each iteration must take 0.1 
milliseconds or less. Considering that each iteration needs to lock and 
unlock the semaphore (two syscalls) and make network requests (more 
syscalls), that's a lot of context switches. It may be a bit too much to 
ask of a single thread. I would suggest trying multi-threading or 
multi-processing, and seeing if the combined insert rate is higher.
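
A minimal sketch of what I mean (assuming the session and prepared statement
from your snippet; the thread count is arbitrary):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

ExecutorService pool = Executors.newFixedThreadPool(4);
Semaphore sem = new Semaphore(numTickets);  // shared cap on total in-flight writes
for (int t = 0; t < 4; t++) {
    pool.submit(() -> {
        try {
            while (true) {
                sem.acquire();
                session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
                       .whenComplete((result, error) -> sem.release());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // stop cleanly if interrupted
        }
    });
}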


I should also note that executeAsync() also has implicit limits on the 
number of in-flight requests, which default to 1024 requests per 
connection and 1 connection per server. See 
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
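
If those driver defaults are the limit, they can be raised in the driver's
application.conf (driver 4.x configuration; the values are illustrative):

datastax-java-driver {
  advanced.connection {
    pool.local.size = 2
    max-requests-per-connection = 2048
  }
}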



On 23/04/2024 23:18, Nathan Marz wrote:
It's using the async API, so why would it need multiple threads? Using 
the exact same approach I'm able to get 38k / second with periodic 
commitlog_sync. For what it's worth, I do see 100% CPU utilization in 
every single one of these tests.


On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user 
 wrote:


Have you checked the thread CPU utilisation of the client side?
You likely will need more than one thread to do insertion in a
loop to achieve tens of thousands of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts at a
time with the same loop + semaphore approach. This only nets 9k /
second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more individual
inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed.
Using batch statements won't make inserting multiple
partitions faster. In fact, it often will make that slower.

Also, the liner relationship between
commitlog_sync_group_window and write throughput is expected.
That's because the max number of uncompleted writes is
limited by the write concurrency, and a write is not
considered "complete" before it is synced to disk when
commitlog sync is in group or batch mode. That means within
each interval, only limited number of writes can be done. The
ways to increase that including: add more nodes, sync
commitlog at shorter intervals and allow more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100 inserts to
succeed. However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatement
concurrently at a time using the semaphore + loop approach I
showed in my first message. So as requests complete, more
are sent out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts. I'm seeing
only 730 inserts / second. Again, with periodic mode I see
38k / second and with batch I see 14k / second. My
expectation was that group commit mode throughput would be
somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance
than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reduce it if you are using SSD-backed storage
for the commit log. 10-15 ms is a good starting point.
You may also want to increase the value of
concurrent_writes, consider at least double or quadruple
it from the default. You'll need even higher write
concurrency for longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with
"group" mode. The only config for that is
"commitlog_sync_group_window", and I have that set to
the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set
commitlog_sync_batch_window to 1 second long when
 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
Tried it again with one more client thread, and that had no effect on
performance. This is unsurprising, as there are only 2 CPUs on this node and
they were already at 100%. These were good ideas, but I'm still unable to even
match the performance of batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> To achieve 10k loop iterations per second, each iteration must take 0.1
> milliseconds or less. Considering that each iteration needs to lock and
> unlock the semaphore (two syscalls) and make network requests (more
> syscalls), that's a lots of context switches. It may a bit too much to ask
> for a single thread. I would suggest try multi-threading or
> multi-processing, and see if the combined insert rate is higher.
>
> I should also note that executeAsync() also has implicit limits on the
> number of in-flight requests, which default to 1024 requests per connection
> and 1 connection per server. See
> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>
>
> On 23/04/2024 23:18, Nathan Marz wrote:
>
> It's using the async API, so why would it need multiple threads? Using the
> exact same approach I'm able to get 38k / second with periodic
> commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
> single one of these tests.
>
> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Have you checked the thread CPU utilisation of the client side? You
>> likely will need more than one thread to do insertion in a loop to achieve
>> tens of thousands of inserts per second.
>>
>>
>> On 23/04/2024 21:55, Nathan Marz wrote:
>>
>> Thanks for the explanation.
>>
>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
>> at 512, and doing 1000 individual inserts at a time with the same loop +
>> semaphore approach. This only nets 9k / second.
>>
>> I got much higher throughput for the other modes with BatchStatement of
>> 100 inserts rather than 100x more individual inserts.
>>
>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> I suspect you are abusing batch statements. Batch statements should only
>>> be used where atomicity or isolation is needed. Using batch statements
>>> won't make inserting multiple partitions faster. In fact, it often will
>>> make that slower.
>>>
>>> Also, the liner relationship between commitlog_sync_group_window and
>>> write throughput is expected. That's because the max number of uncompleted
>>> writes is limited by the write concurrency, and a write is not considered
>>> "complete" before it is synced to disk when commitlog sync is in group or
>>> batch mode. That means within each interval, only limited number of writes
>>> can be done. The ways to increase that including: add more nodes, sync
>>> commitlog at shorter intervals and allow more concurrent writes.
>>>
>>>
>>> On 23/04/2024 20:43, Nathan Marz wrote:
>>>
>>> Thanks. I raised concurrent_writes to 128 and
>>> set commitlog_sync_group_window to 20ms. This causes a single execute of a
>>> BatchStatement containing 100 inserts to succeed. However, the throughput
>>> I'm seeing is atrocious.
>>>
>>> With these settings, I'm executing 10 BatchStatement concurrently at a
>>> time using the semaphore + loop approach I showed in my first message. So
>>> as requests complete, more are sent out such that there are 10 in-flight at
>>> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
>>> inserts / second. Again, with periodic mode I see 38k / second and with
>>> batch I see 14k / second. My expectation was that group commit mode
>>> throughput would be somewhere between those two.
>>>
>>> If I set commitlog_sync_group_window to 100ms, the throughput drops to
>>> 14 / second.
>>>
>>> If I set commitlog_sync_group_window to 10ms, the throughput increases
>>> to 1587 / second.
>>>
>>> If I set commitlog_sync_group_window to 5ms, the throughput increases to
>>> 3200 / second.
>>>
>>> If I set commitlog_sync_group_window to 1ms, the throughput increases to
>>> 13k / second, which is slightly less than batch commit mode.
>>>
>>> Is group commit mode supposed to have better performance than batch mode?
>>>
>>>
>>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 The default commitlog_sync_group_window is very long for SSDs. Try
 reduce it if you are using SSD-backed storage for the commit log. 10-15 ms
 is a good starting point. You may also want to increase the value of
 concurrent_writes, consider at least double or quadruple it from the
 default. You'll need even higher write concurrency for longer
 commitlog_sync_group_window.

 On 23/04/2024 19:26, Nathan Marz wrote:

 "batch" mode works fine. I'm having trouble with "group" mode. The only
 config for that is "commitlog_sync_group_window",