Re: CASSANDRA-12888: Streaming and MVs

2016-12-07 Thread Benjamin Roth
Hi Paulo,

First of all, thanks for your review!

I had the same concerns as you, but I thought this was being handled
correctly (which it is in some situations). However, I found one case that
creates the inconsistencies you mentioned. It is a kind of split-brain
syndrome that occurs when multiple nodes fail between repairs. See here:
https://cl.ly/3t0X1c0q1L1h.

I am not happy about it, but I support your decision. We should then add
another dtest to cover this scenario, as the existing dtests don't.

Some issues unfortunately remain:
- 12888 is not resolved.
- MV repairs may still be painfully slow. Imagine an inconsistency of a
single cell (possibly also due to a validation race condition, see
CASSANDRA-12991) on a big partition. I had issues with Reaper and a
30-minute timeout leading to 1000+ (yes!) consecutive repairs of a single
subrange, because it always timed out and I noticed this very late. When I
deployed 12888 on my system, this remaining subrange was repaired in a snap.
- I guess rebuild works the same way as repair and has to go through the
write path, right?

=> The MV repair may induce so much overhead that it is perhaps cheaper to
kill and replace an inconsistent node than to repair it. But that may
introduce inconsistencies again. All in all it is not perfect, and all of
this does not entirely remove my frustration.

Do you have any more thoughts?

Unfortunately I have very little time these days, as my second child was
born on Monday. So thanks for your support so far. Maybe I will have some
ideas on these issues during the next days, and I will probably work on
that ticket next week to come to a solution that is at least deployable.
I'd also appreciate your opinion on CASSANDRA-12991.

2016-12-07 2:53 GMT+01:00 Paulo Motta :

> Hello Benjamin,
>
> Thanks for your effort on this investigation! For bootstraps and range
> transfers, I think we can indeed simplify and stream base tables and MVs as
> ordinary tables, unless there is some caveat I'm missing (I didn't find any
> special case for bootstrap/range transfers in CASSANDRA-6477 or in the MV
> design doc; please correct me if I'm wrong).
>
> Regarding repair of base tables, applying mutations via the write path is
> a matter of correctness, given that base table updates potentially need to
> remove previously referenced keys in the views, so repairing only the base
> table may leave unreferenced keys in the views, breaking the MV contract.
> Furthermore, these unreferenced keys may be propagated to other replicas
> and never removed if you repair only the view. If you don't do overwrites
> in the base table, this is probably not a problem, but the DB cannot
> ensure this (at least not before CASSANDRA-9779). Also, as you already
> noticed, repairing only the base table is probably faster, so I don't see
> a reason to repair the base and MVs separately, since that is potentially
> more costly. I believe your frustration is mostly due to the bug described
> in CASSANDRA-12905, but after that and CASSANDRA-12888 are fixed, repair
> on the base table should work just fine.
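
To make the unreferenced-key problem above concrete, here is a tiny
self-contained model, assuming a view keyed on a regular base column (plain
Java maps stand in for the base table and the view; this is an illustration,
not Cassandra code):

    import java.util.HashMap;
    import java.util.Map;

    // Toy model: a base table (pk -> value) and a view that indexes
    // value -> pk. Shows why copying only the base data (sstable-style)
    // can leave an unreferenced key in the view, while applying the same
    // update through the write path cleans up the old view entry.
    public class UnreferencedViewKey {

        // "Write path": updating the base also fixes up the view,
        // including removing the entry for the previously referenced value.
        static void writePath(Map<String, String> base, Map<String, String> view,
                              String pk, String value) {
            String previous = base.put(pk, value);
            if (previous != null)
                view.remove(previous);   // tombstone for the old view key
            view.put(value, pk);
        }

        public static void main(String[] args) {
            // Case 1: the replica starts with pk1 -> "red" and repair copies
            // the newer base row pk1 -> "blue" without touching the view.
            Map<String, String> base = new HashMap<>();
            Map<String, String> view = new HashMap<>();
            writePath(base, view, "pk1", "red");
            base.put("pk1", "blue");     // sstable-style transfer
            System.out.println(view);    // {red=pk1} <- unreferenced view key

            // Case 2: the same repaired row applied through the write path.
            Map<String, String> base2 = new HashMap<>();
            Map<String, String> view2 = new HashMap<>();
            writePath(base2, view2, "pk1", "red");
            writePath(base2, view2, "pk1", "blue");
            System.out.println(view2);   // {blue=pk1} <- old key removed
        }
    }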
>
> Based on this, I propose:
> - Fix CASSANDRA-12905 with your original patch that retries acquiring the
> MV lock instead of throwing a WriteTimeoutException during streaming,
> since this is blocking 3.10.
> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables
> while still applying MV updates in the paired replicas.
> - Create a new ticket to use ordinary streaming for non-repair MV stream
> sessions and keep the current behavior for MV streaming originating from
> repair (see the sketch after this list).
> - Create a new ticket to include only the base tables, and not the MVs, in
> keyspace-level repair, since repairing the base already repairs the views;
> this avoids people shooting themselves in the foot.
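
A minimal sketch of that streaming decision, using made-up names (StreamKind,
TableInfo and requiresWritePath are illustrative only, not Cassandra's actual
classes):

    // Illustrative only: how the receiving side of a stream session could
    // decide, per table, whether incoming data can be added as plain
    // sstables or must be replayed through the write path.
    enum StreamKind { BOOTSTRAP, REBUILD, RANGE_TRANSFER, REPAIR }

    final class TableInfo {
        final boolean hasMaterializedViews;
        TableInfo(boolean hasMaterializedViews) {
            this.hasMaterializedViews = hasMaterializedViews;
        }
    }

    final class StreamPathDecider {
        // Repair of a base table with views must go through the write path,
        // because a repaired base row may have to remove previously
        // referenced keys from the views. Bootstrap, rebuild and range
        // transfers move whole token ranges, so base tables and views can
        // be streamed as ordinary tables.
        static boolean requiresWritePath(StreamKind kind, TableInfo table) {
            return kind == StreamKind.REPAIR && table.hasMaterializedViews;
        }
    }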
>
> Please let me know what you think. Any suggestions or feedback would be
> appreciated.
>
> Cheers,
>
> Paulo
>
> 2016-12-02 8:27 GMT-02:00 Benjamin Roth :
>
> > As I haven't received a single reply on that, I went ahead and
> > implemented and tested it on my own with our production cluster. I had a
> > real pain with bringing up a new node, so I had to move on.
> >
> > Result:
> > Works like a charm. I ran many dtests that relate in any way to storage,
> > streaming, bootstrap, ... with good results.
> > The bootstrap finished in under 5:30h, without a single error log during
> > bootstrap. Afterwards, repairs also run smoothly and the cluster seems
> > to operate quite well.
> >
> > I still need:
> >
> >    - Reviews (see 12888, 12905, 12984)
> >    - Some opinion on whether I handled the CDC case correctly. IMHO CDC
> >    is not required on bootstrap, and we don't need to send the mutations
> >    through the write path just to write the commit log. Sending them
> >    through the write path would also break incremental repairs. Instead,
> >    for CDC the sstables are streamed as normal, but the mutations are
> >    additionally written to the commit log. The worst I see is that the
> >    node crashes and the commit logs for those repair streams are
> >    replayed, leading to duplicate writes, which is not really crucial a
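
As a rough sketch of the CDC handling described above, with made-up names
(IncomingSSTable, addDirectly and cdcEnabled are illustrative; the real
Cassandra streaming classes differ):

    import java.util.function.Consumer;

    // The received sstable is always added directly; for CDC tables its
    // mutations are additionally appended to the commit log so that CDC
    // consumers still see the streamed data.
    interface Mutation {}

    interface IncomingSSTable {
        Iterable<Mutation> mutations();  // partitions of the received sstable
        boolean cdcEnabled();            // table was created with cdc = true
        void addDirectly();              // add the sstable to the table as-is
    }

    final class CdcAwareStreamReceiver {
        private final Consumer<Mutation> commitLog;

        CdcAwareStreamReceiver(Consumer<Mutation> commitLog) {
            this.commitLog = commitLog;
        }

        void receive(IncomingSSTable sstable) {
            // No write path: avoids the memtable/compaction overhead of
            // replaying every row (and keeps incremental repair intact).
            sstable.addDirectly();

            // For CDC, also append the mutations to the commit log. If the
            // node crashes and this segment is replayed, the writes are
            // duplicated, which is the trade-off mentioned above.
            if (sstable.cdcEnabled())
                for (Mutation m : sstable.mutations())
                    commitLog.accept(m);
        }
    }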

Re: CASSANDRA-12888: Streaming and MVs

2016-12-07 Thread Benjamin Roth
Grmpf! "1000+ consecutive" must be wrong; I guess I mixed something up. But
it repaired over and over again for 1 or 2 days.

streaming_connections_per_host - speeding up CPU bound bootstrap

2016-12-07 Thread Corentin Chary
Currently, the StreamPlan created for bootstrap (and rebuild) will only
create one connection per host. If you have fewer nodes than cores, this is
likely to be CPU bound (a single CPU core seems to be able to process
~5 MB/s).
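
For illustration (made-up numbers): at ~5 MB/s per connection, streaming
500 GB from three source nodes over one connection each gives ~15 MB/s
aggregate, i.e. roughly nine hours, whereas four connections per host
(12 streams, ~60 MB/s) would finish in under two and a half hours, assuming
enough idle cores and sufficient disk and network headroom.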

Is there any reason why something naive like
https://github.com/iksaif/cassandra/commit/8352c21284811ca15d63183ceae0b11586623f31
would not work?

I believe this is what
https://issues.apache.org/jira/browse/CASSANDRA-4663 is about.
See also: https://issues.apache.org/jira/browse/CASSANDRA-12229, but I
don't believe non-blocking I/O would change anything here.

-- 
Corentin Chary
http://xf.iksaif.net


[GitHub] cassandra pull request #87: 12979 fix check disk space

2016-12-07 Thread rustyrazorblade
Github user rustyrazorblade closed the pull request at:

https://github.com/apache/cassandra/pull/87

