Re: CASSANDRA-12888: Streaming and MVs
Hi Paulo,

First of all, thanks for your review!

I had the same concerns as you, but I thought it was being handled correctly (which it is in some situations). However, I found a case that creates the inconsistencies you mentioned. It is a kind of split-brain syndrome that occurs when multiple nodes fail between repairs. See here: https://cl.ly/3t0X1c0q1L1h.

I am not happy about it, but I support your decision. We should then add another dtest to cover this scenario, as the existing dtests don't.

Some issues unfortunately remain:
- 12888 is not resolved.
- MV repairs may still be very slow. Imagine an inconsistency of a single cell (which may also be due to a validation race condition, see CASSANDRA-12991) on a big partition. I had issues with Reaper and a 30-minute timeout leading to 1000+ (yes!) consecutive repairs of a single subrange, because it always timed out and I recognized this very late. When I deployed 12888 on my system, this remaining subrange was repaired in a snap.
- I guess rebuild works the same as repair and has to go through the write path, right?

=> The MV repair may induce so much overhead that it is maybe cheaper to kill and replace an inconsistent node than to repair it. But that may introduce inconsistencies again. All in all, it is not perfect, and all this does not entirely un-frustrate me.

Do you have any more thoughts?

Unfortunately I have very little time these days, as my second child was born on Monday. So thanks for your support so far. Maybe I will have some ideas on these issues during the next days, and I will probably work on that ticket next week to come to a solution that is at least deployable. I'd also appreciate your opinion on CASSANDRA-12991.

2016-12-07 2:53 GMT+01:00 Paulo Motta :

> Hello Benjamin,
>
> Thanks for your effort on this investigation! For bootstraps and range transfers, I think we can indeed simplify and stream base tables and MVs as ordinary tables, unless there is some caveat I'm missing (I didn't find any special case for bootstrap/range transfers on CASSANDRA-6477 or in the MV design doc, please correct me if I'm wrong).
>
> Regarding repair of base tables, applying mutations via the write path is a matter of correctness, given that the base table updates need to potentially remove previously referenced keys in the views, so repairing only the base table may leave unreferenced keys in the views, breaking the MV contract. Furthermore, these unreferenced keys may be propagated to other replicas and never removed if you repair only the view. If you don't do overwrites in the base table, this is probably not a problem, but the DB cannot ensure this (at least not before CASSANDRA-9779). Furthermore, as you already noticed, repairing only the base table is probably faster, so I don't see a reason to repair the base and MVs separately since this is potentially more costly. I believe your frustration is mostly due to the bug described in CASSANDRA-12905, but after that and CASSANDRA-12888 are fixed, repair on the base table should work just fine.
>
> Based on this, I propose:
> - Fix CASSANDRA-12905 with your original patch that retries acquiring the MV lock instead of throwing WriteTimeoutException during streaming, since this is blocking 3.10.
> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables while still applying MV updates in the paired replicas.
> - Create a new ticket to use ordinary streaming for non-repair MV stream sessions and keep the current behavior for MV streaming originating from repair.
> - Create a new ticket to include only the base tables and not MVs in keyspace-level repair, since repairing the base already repairs the views, to avoid people shooting themselves in the foot.
>
> Please let me know what you think. Any suggestions or feedback are appreciated.
>
> Cheers,
>
> Paulo
>
> 2016-12-02 8:27 GMT-02:00 Benjamin Roth :
>
> > As I haven't received a single reply on that, I went ahead and implemented and tested it on my own with our production cluster. I had a real pain with bringing up a new node, so I had to move on.
> >
> > Result:
> > Works like a charm. I ran many dtests that relate in any way to storage, streaming, bootstrap, ... with good results.
> > The bootstrap finished in under 5:30h, not a single error log during bootstrap. Also afterwards, repairs run smoothly and the cluster seems to operate quite well.
> >
> > I still need:
> >
> > - Reviews (see 12888, 12905, 12984)
> > - Some opinion on whether I handled the CDC case right. IMHO CDC is not required on bootstrap, and we don't need to send the mutations through the write path just to write the commit log. This will also break incremental repairs. Instead, for CDC the sstables are streamed like normal but mutations are written to the commitlog additionally. The worst I see is that the node crashes and the commitlogs for those repair streams are replayed, leading to duplicate writes, which is not really crucial a
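To make the receive-side decision discussed in this thread concrete, here is a minimal Java sketch. It is illustrative only and is not the actual Cassandra streaming code: the class, enum and method names are invented, and the CDC handling follows the proposal above (stream sstables as-is and only duplicate mutations into the commitlog) rather than describing current behavior.

```java
// Hypothetical sketch -- not the actual Cassandra implementation. It only
// illustrates the decision this thread converges on; Origin, Action and
// decide() are invented names for the example.
public final class StreamReceiveDecision
{
    public enum Origin { BOOTSTRAP, REBUILD, RANGE_TRANSFER, REPAIR }

    public enum Action
    {
        ADD_SSTABLES_DIRECTLY,        // plain sstable streaming
        ADD_SSTABLES_AND_LOG_TO_CDC,  // plain streaming, plus a commitlog copy for CDC
        REPLAY_THROUGH_WRITE_PATH     // mutation by mutation, regenerating view updates
    }

    public static Action decide(boolean hasMaterializedViews, boolean cdcEnabled, Origin origin)
    {
        // Repairing a base table with views must go through the write path:
        // repaired base rows may shadow previously referenced view rows, and
        // only the write path emits the corresponding view tombstones.
        if (hasMaterializedViews && origin == Origin.REPAIR)
            return Action.REPLAY_THROUGH_WRITE_PATH;

        // Per the proposal above, CDC only needs the mutations duplicated into
        // the commitlog; the sstables themselves can still be streamed as-is.
        if (cdcEnabled)
            return Action.ADD_SSTABLES_AND_LOG_TO_CDC;

        // Bootstrap, rebuild and ordinary range transfers: plain streaming,
        // for base tables and MVs alike.
        return Action.ADD_SSTABLES_DIRECTLY;
    }
}
```

The point of the sketch is only that write-path replay is treated as a correctness requirement for repair of base tables with views, not as the default for every stream session.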
Re: CASSANDRA-12888: Streaming and MVs
Grmpf! 1000+ consecutive must be wrong; I guess I mixed something up. But it repaired over and over again for 1 or 2 days.
streaming_connections_per_host - speeding up CPU bound bootstrap
Currently the StreamPlan created for bootstrap (and rebuild) will only create one connection per host. If you have fewer nodes than cores, this is likely to be CPU bound (a single CPU core seems to be able to process ~5 MB/s).

Is there any reason why something naive like https://github.com/iksaif/cassandra/commit/8352c21284811ca15d63183ceae0b11586623f31 would not work? I believe this is what https://issues.apache.org/jira/browse/CASSANDRA-4663 is about.

See also https://issues.apache.org/jira/browse/CASSANDRA-12229, but I don't believe non-blocking I/O would change anything here.

--
Corentin Chary
http://xf.iksaif.net
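For what it's worth, the core idea of a streaming_connections_per_host knob can be sketched independently of the linked patch: split the ranges requested from one peer into N buckets and fetch each bucket over its own stream session/connection. The helper below is only an illustration under that assumption; it is not taken from the linked commit, and the names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the streaming_connections_per_host idea.
// Ranges requested from a single peer are round-robined into N buckets; each
// bucket would then be handed to its own connection so that several cores can
// decompress and deserialize incoming stream data in parallel.
public final class RangeBuckets
{
    public static <R> List<List<R>> splitForConnections(List<R> rangesForPeer, int connectionsPerHost)
    {
        // Never create more buckets than ranges, and always at least one.
        int buckets = Math.max(1, Math.min(connectionsPerHost, rangesForPeer.size()));
        List<List<R>> result = new ArrayList<>(buckets);
        for (int i = 0; i < buckets; i++)
            result.add(new ArrayList<>());
        for (int i = 0; i < rangesForPeer.size(); i++)
            result.get(i % buckets).add(rangesForPeer.get(i));
        return result;
    }
}
```

Whether the extra connections actually help would still depend on where the real bottleneck is (deserialization, compaction, or disk I/O), but it gives the receiving node a way to use more than one core per peer.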
[GitHub] cassandra pull request #87: 12979 fix check disk space
GitHub user rustyrazorblade closed the pull request at: https://github.com/apache/cassandra/pull/87