Wouldn't it be easier to do the following?
1) Have the sender check the CRC, and not send the data if it doesn't match.
2) And once the stream ends, you could compare the two CRCs to see whether anything was corrupted during transfer?

Also, you could implement this in two pieces instead of reviewing the streaming architecture as a whole.

I am not familiar enough with the Cassandra code to back up these assumptions, so I just want to contribute (and I am actually trying to implement at least the first part).

Regards,

Carlos Juzarte Rolo
Cassandra Consultant / Datastax Certified Architect / Cassandra MVP

Pythian - Love your data

rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +351 918 918 100
www.pythian.com

On Mon, Sep 11, 2017 at 9:12 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> Agree
>
> A tricky detail about streaming is that:
>
> 1) On the sender side, the node just send the SSTable (without any other
> components like CRC files, partition index, partition summary etc...)
> 2) The sender does not even bother to de-serialize the SSTable data, it is
> just sending the stream of bytes by reading directly SSTables content from
> disk
> 3) On the receiver side, the node receives the bytes stream and needs to
> serialize it in memory to rebuild all the SSTable components (CRC files,
> partition index, partition summary ...)
>
> So the consequences are:
>
> a. there is a bottleneck on receiving side because of serialization
> b. if there is a bit rot in SSTables, since CRC files are not sent, no
> chance to detect it from receiving side
> c. if we want to include CRC checks in the streaming path, it's a whole
> review of the streaming architecture, not only adding some feature
>
> On Sat, Sep 9, 2017 at 10:06 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>> (Which isn't to say that someone shouldn't implement this; they should,
>> and there's probably a JIRA to do so already written, but it's a project of
>> volunteers, and nobody has volunteered to do the work yet)
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Sep 9, 2017, at 12:59 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> There is, but they aren't consulted on the streaming paths (only on
>> normal reads)
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Sep 9, 2017, at 12:02 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> Jeff,
>>
>> With default compression enabled on each table, isn't there CRC files
>> created along side with SSTables that can help detecting bit-rot ?
>>
>>
>> On Sat, Sep 9, 2017 at 7:50 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Cassandra doesn't do that automatically - it can guarantee consistency
>>> on read or write via ConsistencyLevel on each query, and it can run active
>>> (AntiEntropy) repairs. But active repairs must be scheduled (by human or
>>> cron or by third party script like http://cassandra-reaper.io/), and to
>>> be pedantic, repair only fixes consistency issue, there's some work to be
>>> done to properly address/support fixing corrupted replicas (for example,
>>> repair COULD send a bit flip from one node to all of the others)
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Sep 9, 2017, at 1:07 AM, Ralph Soika <ralph.so...@imixs.com> wrote:
>>>
>>> Hi,
>>>
>>> I am searching for a big data storage solution for the Imixs-Workflow
>>> project. I started with Hadoop until I became aware of the
>>> 'small-file-problem'. So I am considering using Cassandra now.
>>>
>>> But Hadoop has one important feature for me. The replicator continuously
>>> examines whether data blocks are consistent across all datanodes. This will
>>> detect disk errors and automatically move data from defective blocks to
>>> working blocks. I think this is called 'self-healing mechanism'.
>>>
>>> Is there a similar feature in Cassandra too?
>>>
>>>
>>> Thanks for help
>>>
>>> Ralph
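
For illustration, the two-step idea at the top of this message (the sender verifies its data before sending, and both sides compare CRCs once the stream ends) could look roughly like the sketch below. This is plain Python, not Cassandra code: the names `crc32_of`, `send`, `receive` and `CHUNK` are hypothetical, the "wire" is just an in-memory buffer, and a single whole-file CRC32 is an assumption made for simplicity rather than a description of how Cassandra stores its checksums.

```python
import io
import zlib

CHUNK = 64 * 1024  # hypothetical read size, chosen arbitrarily


def crc32_of(stream):
    """Return the CRC32 of everything readable from a binary stream."""
    crc = 0
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            return crc
        crc = zlib.crc32(chunk, crc)  # running CRC over successive chunks


def send(data_file, stored_crc, wire):
    """Step 1: check the local data against its stored CRC, then send it.

    Raises ValueError, and writes nothing to the wire, if the check fails.
    Returns the CRC so it can be shipped along with the stream.
    """
    if crc32_of(data_file) != stored_crc:
        raise ValueError("local data fails CRC check; refusing to stream")
    data_file.seek(0)  # second pass: actually transfer the bytes
    while True:
        chunk = data_file.read(CHUNK)
        if not chunk:
            break
        wire.write(chunk)
    return stored_crc


def receive(wire, sender_crc):
    """Step 2: recompute the CRC over the received bytes and compare."""
    wire.seek(0)
    return crc32_of(wire) == sender_crc


if __name__ == "__main__":
    data = b"partition data" * 1000
    wire = io.BytesIO()
    crc = send(io.BytesIO(data), zlib.crc32(data), wire)
    print("transfer intact:", receive(wire, crc))
```

Note that in this sketch the sender reads the data twice (once to verify, once to send), which is a real cost on large files; computing the CRC while streaming would avoid that, but then a mismatch is only discovered after the bytes have already left the node.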