Re: Cassandra nodes reduce disks per node
For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html If you are not done yet, every step is detailed in there.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-02-19 10:04 GMT+01:00 Alain RODRIGUEZ :

>> Alain, thanks for sharing! I'm confused why you do so many repetitive rsyncs. Just being cautious or is there another reason? Also, why do you have --delete-before when you're copying data to a temp (assumed empty) directory?

>> Since they are immutable I do a first sync while everything is up and running to the new location which runs really long. Meanwhile new ones are created and I sync them again online, much less files to copy now. After that I shutdown the node and my last rsync now has to copy only a few files which is quite fast and so the downtime for that node is within minutes.

> Jan's guess is right, except for the "immutable" thing: compaction can make big files go away, replaced by bigger ones you'll have to stream again.

> Here is a detailed explanation of why I did it this way.

> More precisely, let's say we have 10 files of 100 GB on the disk to remove (let's call it 'old-dir').

> I run a first rsync to an empty folder indeed (let's call this 'tmp-dir'), on the disk that will remain after the operation. Let's say this takes about 10 hours. This can be run in parallel though.

> So I now have 10 files of 100 GB in tmp-dir. But meanwhile one compaction triggered and I now have 6 files of 100 GB and 1 of 350 GB.

> At this point I disable compaction and stop the running ones.

> My second rsync has to remove the 4 files that were compacted away from tmp-dir, so that's why I use '--delete-before'. As this tmp-dir needs to mirror old-dir, this is fine. This new operation takes 3.5 hours, also runnable in parallel. (Keep in mind C* won't compact anything for those 3.5 hours, which is why I did not stop compaction before the first rsync; in my case the dataset was 2 TB.)

> At this point I have 950 GB in tmp-dir, but meanwhile clients continued to write to the disk, let's say 50 GB more.

> The 3rd rsync will take 0.5 hour; no compaction ran, so I just have to add the diff to tmp-dir. Still runnable in parallel.

> Then the script stops the node, so it should be run sequentially, and performs 2 more rsyncs: the first one takes the diff between the end of the 3rd rsync and the moment you stop the node, which should be a few seconds, minutes maybe, depending on how fast you ran the script after the 3rd rsync ended. The second rsync in the script is a 'useless' one, I just like to control things: I run it and expect it to say that there is no diff. It is just a way to stop the script if for some reason data is still being appended to old-dir.

> Then I just move all the files from tmp-dir to new-dir (the proper data dir remaining after the operation). This is an instant operation, as the files are not really moved: they are already on that disk, so it is just a filesystem rename.

> I finally unmount and rm -rf old-dir.

> So the full op takes 10 h + 3.5 h + 0.5 h + (number of nodes * 0.1 h), and nodes are down for about 5-10 min.

> VS

> Straightforward way (stop node, move, start node): 10 h * number of nodes, as this needs to be sequential. Plus each node is down for 10 hours, so you have to repair them, as that is longer than the hinted handoff window...
> Branton, I did not go through your process, but I guess you will be able to review it by yourself after reading the above (typically, repair is not needed if you use the strategy I describe above, as the node is down for 5-10 minutes). Also, I am not sure how "rsync -azvuiP /var/data/cassandra/data2/ /var/data/cassandra/data/" will behave; my guess is this is going to do a copy, so this might be very long. My script performs an instant move, and as the next command is 'rm -Rf /var/data/cassandra/data2' I see no reason to copy rather than move the files.

> Your solution would probably work, but with big constraints from an operational point of view (very long operation + repair needed).

> Hope this long email will be useful, maybe I should blog about this. Let me know if the process above makes sense or if some things might be improved.

> C*heers,
> -
> Alain Rodriguez
> France

> The Last Pickle
> http://www.thelastpickle.com

> 2016-02-19 7:19 GMT+01:00 Branton Davis :

>> Jan, thanks! That makes perfect sense to run a second time before stopping cassandra. I'll add that in when I do the production cluster.

>> On Fri, Feb 19, 2016 at 12:16 AM, Jan Kesten wrote:

>>> Hi Branton,

>>> two cents from me - I didn't look through the script, but for the rsyncs I do pretty much the same when moving them. Since they are immutable I do a first sync while
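For reference, the sequence described above boils down to something like the following shell sketch; the mount points, data paths and service commands are placeholders, not the actual script from the blog post:

    # pass 1: bulk copy while the node is fully live (longest pass, can run on all nodes in parallel)
    rsync -av /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # stop compactions so big SSTables stop appearing/disappearing under us
    nodetool disableautocompaction
    nodetool stop COMPACTION

    # pass 2: mirror old-dir again, dropping SSTables that compaction already replaced
    rsync -av --delete-before /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # pass 3: pick up whatever clients wrote in the meantime (still online)
    rsync -av /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # short downtime starts here
    nodetool drain && sudo service cassandra stop

    # pass 4: final diff (seconds); pass 5 is the paranoia check and should report no changes
    rsync -av --delete-before /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/
    rsync -av --delete-before /mnt/old-disk/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/

    # instant move (tmp-dir and the data dir sit on the same filesystem), then restart
    mv /mnt/remaining-disk/cassandra/tmp-dir/* /mnt/remaining-disk/cassandra/data/
    sudo service cassandra start
    # afterwards: unmount the old disk and remove the old data directory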
Handling uncommitted paxos state
Hi,

I have some questions about the behaviour of 'uncommitted paxos state', as described here: http://www.datastax.com/dev/blog/cassandra-error-handling-done-right

If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS write, that means that the paxos phase was successful, but the data couldn't be committed during the final 'commit/reset' phase. On the next SERIAL write or read, any other node can commit the write on behalf of the original proposer, and in fact must do so before forming a new ballot. This stops the columns from getting 'stuck' if the coordinator experiences a network partition after forming the ballot, but before committing.

My questions are on the durability of the uncommitted state:

Suppose CAS writes are infrequent, and it takes weeks before another write is done to that column; will the paxos state still be there, waiting forever until the next commit, or does it get automatically committed during GC if you wait long enough? (I don't see how it could be cleaned up by a GC though, since the nodes holding the paxos state don't know if the ballot was won or not.)

Or, what if all the nodes are switched off (briefly); is the uncommitted paxos state persisted to disk in the log/journal, so the write can still be completed when the cluster comes back online?

Finally, how granular is the paxos state? Will the uncommitted write be completed on the next SERIAL write that touches the same exact combination of cells, or is it per-column across the partition, or something else? If the CAS write touches two or three cells in the row, will a subsequent SERIAL read from any one of those columns complete the uncommitted state, presumably on the other columns as well?

Thanks for your help,
Nick

---
Nick Wilson
Software engineer, RealVNC
Re: Cassandra nodes reduce disks per node
Nice thanks !

On Thu, Feb 25, 2016 at 1:51 PM, Alain RODRIGUEZ wrote:

> For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
>
> If you are not done yet, every step is detailed in there.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
Re: Cassandra Data Audit
There is an open Jira on this exact topic - Change Data Capture (CDC): https://issues.apache.org/jira/browse/CASSANDRA-8844

Unfortunately, open means not yet done.

-- Jack Krupansky

On Thu, Feb 25, 2016 at 2:13 AM, Charulata Sharma (charshar) <chars...@cisco.com> wrote:

> Thanks for the responses. I was looking for something available in open-source Cassandra, not DSE. Looks like there isn't any, so I am planning to create a column family and have it populated.
>
> Thanks,
> Charu
>
> From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> Sent: Wednesday, February 24, 2016 2:57 AM
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Data Audit
>
> From http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/sec/secAuditCassandraTableColumns.html
>
> I guess you will not have the previous value that easily, yet all the operations seem to be logged, so looking for the last insert operation on a specific partition should give you the information you are looking for.
>
> Once again, I never used it, I just wanted to point this out to you since it could be what you are looking for. Maybe someone else will be able to give you some more detailed information. If you use DSE, you should be able to ask Datastax directly as this is DSE specific (AFAIK).
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-24 11:41 GMT+01:00 Raman Gugnani :
>
> Hi Alain,
>
> As per the document, which column of dse_audit.audit_log will hold the previous or new data?
>
> On Wed, Feb 24, 2016 at 3:59 PM, Alain RODRIGUEZ wrote:
>
> Hi Charu,
>
> Are you using DSE or open source Cassandra?
>
> I never used it, but DSE brings a feature that seems to be what you are looking for --> http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/sec/secAuditingCassandraTable.html
>
> Never heard about such a thing in the open source version though.
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-24 6:36 GMT+01:00 Charulata Sharma (charshar) :
>
> To all Cassandra experts out there,
>
> Can you please let me know if there is any inbuilt Cassandra feature that allows audits on column family data?
>
> When I change any data in a CF, I want to record that change, probably storing the old value as well as the changed one. One way of doing this is to create new CFs, but I wanted to know if there is any standard C* feature that could be used. Any guidance on this and implementation approaches would really help.
>
> Thanks,
> Charu
>
> --
> Thanks & Regards
>
> Raman Gugnani
> Senior Software Engineer | CaMS
> M: +91 8588892293 | T: 0124-660 | EXT: 14255
> ASF Centre A | 2nd Floor | CA-2130 | Udyog Vihar Phase IV | Gurgaon | Haryana | India
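As an illustration, the "create a column family and populate it yourself" approach Charu mentions could look roughly like the sketch below; the keyspace, table and column names are entirely hypothetical, since open source Cassandra has no built-in audit schema:

    # hypothetical application-maintained audit table: one row per changed cell
    cqlsh -e "CREATE TABLE IF NOT EXISTS myks.data_audit (
                  entity_id   text,
                  changed_at  timeuuid,
                  column_name text,
                  old_value   text,
                  new_value   text,
                  changed_by  text,
                  PRIMARY KEY ((entity_id), changed_at, column_name)
              ) WITH CLUSTERING ORDER BY (changed_at DESC, column_name ASC);"

The application would insert a row here alongside each mutation, so it only captures changes made through the application itself, unlike the CDC feature tracked in CASSANDRA-8844.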
Re: Cassandra nodes reduce disks per node
You're welcome, if you have some feedback you can comment the blog post :-).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-02-25 12:28 GMT+01:00 Anishek Agarwal :

> Nice thanks !
>
> On Thu, Feb 25, 2016 at 1:51 PM, Alain RODRIGUEZ wrote:
>
>> For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
>>
>> If you are not done yet, every step is detailed in there.
>>
>> C*heers,
>> ---
>> Alain Rodriguez - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
Re: how to read parent_repair_history table?
Hello Jimmy,

The parent_repair_history table keeps track of the start and finish information of a repair session. The other table, repair_history, keeps track of repair status as it progresses. So, you must first query the parent_repair_history table to check whether a repair started and finished, as well as its duration, and inspect the repair_history table to troubleshoot more specific details of a given repair session.

Answering your questions below:

> Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF regardless of whether it is across DCs, a local node repair, or other options?

Actually two entries, one for start and one for finish.

> A repair job is done only if the "finished" column contains a value? And a repair job is successfully done only if there is no value in exception_message or exception_stacktrace?

correct

> What is the purpose of the successful_ranges column? Do I have to check they all match requested_ranges to ensure a successful run?

correct

> Ultimately, how to find out the overall repair health/status in a given cluster?

Check if repair is being executed on all nodes within gc_grace_seconds, and tune that value or troubleshoot problems otherwise.

> Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?

Sounds good.

You can check https://issues.apache.org/jira/browse/CASSANDRA-5839 for more information.

2016-02-25 3:13 GMT-03:00 Jimmy Lin :

> hi all,
> A few questions regarding how to read or digest the system_distributed.parent_repair_history CF, which I am very interested in using to find out our repair status...
>
> Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF regardless of whether it is across DCs, a local node repair, or other options?
>
> A repair job is done only if the "finished" column contains a value? And a repair job is successfully done only if there is no value in exception_message or exception_stacktrace? What is the purpose of the successful_ranges column? Do I have to check they all match requested_ranges to ensure a successful run?
>
> Ultimately, how to find out the overall repair health/status in a given cluster? Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?
>
> ---
> CREATE TABLE system_distributed.parent_repair_history (
>     parent_id timeuuid PRIMARY KEY,
>     columnfamily_names set<text>,
>     exception_message text,
>     exception_stacktrace text,
>     finished_at timestamp,
>     keyspace_name text,
>     requested_ranges set<text>,
>     started_at timestamp,
>     successful_ranges set<text>
> )
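As a quick illustration (not part of the original mails), the check described above can be run from cqlsh using only the columns shown in the quoted schema:

    # list repair sessions; a missing finished_at or a non-empty exception_message
    # means the session did not complete cleanly
    cqlsh -e "SELECT parent_id, keyspace_name, columnfamily_names, started_at, finished_at, exception_message
              FROM system_distributed.parent_repair_history;"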
Consistent read timeouts for bursts of reads
Hello, We're having a problem with concurrent requests. It seems that whenever we try resolving more than ~ 15 queries at the same time, one or two get a read timeout and then succeed on a retry. We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on AWS. What we've found while investigating: * this is not db-wide. Trying the same pattern against another table everything works fine. * it fails 1 or 2 requests regardless of how many are executed in parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent requests and doesn't seem to scale up. * the problem is consistently reproducible. It happens both under heavier load and when just firing off a single batch of requests for testing. * tracing the faulty requests says everything is great. An example trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a * the only peculiar thing in the logs is there's no acknowledgement of the request being accepted by the server, as seen in https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a * there's nothing funny in the timed out Cassandra node's logs around that time as far as I can tell, not even in the debug logs. Any ideas about what might be causing this, pointers to server config options, or how else we might debug this would be much appreciated. Kind regards, Emils
Re: Consistent read timeouts for bursts of reads
Having had a read through the archives, I missed this at first, but this seems to be *exactly* like what we're experiencing. http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html Only difference is we're getting this for reads and using CQL, but the behaviour is identical. On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis wrote: > Hello, > > We're having a problem with concurrent requests. It seems that whenever we > try resolving more > than ~ 15 queries at the same time, one or two get a read timeout and then > succeed on a retry. > > We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on > AWS. > > What we've found while investigating: > > * this is not db-wide. Trying the same pattern against another table > everything works fine. > * it fails 1 or 2 requests regardless of how many are executed in > parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent > requests and doesn't seem to scale up. > * the problem is consistently reproducible. It happens both under heavier > load and when just firing off a single batch of requests for testing. > * tracing the faulty requests says everything is great. An example trace: > https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a > * the only peculiar thing in the logs is there's no acknowledgement of > the request being accepted by the server, as seen in > https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a > * there's nothing funny in the timed out Cassandra node's logs around > that time as far as I can tell, not even in the debug logs. > > Any ideas about what might be causing this, pointers to server config > options, or how else we might debug this would be much appreciated. > > Kind regards, > Emils > >
Re: how to read parent_repair_history table?
Hi Jimmy,

We are on 2.0.x. We are planning to use JMX notifications for getting repair status. To repair the database, we call the forceTableRepairPrimaryRange JMX operation from our Java client application on each node. You can call other, more recent JMX methods for repair.

I would be keen to know the pros/cons of handling repair status via JMX notifications vs. via database tables.

We are planning to implement it as follows:

1. Before repairing each keyspace via JMX, register two listeners: one for listening to StorageService MBean notifications about repair status, and the other a connection listener for detecting connection failures and lost JMX notifications.

2. We ensure that if 256 successful session notifications are received, the keyspace repair is successful. We have 256 ranges on each node.

3. If there are connection closed notifications, we will re-register the MBean listener and retry the repair once.

4. If there are lost notifications, we retry the repair once before failing it.

Thanks
Anuj

Sent from Yahoo Mail on Android
Re: how to read parent_repair_history table?
Hi Anuj,

I never thought of using JMX notifications as a way to check. Partially, I think it requires a live connection or application to keep the notifications flowing in, while the DB approach lets you look up current or past jobs whenever you want.

Thanks

Sent from my iPhone
Re: how to read parent_repair_history table?
hi Paulo,

Follow up on the # of entries question...

Why will each repair job execution have 2 entries? I thought it would be one entry, beginning with the started_at column filled, and when it completed, the finished_at column would be filled.

Also, if my cluster has more than 1 keyspace, the way this table is structured, it will have multiple entries, one for each keyspace_name value, no?

Thanks

Sent from my iPhone
Re: Handling uncommitted paxos state
The paxos state is written to a system table (system.paxos) on each of the paxos coordinators, so it goes through the normal write path, including persisting to the commit log and being stored in a memtable until being flushed to disk. As such, the state can survive restarts.

These states are not treated differently from our normal memtables, so there isn't any special handling for a GC. There is no process which will come in and fix up the values; they are fixed at a partition level when trying to perform a CAS operation, or when reading at SERIAL consistency. This operation happens at the partition level, so if any part of the partition is read or updated, it will finish previous transactions.

If you want to know more, http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 has a lot more information about lightweight transactions.

-Carl
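If you want to look at that state directly, system.paxos can be read like any other table; a small sketch, with column names taken from the 2.x schema (worth verifying against your version):

    # row_key is the partition key the CAS touched, cf_id identifies the table;
    # the ballot/commit columns show in-flight and most recently committed state
    cqlsh -e "SELECT row_key, cf_id, in_progress_ballot, proposal_ballot, most_recent_commit_at
              FROM system.paxos LIMIT 20;"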
Re: how to read parent_repair_history table?
> Why will each repair job execution have 2 entries? I thought it would be one entry, beginning with the started_at column filled, and when it completed, the finished_at column would be filled.

that's correct, I was mistaken!

> Also, if my cluster has more than 1 keyspace, the way this table is structured, it will have multiple entries, one for each keyspace_name value, no? thanks

right, because repair sessions in different keyspaces will have different repair session ids.
Re: Handling uncommitted paxos state
On Thu, Feb 25, 2016 at 1:23 AM, Nicholas Wilson < nicholas.wil...@realvnc.com> wrote: > If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS > write, that means that the paxos phase was successful, but the data > couldn't be committed during the final 'commit/reset' phase. On the next > SERIAL write or read, any other node can commit the write on behalf of the > original proposer, and must do so in fact before forming a new ballot. The > stops the columns from getting 'stuck' if the coordinator experiences a > network partition after forming the ballot, but before committing. > If you're asking these questions, you probably want to read : https://issues.apache.org/jira/browse/CASSANDRA-9328 =Rob
Re: how to read parent_repair_history table?
hi Paulo,

One more follow up ... :)

I noticed these tables are supposed to be replicated to all nodes in the cluster, and are not per-node specific.

How does it work when a repair job targets only the local DC vs all DCs? Are there any columns or flags where I can tell the difference? Or does it actually matter?

Thanks

Sent from my iPhone
Checking replication status
hi all,

What are the better ways to check the overall replication status of a Cassandra cluster?

Within a single DC, unless a node is down for a long time, most of the time I feel it is pretty much a non-issue and things are replicated pretty fast. But when a node comes back after being offline for a long time, is there a way to check that the node has finished its data sync with the other nodes?

Now across DCs, we have frequent VPN outages (sometimes short, sometimes long) between DCs. I would also like to know if there is a way to find out how replication between DCs is catching up under this condition.

Also, if I understand correctly, the only guaranteed way to make sure data is synced is to run a complete repair job, is that correct? I am trying to see if there is a way to "force a quick replication sync" between DCs after a VPN outage. Or maybe this is unnecessary, as Cassandra will catch up as fast as it can and there is nothing else we/(system admins) can do to make it faster or better?

Sent from my iPhone
CsvReporter not spitting out metrics in cassandra
Hi,

I have added the following file on my cassandra node, /etc/dse/cassandra/metrics-reporter-config.yaml:

csv:
  - outdir: '/mnt/cassandra/metrics'
    period: 10
    timeunit: 'SECONDS'
    predicate:
      color: "white"
      useQualifiedName: true
      patterns:
        - "^org.apache.cassandra.metrics.Cache.+"
        - "^org.apache.cassandra.metrics.ClientRequest.+"
        - "^org.apache.cassandra.metrics.CommitLog.+"
        - "^org.apache.cassandra.metrics.Compaction.+"
        - "^org.apache.cassandra.metrics.DroppedMetrics.+"
        - "^org.apache.cassandra.metrics.ReadRepair.+"
        - "^org.apache.cassandra.metrics.Storage.+"
        - "^org.apache.cassandra.metrics.ThreadPools.+"
        - "^org.apache.cassandra.metrics.ColumnFamily.+"
        - "^org.apache.cassandra.metrics.Streaming.+"

And then added this line to /etc/dse/cassandra/cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=metrics-reporter-config.yaml"

And then finally restarted DSE: /etc/init.d/dse restart

I don't see any CSV metrics files being written out by the metrics reporter in the /mnt/cassandra/metrics folder. Any ideas why?
Re: Checking replication status
Hmm. What are your processes when a node comes back after "a long offline"? Long enough to take the node offline and do a repair? Run the risk of serving stale data? Parallel repairs? ???

So, what sort of time frames are "a long time"?

...
Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin wrote:

> hi all,
>
> what are the better ways to check replication overall status of cassandra cluster?
>
> within a single DC, unless a node is down for long time, most of the time i feel it is pretty much non-issue and things are replicated pretty fast. But when a node come back from a long offline, is there a way to check that the node has finished its data sync with other nodes ?
>
> Now across DC, we have frequent VPN outage (sometime short sometims long) between DCs, i also like to know if there is a way to find how the replication progress between DC catching up under this condtion?
>
> Also, if i understand correctly, the only gaurantee way to make sure data are synced is to run a complete repair job, is that correct? I am trying to see if there is a way to "force a quick replication sync" between DCs after vpn outage. Or maybe this is unnecessary, as Cassandra will catch up as fast as it can, there is nothing else we/(system admin) can do to make it faster or better?
>
> Sent from my iPhone
Re: how to read parent_repair_history table?
> How does it work when a repair job targets only the local DC vs all DCs? Are there any columns or flags where I can tell the difference? Or does it actually matter?

You can not easily find out from the parent_repair_history table whether a repair is local-only or multi-DC. I created https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more information to that table.

Since that table only has the id as primary key, you'd need to do a full scan to perform checks on it, or keep track of the parent session id when submitting the repair and query by primary key.

What you could probably do to health-check that your nodes are repaired on time is to check, for each table:

select * from repair_history where keyspace_name = 'ks' and columnfamily_name = 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2);

And then verify for each node whether all of its ranges have been repaired in this period, and send an alert otherwise. You can find out a node's ranges by querying JMX via StorageServiceMBean.getRangeToEndpointMap.

To make this task a bit simpler you could probably add a secondary index to the participants column of the repair_history table with:

CREATE INDEX myindex ON system_distributed.repair_history (participants);

and check each node's status individually with:

select * from repair_history where keyspace_name = 'ks' and columnfamily_name = 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2) AND participants CONTAINS 'node_IP';
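If JMX is inconvenient, nodetool can print a similar range-to-endpoint view (an editorial suggestion, not part of Paulo's reply); the keyspace name is a placeholder:

    # token ranges and their replica endpoints for one keyspace; these are the ranges
    # you would expect to see covered by recent repair_history entries for each node
    nodetool describering my_keyspace

    # effective ownership per node for the same keyspace
    nodetool status my_keyspace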
Unexpected high internode network activity
Hello,

We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications. There's a total of 21 nodes across 3 AWS availability zones, on c3.2xlarge instances.

The configuration is pretty standard: we use the default settings that come with the DataStax AMI, and the driver in our application is configured to use lz4 compression. The keyspace where all the activity happens has RF 3, and we read and write at QUORUM to get strong consistency.

While analyzing our monthly bill, we noticed that the amount of network traffic related to Cassandra was significantly higher than expected. After breaking it down by port, it seems like over any given time window the internode network activity is 6-7 times higher than the traffic on port 9042, whereas we would expect something around 2-3 times, given the replication factor and the consistency level of our queries.

For example, this is the network traffic broken down by port and direction over a few minutes, measured as the sum over all nodes:

Port 9042 from client to cluster (write queries): 1 GB
Port 9042 from cluster to client (read queries): 1.5 GB
Port 7000: 35 GB, which must be divided by two because the traffic is always directed to another instance of the cluster, so that makes it 17.5 GB of generated traffic

The traffic on port 9042 completely matches our expectations: we do about 100k write operations, writing 10KB binary blobs in each query, and a bit more reads on the same data.

According to our calculations, in the worst case, when the coordinator of the query is not a replica for the data, this should generate about (1 + 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.

Also, hinted handoffs are disabled and nodes were healthy over the period of observation, and I get the same numbers across pretty much every time window, even including an entire 24-hour period.

I tried to replicate this problem in a test environment, so I connected a client to a test cluster made of a bunch of Docker containers (same parameters; essentially the only difference is the GossipingPropertyFileSnitch instead of the EC2 one), and I always get what I expect: the amount of traffic on port 7000 is between 2 and 3 times the amount of traffic on port 9042, and the queries are pretty much the same ones.

Before doing more analysis, I was wondering if someone has an explanation for this behavior, since perhaps we are missing something obvious here?

Thanks
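As a back-of-the-envelope sketch of the expectation described above (the numbers are the ones quoted in the message, and the "worst case" treats both reads and writes as touching all RF replicas via a non-replica coordinator):

# Rough model of expected inter-node (port 7000) traffic using the figures
# from the message above. It reproduces the poster's worst-case estimate;
# it is not an exact accounting of Cassandra's messaging overhead.
RF = 3

client_writes_gb = 1.0   # client -> cluster on 9042
client_reads_gb = 1.5    # cluster -> client on 9042

# Worst case: the coordinator is never a replica, so every write is forwarded
# to all RF replicas, and (pessimistically) reads pull full data from all RF
# replicas rather than one full copy plus digests.
expected_internode_gb = (client_writes_gb + client_reads_gb) * RF   # 7.5 GB

measured_port7000_gb = 35 / 2.0   # each transfer is seen by both sender and receiver

print("expected ~%.1f GB, measured ~%.1f GB, ratio %.1fx"
      % (expected_internode_gb, measured_port7000_gb,
         measured_port7000_gb / expected_internode_gb))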
Migrating from single node to cluster
Hi,

I am wondering if there is any documentation on migrating from a single-node Cassandra instance to a multi-node cluster? My searches have been unsuccessful so far, and I have had no luck experimenting with the tools because of their terse output.

I currently use a single node holding data that must be retained, and I want to add two nodes to create a cluster. I have tried to follow the instructions at the link below, but it is unclear whether it even works to go from 1 node to 2.

https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html

Almost no data has been transferred across, and nodetool status is showing that 0% of the data is owned by either node, although I cannot determine what the percentages should be in the case where the configuration is intended for data redundancy.

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.10.8  648.16 MB  256     0.0%              5ce4f8ff-3ba4-41b2-8fd5-7d00d98c415f  rack1
UN  192.168.10.9  3.31 MB    256     0.0%              b56f6d58-0f60-473f-b202-f43ecc7a83f5  rack1

I also looked to see if there were any tools to check whether replication is in progress but had no luck. The second node is bootstrapped, and nodetool repair indicates that nothing needs to be done.

Any suggestions on a path to take? I am at a loss.

Thanks,
Jason
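One thing that can make the ownership numbers above easier to interpret is that "Owns (effective)" is computed relative to a specific keyspace's replication settings. A minimal sketch, assuming nodetool is on the PATH and using a placeholder keyspace name:

# Hypothetical helper (not from the thread): ask nodetool for ownership of a
# specific keyspace, since effective ownership depends on that keyspace's
# replication strategy and factor.
import subprocess

KEYSPACE = 'my_keyspace'   # placeholder; use the keyspace being migrated

out = subprocess.check_output(['nodetool', 'status', KEYSPACE]).decode()
print(out)
# With replication configured to cover both nodes, the effective ownership
# percentages across nodes should add up to RF * 100%.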
Re: how to read parent_repair_history table?
hi Paulo, that is right, I forgot there is another table that actually tracking the rest of the detail of the repairs. thanks for the pointers, will explore more with those info. I am actually surprised not much doc out there talk about these two tables, or other tools or utilities harvesting these data. (?) thanks On Thu, Feb 25, 2016 at 1:38 PM, Paulo Motta wrote: > > how does it work when repair job targeting only local vs all DC? is > there any columns or flag i can tell the difference? or does it actualy > matter? > > You can not easily find out from the parent_repair_session table if a > repair is local-only or multi-dc. I created > https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more > information to that table. Since that table only has id as primary key, > you'd need to do a full scan to perform checks on it, or keep track of the > parent id session when submitting the repair and query by primary key. > > What you could probably do to health check your nodes are repaired on time > is to check for each table: > > select * from repair_history where keyspace = 'ks' columnfamily_name = > 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2); > > And then verify for each node if all of its ranges have been repaired in > this period, and send an alert otherwise. You can find out a nodes range by > querying JMX via StorageServiceMBean.getRangeToEndpointMap. > > To make this task a bit simpler you could probably add a secondary index > to the participants column of repair_history table with: > > CREATE INDEX myindex ON system_distributed.repair_history (participants) ; > > and check each node status individually with: > > select * from repair_history where keyspace = 'ks' columnfamily_name = > 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2) AND participants > CONTAINS 'node_IP'; > > > > 2016-02-25 16:22 GMT-03:00 Jimmy Lin : > >> hi Paulo, >> >> one more follow up ... :) >> >> I noticed these tables are suppose to replicatd to all nodes in the >> cluster, and it is not per node specific. >> >> how does it work when repair job targeting only local vs all DC? is there >> any columns or flag i can tell the difference? >> or does it actualy matter? >> >> thanks >> >> >> >> >> Sent from my iPhone >> >> On Feb 25, 2016, at 10:37 AM, Paulo Motta >> wrote: >> >> > why each job repair execution will have 2 entries? I thought it will >> be one entry, begining with started_at column filled, and when it >> completed, finished_at column will be filled. >> >> that's correct, I was mistaken! >> >> > Also, if my cluster has more than 1 keyspace, and the way this table >> is structured, it will have multiple entries, one for each keysapce_name >> value. no ? thanks >> >> right, because repair sessions in different keyspaces will have different >> repair session ids. >> >> 2016-02-25 15:04 GMT-03:00 Jimmy Lin : >> >>> hi Paulo, >>> >>> follow up on the # of entries question... >>> >>> why each job repair execution will have 2 entries? >>> I thought it will be one entry, begining with started_at column filled, and >>> when it completed, finished_at column will be filled. >>> >>> Also, if my cluster has more than 1 keyspace, and the way this table is >>> structured, it will have multiple entries, one for each keysapce_name >>> value. no ? >>> >>> thanks >>> >>> >>> >>> Sent from my iPhone >>> >>> On Feb 25, 2016, at 5:48 AM, Paulo Motta >>> wrote: >>> >>> Hello Jimmy, >>> >>> The parent_repair_history table keeps track of start and finish >>> information of a repair session. 
The other table repair_history keeps >>> track of repair status as it progresses. So, you must first query the >>> parent_repair_history table to check if a repair started and finish, as >>> well as its duration, and inspect the repair_history table to troubleshoot >>> more specific details of a given repair session. >>> >>> Answering your questions below: >>> >>> > Is every invocation of nodetool repair execution will be recorded as >>> one entry in parent_repair_history CF regardless if it is across DC, local >>> node repair, or other options ? >>> >>> Actually two entries, one for start and one for finish. >>> >>> > A repair job is done only if "finished" column contains value? and a >>> repair job is successfully done only if there is no value in exce >>> ption_messages or exception_stacktrace ? >>> >>> correct >>> >>> > what is the purpose of successful_ranges column? do i have to check >>> they are all matched with requested_range to ensure a successful run? >>> >>> correct >>> >>> - >>> > Ultimately, how to find out the overall repair health/status in a >>> given cluster? >>> >>> Check if repair is being executed on all nodes within gc_grace_seconds, >>> and tune that value or troubleshoot problems otherwise. >>> >>> > Scanning through parent_repair_history and making sure all the known >>> keyspaces has a good repair run in recent days? >>> >>> Sounds good. >>> >>> You can check https://issues.apache.org/jira/
Re: Checking replication status
so far they are not long, just some config changes and a restart. if it is a 2 hr downtime due to whatever reason, is a repair a better option than trying to figure out whether the replication sync finished or not?

On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle wrote:

> Hmm. What are your processes when a node comes back after "a long offline"? Long enough to take the node offline and do a repair? Run the risk of serving stale data? Parallel repairs? ???
>
> So, what sort of time frames are "a long time"?
>
> *...*
>
> *Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872*
>
> On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin wrote:
>
>> hi all,
>>
>> what are the better ways to check the overall replication status of a cassandra cluster?
>>
>> within a single DC, unless a node is down for a long time, most of the time i feel it is pretty much a non-issue and things are replicated pretty fast. But when a node comes back from a long offline period, is there a way to check that the node has finished its data sync with the other nodes?
>>
>> Now across DCs, we have frequent VPN outages (sometimes short, sometimes long) between DCs; i would also like to know if there is a way to find out how the replication between DCs is catching up under this condition?
>>
>> Also, if i understand correctly, the only guaranteed way to make sure data is synced is to run a complete repair job, is that correct? I am trying to see if there is a way to "force a quick replication sync" between DCs after a vpn outage. Or maybe this is unnecessary, as Cassandra will catch up as fast as it can, and there is nothing else we/(system admin) can do to make it faster or better?
>>
>> Sent from my iPhone
>
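As a rough spot check (a full repair remains the only guarantee, as noted in the question above), reading a sample of recently written keys at ConsistencyLevel.ALL will fail while any replica cannot answer and, when replicas disagree, triggers a foreground read repair for those keys. A minimal sketch, assuming the Python cassandra-driver and placeholder keyspace, table, and key values:

# Rough spot check, not a substitute for repair: probe a few recently written
# keys at CL ALL. The read fails if any replica is unreachable, and on digest
# mismatch the coordinator resolves and repairs those keys in the foreground.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

SAMPLE_KEYS = ['key1', 'key2', 'key3']   # placeholder: recently written keys

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

stmt = SimpleStatement("SELECT id FROM my_table WHERE id = %s",
                       consistency_level=ConsistencyLevel.ALL)

for key in SAMPLE_KEYS:
    try:
        session.execute(stmt, (key,), timeout=10)
        print("%s: all replicas answered" % key)
    except Exception as exc:   # e.g. Unavailable/ReadTimeout while a DC lags
        print("%s: not yet consistent everywhere (%s)" % (key, exc))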
Re: Unexpected high internode network activity
If read & write at quorum then you write 3 copies of the data then return to the caller; when reading you read one copy (assume it is not on the coordinator), and 1 digest (because read at quorum is 2, not 3). When you insert, how many keyspaces get written to? (Are you using e.g. inverted indices?) That is my guess, that your db has about 1.8 bytes written for every byte inserted. Every byte you write is counted also as a read (system a sends 1gb to system b, so system b receives 1gb). You would not be charged if intra AZ, but inter AZ and inter DC will get that double count. So, my guess is reverse indexes, and you forgot to include receive and transmit. *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello wrote: > Hello, > > We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications. > There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge > instances. > > The configuration is pretty standard, we use the default settings that > come with the datastax AMI and the driver in our application is configured > to use lz4 compression. The keyspace where all the activity happens has RF > 3 and we read and write at quorum to get strong consistency. > > While analyzing our monthly bill, we noticed that the amount of network > traffic related to Cassandra was significantly higher than expected. After > breaking it down by port, it seems like over any given time, the internode > network activity is 6-7 times higher than the traffic on port 9042, whereas > we would expect something around 2-3 times, given the replication factor > and the consistency level of our queries. > > For example, this is the network traffic broken down by port and direction > over a few minutes, measured as sum of each node: > > Port 9042 from client to cluster (write queries): 1 GB > Port 9042 from cluster to client (read queries): 1.5 GB > Port 7000: 35 GB, which must be divided by two because the traffic is > always directed to another instance of the cluster, so that makes it 17.5 > GB generated traffic > > The traffic on port 9042 completely matches our expectations, we do about > 100k write operations writing 10KB binary blobs for each query, and a bit > more reads on the same data. > > According to our calculations, in the worst case, when the coordinator of > the query is not a replica for the data, this should generate about (1 + > 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more. > > Also, hinted handoffs are disabled and nodes are healthy over the period > of observation, and I get the same numbers across pretty much every time > window, even including an entire 24 hours period. > > I tried to replicate this problem in a test environment so I connected a > client to a test cluster done in a bunch of Docker containers (same > parameters, essentially the only difference is the > GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I > expect, the amount of traffic on port 7000 is between 2 and 3 times the > amount of traffic on port 9042 and the queries are pretty much the same > ones. > > Before doing more analysis, I was wondering if someone has an explanation > on this problem, since perhaps we are missing something obvious here? > > Thanks > > >
Re: Unexpected high internode network activity
Thank you for your reply. To answer your points: - I fully agree on the write volume, in fact my isolated tests confirm your estimation - About the read, I agree as well, but the volume of data is still much higher - I am writing to one single keyspace with RF 3, there's just one keyspace - I am not using any indexes, the column families are very simple - I am aware of the double count, in fact, I measured the traffic on port 9042 at the client side (so just counted once) and I divided by two the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the measurements have been done with iftop with proper bpf filters on the port and the total traffic matches what I see in cloudwatch (divided by two) So unfortunately I still don't have any ideas about what's going on and why I'm seeing 17 GB of internode traffic instead of ~ 5-6. On Thursday, February 25, 2016, daemeon reiydelle wrote: > If read & write at quorum then you write 3 copies of the data then return > to the caller; when reading you read one copy (assume it is not on the > coordinator), and 1 digest (because read at quorum is 2, not 3). > > When you insert, how many keyspaces get written to? (Are you using e.g. > inverted indices?) That is my guess, that your db has about 1.8 bytes > written for every byte inserted. > > Every byte you write is counted also as a read (system a sends 1gb to > system b, so system b receives 1gb). You would not be charged if intra AZ, > but inter AZ and inter DC will get that double count. > > So, my guess is reverse indexes, and you forgot to include receive and > transmit. > > > > *...* > > > > *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* > > On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello > wrote: > >> Hello, >> >> We have a Cassandra 2.1.9 cluster on EC2 for one of our live >> applications. There's a total of 21 nodes across 3 AWS availability zones, >> c3.2xlarge instances. >> >> The configuration is pretty standard, we use the default settings that >> come with the datastax AMI and the driver in our application is configured >> to use lz4 compression. The keyspace where all the activity happens has RF >> 3 and we read and write at quorum to get strong consistency. >> >> While analyzing our monthly bill, we noticed that the amount of network >> traffic related to Cassandra was significantly higher than expected. After >> breaking it down by port, it seems like over any given time, the internode >> network activity is 6-7 times higher than the traffic on port 9042, whereas >> we would expect something around 2-3 times, given the replication factor >> and the consistency level of our queries. >> >> For example, this is the network traffic broken down by port and >> direction over a few minutes, measured as sum of each node: >> >> Port 9042 from client to cluster (write queries): 1 GB >> Port 9042 from cluster to client (read queries): 1.5 GB >> Port 7000: 35 GB, which must be divided by two because the traffic is >> always directed to another instance of the cluster, so that makes it 17.5 >> GB generated traffic >> >> The traffic on port 9042 completely matches our expectations, we do about >> 100k write operations writing 10KB binary blobs for each query, and a bit >> more reads on the same data. >> >> According to our calculations, in the worst case, when the coordinator of >> the query is not a replica for the data, this should generate about (1 + >> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more. 
>> >> Also, hinted handoffs are disabled and nodes are healthy over the period >> of observation, and I get the same numbers across pretty much every time >> window, even including an entire 24 hours period. >> >> I tried to replicate this problem in a test environment so I connected a >> client to a test cluster done in a bunch of Docker containers (same >> parameters, essentially the only difference is the >> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I >> expect, the amount of traffic on port 7000 is between 2 and 3 times the >> amount of traffic on port 9042 and the queries are pretty much the same >> ones. >> >> Before doing more analysis, I was wondering if someone has an explanation >> on this problem, since perhaps we are missing something obvious here? >> >> Thanks >> >> >> >
Re: Unexpected high internode network activity
Intriguing. It's enough data to look like full data is coming from the replicants instead of digests when the read of the copy occurs. Are you doing backup/dr? Are directories copied regularly and over the network or ? *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello wrote: > Thank you for your reply. > > To answer your points: > > - I fully agree on the write volume, in fact my isolated tests confirm > your estimation > > - About the read, I agree as well, but the volume of data is still much > higher > > - I am writing to one single keyspace with RF 3, there's just one keyspace > > - I am not using any indexes, the column families are very simple > > - I am aware of the double count, in fact, I measured the traffic on port > 9042 at the client side (so just counted once) and I divided by two the > traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the > measurements have been done with iftop with proper bpf filters on the > port and the total traffic matches what I see in cloudwatch (divided by two) > > So unfortunately I still don't have any ideas about what's going on and > why I'm seeing 17 GB of internode traffic instead of ~ 5-6. > > On Thursday, February 25, 2016, daemeon reiydelle > wrote: > >> If read & write at quorum then you write 3 copies of the data then return >> to the caller; when reading you read one copy (assume it is not on the >> coordinator), and 1 digest (because read at quorum is 2, not 3). >> >> When you insert, how many keyspaces get written to? (Are you using e.g. >> inverted indices?) That is my guess, that your db has about 1.8 bytes >> written for every byte inserted. >> >> Every byte you write is counted also as a read (system a sends 1gb to >> system b, so system b receives 1gb). You would not be charged if intra AZ, >> but inter AZ and inter DC will get that double count. >> >> So, my guess is reverse indexes, and you forgot to include receive and >> transmit. >> >> >> >> *...* >> >> >> >> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 >> <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 >> <%28%2B44%29%20%280%29%2020%208144%209872>* >> >> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello >> wrote: >> >>> Hello, >>> >>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live >>> applications. There's a total of 21 nodes across 3 AWS availability zones, >>> c3.2xlarge instances. >>> >>> The configuration is pretty standard, we use the default settings that >>> come with the datastax AMI and the driver in our application is configured >>> to use lz4 compression. The keyspace where all the activity happens has RF >>> 3 and we read and write at quorum to get strong consistency. >>> >>> While analyzing our monthly bill, we noticed that the amount of network >>> traffic related to Cassandra was significantly higher than expected. After >>> breaking it down by port, it seems like over any given time, the internode >>> network activity is 6-7 times higher than the traffic on port 9042, whereas >>> we would expect something around 2-3 times, given the replication factor >>> and the consistency level of our queries. 
>>> >>> For example, this is the network traffic broken down by port and >>> direction over a few minutes, measured as sum of each node: >>> >>> Port 9042 from client to cluster (write queries): 1 GB >>> Port 9042 from cluster to client (read queries): 1.5 GB >>> Port 7000: 35 GB, which must be divided by two because the traffic is >>> always directed to another instance of the cluster, so that makes it 17.5 >>> GB generated traffic >>> >>> The traffic on port 9042 completely matches our expectations, we do >>> about 100k write operations writing 10KB binary blobs for each query, and a >>> bit more reads on the same data. >>> >>> According to our calculations, in the worst case, when the coordinator >>> of the query is not a replica for the data, this should generate about (1 + >>> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more. >>> >>> Also, hinted handoffs are disabled and nodes are healthy over the period >>> of observation, and I get the same numbers across pretty much every time >>> window, even including an entire 24 hours period. >>> >>> I tried to replicate this problem in a test environment so I connected a >>> client to a test cluster done in a bunch of Docker containers (same >>> parameters, essentially the only difference is the >>> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I >>> expect, the amount of traffic on port 7000 is between 2 and 3 times the >>> amount of traffic on port 9042 and the queries are pretty much the same >>> ones. >>> >>> Before doing more analysis, I was wondering if someone has an >>> explanation on this problem, since perhaps we are missing something obvious >>> here? >>> >>> Thanks >>> >>> >>> >>
Re: Unexpected high internode network activity
It is indeed very intriguing and I really hope to learn more from the experience of this mailing list. To address your points: - The theory that full data is coming from replicas during reads is not enough to explain the situation. In my scenario, over a time window I had 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of reads (measured on port 9042), so even if both reads and writes affected all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on port 7000 unaccounted - We are doing regular backups the standard way, using periodic snapshots and synchronizing them to S3. This traffic is not part of the anomalous traffic we're seeing above, since this one goes on port 80 and it's clearly visible with a separate bpf filter, and its magnitude is far lower than that anyway Thanks On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle wrote: > Intriguing. It's enough data to look like full data is coming from the > replicants instead of digests when the read of the copy occurs. Are you > doing backup/dr? Are directories copied regularly and over the network or ? > > > *...* > > > > *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 > <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 > <%28%2B44%29%20%280%29%2020%208144%209872>* > > On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello > wrote: > >> Thank you for your reply. >> >> To answer your points: >> >> - I fully agree on the write volume, in fact my isolated tests confirm >> your estimation >> >> - About the read, I agree as well, but the volume of data is still much >> higher >> >> - I am writing to one single keyspace with RF 3, there's just one >> keyspace >> >> - I am not using any indexes, the column families are very simple >> >> - I am aware of the double count, in fact, I measured the traffic on port >> 9042 at the client side (so just counted once) and I divided by two the >> traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the >> measurements have been done with iftop with proper bpf filters on the >> port and the total traffic matches what I see in cloudwatch (divided by two) >> >> So unfortunately I still don't have any ideas about what's going on and >> why I'm seeing 17 GB of internode traffic instead of ~ 5-6. >> >> On Thursday, February 25, 2016, daemeon reiydelle >> wrote: >> >>> If read & write at quorum then you write 3 copies of the data then >>> return to the caller; when reading you read one copy (assume it is not on >>> the coordinator), and 1 digest (because read at quorum is 2, not 3). >>> >>> When you insert, how many keyspaces get written to? (Are you using e.g. >>> inverted indices?) That is my guess, that your db has about 1.8 bytes >>> written for every byte inserted. >>> >>> Every byte you write is counted also as a read (system a sends 1gb to >>> system b, so system b receives 1gb). You would not be charged if intra AZ, >>> but inter AZ and inter DC will get that double count. >>> >>> So, my guess is reverse indexes, and you forgot to include receive and >>> transmit. >>> >>> >>> >>> *...* >>> >>> >>> >>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 >>> <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 >>> <%28%2B44%29%20%280%29%2020%208144%209872>* >>> >>> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello >>> wrote: >>> Hello, We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications. There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge instances. 
The configuration is pretty standard, we use the default settings that come with the datastax AMI and the driver in our application is configured to use lz4 compression. The keyspace where all the activity happens has RF 3 and we read and write at quorum to get strong consistency. While analyzing our monthly bill, we noticed that the amount of network traffic related to Cassandra was significantly higher than expected. After breaking it down by port, it seems like over any given time, the internode network activity is 6-7 times higher than the traffic on port 9042, whereas we would expect something around 2-3 times, given the replication factor and the consistency level of our queries. For example, this is the network traffic broken down by port and direction over a few minutes, measured as sum of each node: Port 9042 from client to cluster (write queries): 1 GB Port 9042 from cluster to client (read queries): 1.5 GB Port 7000: 35 GB, which must be divided by two because the traffic is always directed to another instance of the cluster, so that makes it 17.5 GB generated traffic The traffic on port 9042 completely matches our expectations, we do about 100k write operations writing 10KB binary blobs for each query, and a bit more reads on the same data. According to our calculations, in the worst case, wh
RE: CsvReporter not spitting out metrics in cassandra
Hi,

I configured this reporter recently with Apache Cassandra 2.1.x and had no trouble. Here are some points to check:

- The directory "/etc/dse/cassandra" has to be in the classpath (I'm not a DSE user, so I don't know if that is already the case).
- If the CsvReporter fails to start (permissions issue on the output directory?), you should have some logs at ERROR level in your Cassandra log files.

Eric

From: Vikram Kone [mailto:vikramk...@gmail.com]
Sent: Thursday, February 25, 2016 21:41
To: user@cassandra.apache.org
Subject: CsvReporter not spitting out metrics in cassandra

Hi,
I have added the following file on my cassandra node: /etc/dse/cassandra/metrics-reporter-config.yaml

csv:
 - outdir: '/mnt/cassandra/metrics'
   period: 10
   timeunit: 'SECONDS'
   predicate:
     color: "white"
     useQualifiedName: true
     patterns:
       - "^org.apache.cassandra.metrics.Cache.+"
       - "^org.apache.cassandra.metrics.ClientRequest.+"
       - "^org.apache.cassandra.metrics.CommitLog.+"
       - "^org.apache.cassandra.metrics.Compaction.+"
       - "^org.apache.cassandra.metrics.DroppedMetrics.+"
       - "^org.apache.cassandra.metrics.ReadRepair.+"
       - "^org.apache.cassandra.metrics.Storage.+"
       - "^org.apache.cassandra.metrics.ThreadPools.+"
       - "^org.apache.cassandra.metrics.ColumnFamily.+"
       - "^org.apache.cassandra.metrics.Streaming.+"

And then added this line to /etc/dse/cassandra/cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=metrics-reporter-config.yaml"

And then finally restarted DSE: /etc/init.d/dse restart

I don't see any CSV metrics files being written out by the reporter in the /mnt/cassandra/metrics folder. Any ideas why?
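A quick way to confirm whether the reporter is emitting anything at all is to look for fresh CSV files under the configured outdir; a minimal sketch, assuming the paths and the 10-second period from the message above:

# Local troubleshooting helper: list CSV files under the configured outdir
# and flag them if they are missing or stale relative to the reporter period.
import glob
import os
import time

OUTDIR = '/mnt/cassandra/metrics'
MAX_AGE_SECONDS = 60   # the configured period is 10s, so files should be fresh

csv_files = glob.glob(os.path.join(OUTDIR, '*.csv'))
if not csv_files:
    print("No CSV files in %s -- check the Cassandra logs for CsvReporter "
          "ERRORs and the permissions on the directory." % OUTDIR)
else:
    now = time.time()
    stale = [f for f in csv_files if now - os.path.getmtime(f) > MAX_AGE_SECONDS]
    print("%d metric files found, %d stale" % (len(csv_files), len(stale)))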
Re: Unexpected high internode network activity
Hmm. From the AWS FAQ: *Q: If I have two instances in different availability zones, how will I be charged for regional data transfer?* Each instance is charged for its data in and data out. Therefore, if data is transferred between these two instances, it is charged out for the first instance and in for the second instance. I really am not seeing this factored into your numbers fully. If data transfer is only twice as much as expected, the above billing would seem to put the numbers in line. Since (I assume) you have one copy in EACH AZ (dc aware but really dc=az) I am not seeing the bandwidth as that much out of line. *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872* On Thu, Feb 25, 2016 at 11:00 PM, Gianluca Borello wrote: > It is indeed very intriguing and I really hope to learn more from the > experience of this mailing list. To address your points: > > - The theory that full data is coming from replicas during reads is not > enough to explain the situation. In my scenario, over a time window I had > 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of > reads (measured on port 9042), so even if both reads and writes affected > all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on > port 7000 unaccounted > > - We are doing regular backups the standard way, using periodic snapshots > and synchronizing them to S3. This traffic is not part of the anomalous > traffic we're seeing above, since this one goes on port 80 and it's clearly > visible with a separate bpf filter, and its magnitude is far lower than > that anyway > > Thanks > > On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle > wrote: > >> Intriguing. It's enough data to look like full data is coming from the >> replicants instead of digests when the read of the copy occurs. Are you >> doing backup/dr? Are directories copied regularly and over the network or ? >> >> >> *...* >> >> >> >> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 >> <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 >> <%28%2B44%29%20%280%29%2020%208144%209872>* >> >> On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello >> wrote: >> >>> Thank you for your reply. >>> >>> To answer your points: >>> >>> - I fully agree on the write volume, in fact my isolated tests confirm >>> your estimation >>> >>> - About the read, I agree as well, but the volume of data is still much >>> higher >>> >>> - I am writing to one single keyspace with RF 3, there's just one >>> keyspace >>> >>> - I am not using any indexes, the column families are very simple >>> >>> - I am aware of the double count, in fact, I measured the traffic on >>> port 9042 at the client side (so just counted once) and I divided by two >>> the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All >>> the measurements have been done with iftop with proper bpf filters on the >>> port and the total traffic matches what I see in cloudwatch (divided by two) >>> >>> So unfortunately I still don't have any ideas about what's going on and >>> why I'm seeing 17 GB of internode traffic instead of ~ 5-6. >>> >>> On Thursday, February 25, 2016, daemeon reiydelle >>> wrote: >>> If read & write at quorum then you write 3 copies of the data then return to the caller; when reading you read one copy (assume it is not on the coordinator), and 1 digest (because read at quorum is 2, not 3). When you insert, how many keyspaces get written to? (Are you using e.g. inverted indices?) 
That is my guess, that your db has about 1.8 bytes written for every byte inserted. Every byte you write is counted also as a read (system a sends 1gb to system b, so system b receives 1gb). You would not be charged if intra AZ, but inter AZ and inter DC will get that double count. So, my guess is reverse indexes, and you forgot to include receive and transmit. *...* *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <%28%2B1%29%20415.501.0198>London (+44) (0) 20 8144 9872 <%28%2B44%29%20%280%29%2020%208144%209872>* On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello wrote: > Hello, > > We have a Cassandra 2.1.9 cluster on EC2 for one of our live > applications. There's a total of 21 nodes across 3 AWS availability zones, > c3.2xlarge instances. > > The configuration is pretty standard, we use the default settings that > come with the datastax AMI and the driver in our application is configured > to use lz4 compression. The keyspace where all the activity happens has RF > 3 and we read and write at quorum to get strong consistency. > > While analyzing our monthly bill, we noticed that the amount of > network traffic related to Cassandra was significantly higher than > expected. After breaking it down by port, it seems like over any given > time, the internode