Disc size for cluster
Hi,

I just want to validate my estimate for a C* cluster that should have around 3 TB of usable storage, assuming an RF of 3 and SizeTieredCompactionStrategy (STCS).

Is it correct that STCS needs (in the worst case) 50% free disc space during compaction?

If so, would this result in a cluster with 3 TB x 3 x 2 = 18 TB of raw storage?

Thanks and regards,
Raphael Vogel
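The estimate in the question can be written out as a small calculation. This is only a sketch of the worst-case arithmetic as stated; the function name `raw_storage_tb` and the `free_space_multiplier` parameter are illustrative and not part of any Cassandra tooling.

```python
def raw_storage_tb(usable_tb, replication_factor, free_space_multiplier=2.0):
    """Worst-case raw capacity for the whole cluster, in TB.

    Assumption from the question: STCS may need 50% of the disk free,
    which doubles the space required (multiplier of 2.0).
    """
    return usable_tb * replication_factor * free_space_multiplier


if __name__ == "__main__":
    # 3 TB usable x RF 3 x 2 (50% free) -> 18.0
    print(raw_storage_tb(3, 3))
```

Lowering `free_space_multiplier` shows how much the total depends on the worst-case compaction assumption.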
Re: Disc size for cluster
Hi!

This is basically right, but:

1. How do you know that 3 TB of data will occupy 3 TB in Cassandra? That depends on how the data is serialized and compressed, how often it changes, and your compaction settings.
2. 50% free space with STCS is only required if you run a full compaction of a single CF that takes up all the space. Normally you need as much free space as the target SSTable of a compaction will take. If you split your data across more CFs, it's unlikely that you will really hit this value.

You should probably run some tests. In the end, it is always good to have some headroom. I personally would scale out if free space is < 30%, but that always depends on your model.

2017-01-26 9:56 GMT+01:00 Raphael Vogel:
> Hi
> Just want to validate my estimation for a C* cluster which should have
> around 3 TB of usable storage.
> Assuming a RF of 3 and SizeTiered Compaction Strategy.
> Is it correct, that SizeTiered Compaction Strategy needs (in the worst
> case) 50% free disc space during compaction?
>
> So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw
> storage?
>
> Thanks and Regards
> Raphael Vogel

--
Benjamin Roth
Prokurist
Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
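Point 2 and the 30% rule of thumb can be sketched roughly as follows. The sketch assumes, as a simplification, that the SSTable written by a compaction is never larger than the sum of its input SSTables. The function names, the sample sizes and the checked path are hypothetical.

```python
import shutil


def compaction_headroom_gb(input_sstables_gb):
    """Upper bound on the space one compaction needs.

    Simplifying assumption: the target SSTable is at most as large as
    the sum of its inputs (overwrites and tombstones only shrink it).
    """
    return sum(input_sstables_gb)


def should_scale_out(free_bytes, total_bytes, min_free_ratio=0.30):
    """True when the free-space ratio drops below the chosen threshold."""
    return free_bytes / total_bytes < min_free_ratio


if __name__ == "__main__":
    # Four similarly sized SSTables picked for one STCS compaction.
    print(compaction_headroom_gb([40, 45, 50, 55]))
    # "." is a placeholder; point this at the Cassandra data directory.
    usage = shutil.disk_usage(".")
    print(should_scale_out(usage.free, usage.total))
```

The 30% default only mirrors the personal preference stated in the message; the right threshold depends on the data model.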
Re: Disc size for cluster
Adding to what Benjamin said:

It is hard to estimate disk space if you are using STCS for a table whose rows are updated frequently, because that leads to a lot of fragmentation. STCS can also lead to situations where tombstones are not evicted for a long time. You may go live and everything goes well for months; then you gradually realize that large SSTables are holding on to tombstones because they are not getting compacted. It is not easy to determine disk space requirements precisely up front unless you test your system with realistic data patterns for some time.

Your life can be much easier if you take care of the following points with STCS:

1. If you can afford some extra IO, go for a slightly more aggressive STCS configuration using one or more of the following settings: min_threshold=2, bucket_high=2, unchecked_tombstone_compaction=true. Which of these to use depends on your use case, so study these settings.
2. Estimate the free disk space required for compactions at any point in time. For example, suppose you have 5 tables with 3 TB of data in total, and you estimate that the data will be distributed as follows: A: 800 GB, B: 700 GB, C: 600 GB, D: 500 GB, E: 400 GB. If you have concurrent_compactors=3 and 90% of the data in your three largest tables is being compacted simultaneously, you will need 90/100 * (800 + 700 + 600) GB = 1890 GB, i.e. about 1.9 TB of free disk space. So you won't need 6 TB of disk for 3 TB of data; 4.9 TB would do.
3. Add a 10-15% buffer for future schema changes and calculation errors. Better safe than sorry :)

Thanks
Anuj

On Thu, 26 Jan, 2017 at 2:41 PM, Benjamin Roth wrote:
> Hi!
>
> This is basically right, but:
> 1. How do you know the 3TB storage will be 3TB on cassandra? This depends
> how the data is serialized, compressed and how often it changes and it
> depends on your compaction settings
> 2. 50% free space on STCS is only required if you do a full compaction of a
> single CF that takes all the space. Normally you need as much free space as
> the target SSTable of a compaction will take. If you split your data across
> more CFs, its unlikely you really hit this value.
>
> .. probably you should do some tests. But in the end it is always good to
> have some headroom. I personally would scale out if free space is < 30% but
> that always depends on your model.
>
> 2017-01-26 9:56 GMT+01:00 Raphael Vogel:
> > Hi
> > Just want to validate my estimation for a C* cluster which should have
> > around 3 TB of usable storage.
> > Assuming a RF of 3 and SizeTiered Compaction Strategy.
> > Is it correct, that SizeTiered Compaction Strategy needs (in the worst
> > case) 50% free disc space during compaction?
> >
> > So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw
> > storage?
> >
> > Thanks and Regards
> > Raphael Vogel
>
> --
> Benjamin Roth
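The worked example in point 2 can be reproduced with a few lines of Python. This is a sketch of that arithmetic only: the function name and its parameters are made up for illustration, and it assumes the tables being compacted at the same time are the `concurrent_compactors` largest ones.

```python
def stcs_free_space_gb(table_sizes_gb, concurrent_compactors=3,
                       compacting_fraction=0.9):
    """Free space needed if the N largest tables compact at once.

    Assumption: each of the N largest tables rewrites
    `compacting_fraction` of its data at the same time.
    """
    largest = sorted(table_sizes_gb.values(), reverse=True)
    return compacting_fraction * sum(largest[:concurrent_compactors])


if __name__ == "__main__":
    # Table sizes in GB, taken from the example in the message.
    tables = {"A": 800, "B": 700, "C": 600, "D": 500, "E": 400}
    free_gb = stcs_free_space_gb(tables)
    total_gb = sum(tables.values()) + free_gb
    print(f"free space needed: about {free_gb:.0f} GB")
    print(f"total disk needed: about {total_gb:.0f} GB")
```

The 10-15% buffer from point 3 would be added on top of the total this computes.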
Re: Expensive to run nodetool status often?
On Wed, Jan 25, 2017 at 11:20 AM, Xiaolei Li wrote:
> Thanks for the advice!
>
> I do export a lot via JMX already. But I couldn't find the equivalent of the
> Status column (Up/Down + Normal/Leaving/Joining/Moving) from the status
> output. Does anyone know if those are available via JMX?

I've been working on this off and on for a while (adding things as I have a need for them):

https://github.com/eevans/creole

The idea was to create a high-level, Cassandra-specific abstraction for JMX. Mostly it builds representations that are similar to what nodetool provides, but outputs JSON, either on the command line where it could be wrapped by a script, or via a REST interface. There is no exact equivalent to status just yet, but it would be pretty trivial to add. I'm happy to do that (give me a few days), or I'd gladly accept a pull request.

--
Eric Evans
john.eric.ev...@gmail.com
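For scripts that still wrap nodetool, the two-letter Status column (Up/Down plus Normal/Leaving/Joining/Moving) can be decoded with a small parser. This is a minimal sketch that reads the text output of `nodetool status`; it does not use JMX or creole. The sample text is hand-written for illustration, and the exact column layout may differ between Cassandra versions.

```python
import re
import subprocess

STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}
# A node line starts with the two-letter code, followed by the address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_status(text):
    """Map each node address to a (status, state) tuple."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            status, state, address = match.groups()
            nodes[address] = (STATUS[status], STATE[state])
    return nodes


def current_status():
    """Run `nodetool status` (must be on PATH) and parse its output."""
    result = subprocess.run(["nodetool", "status"], capture_output=True,
                            text=True, check=True)
    return parse_status(result.stdout)


# Hand-written sample for illustration; not captured from a real cluster.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB   256     ?     host-a   rack1
DL  10.0.0.2   1.1 GiB   256     ?     host-b   rack1
"""

if __name__ == "__main__":
    for address, (status, state) in parse_status(SAMPLE).items():
        print(address, status, state)
```

Each call to `current_status()` starts a new JVM through nodetool, so this approach suits occasional checks more than frequent polling.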
Re: Expensive to run nodetool status often?
Very cool!

On Thu, Jan 26, 2017 at 8:53 AM Eric Evans wrote:
> On Wed, Jan 25, 2017 at 11:20 AM, Xiaolei Li wrote:
> > Thanks for the advice!
> >
> > I do export a lot via JMX already. But I couldn't find the equivalent of the
> > Status column (Up/Down + Normal/Leaving/Joining/Moving) from the status
> > output. Does anyone know if those are available via JMX?
>
> I've been working on this off and on for a while (adding things as I
> have a need for them):
>
> https://github.com/eevans/creole
>
> The idea was to create a high-level, Cassandra-specific abstraction
> for JMX. Mostly it builds representations that are similar to what
> nodetool provides, but outputs JSON, either on the command line where
> it could be wrapped by a script, or via a REST interface. There is no
> exact equivalent to status just yet, but it would be pretty trivial to
> add. I'm happy to do that (give me a few days), or I'd gladly accept
> a pull request.
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
Cassandra ad hoc search options
Hi All,

I hope I can get some help here. We're using Cassandra for services, and recently we started adding UI support.

With Cassandra, what are the options for ad hoc query/search similar to an RDBMS? We love the features of Cassandra, but it seems to be a known "weakness" that it doesn't come with strong support for indexing and ad hoc queries. There has been some recent development with SASI as part of secondary indexes. However, I heard in a video that it should not be used extensively.

Does anyone have much experience with SASI? How does it compare to the Lucene plugin?
What is the direction of Apache Cassandra in the search area?

We're also looking into Solr or ElasticSearch integration, but it seems that might take more effort and possibly involve data duplication.
For Solr, we don't have DSE.

Sorry if this has been asked before, but I haven't seen a more complete answer.

Thanks!
John

NOTICE OF CONFIDENTIALITY: This message may contain information that is considered confidential and which may be prohibited from disclosure under applicable law or by contractual agreement. The information is intended solely for the use of the individual or entity named above. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the information contained in or attached to this message is strictly prohibited. If you have received this email transmission in error, please notify the sender by replying to this email and then delete it from your system.
Re: Cassandra ad hoc search options
> With Cassandra, what are the options for ad hoc query/search similar to RDBMS?

Your best options are Spark w/ the DataStax connector or Presto. Cassandra isn't built for ad-hoc queries so you need to use other tools to make it work.

On Thu, Jan 26, 2017 at 2:22 PM Yu, John wrote:
> Hi All,
>
> Hope I can get some help here. We’re using Cassandra for services, and
> recently we’re adding UI support.
>
> With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS? We love the features of Cassandra but it seems it’s a known
> “weakness” that it doesn’t come with strong support of indexing and ad hoc
> queries. There’re some recent development with SASI as part of secondary
> index. However I heard from a video where it says it shall not be
> extensively used.
>
> Has anyone have much experience with SASI? How does it compare to Lucene
> plugin?
>
> What is the direction of Apache Cassandra in the search area?
>
> We’re also looking into Solr or ElasticSearch integration, but it seems it
> might take more efforts, and possibly involve data duplication.
>
> For Solr, we don’t have DSE.
>
> Sorry if this has been asked before, but I haven’t seen a more complete
> answer.
>
> Thanks!
>
> John
Re: Expensive to run nodetool status often?
Nice! Will take a look.

Best,
x.

On Thu, Jan 26, 2017 at 10:30 AM, Jonathan Haddad wrote:
> Very cool!
>
> On Thu, Jan 26, 2017 at 8:53 AM Eric Evans wrote:
>> On Wed, Jan 25, 2017 at 11:20 AM, Xiaolei Li wrote:
>> > Thanks for the advice!
>> >
>> > I do export a lot via JMX already. But I couldn't find the equivalent of the
>> > Status column (Up/Down + Normal/Leaving/Joining/Moving) from the status
>> > output. Does anyone know if those are available via JMX?
>>
>> I've been working on this off and on for a while (adding things as I
>> have a need for them):
>>
>> https://github.com/eevans/creole
>>
>> The idea was to create a high-level, Cassandra-specific abstraction
>> for JMX. Mostly it builds representations that are similar to what
>> nodetool provides, but outputs JSON, either on the command line where
>> it could be wrapped by a script, or via a REST interface. There is no
>> exact equivalent to status just yet, but it would be pretty trivial to
>> add. I'm happy to do that (give me a few days), or I'd gladly accept
>> a pull request.
>>
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
RE: [External] Re: Cassandra ad hoc search options
Thanks a lot. Would you mind sharing a couple of points where you feel it’s better than the alternatives?

Regards,
John

From: Jonathan Haddad [mailto:j...@jonhaddad.com]
Sent: Thursday, January 26, 2017 2:33 PM
To: user@cassandra.apache.org
Subject: [External] Re: Cassandra ad hoc search options

> With Cassandra, what are the options for ad hoc query/search similar to RDBMS?

Your best options are Spark w/ the DataStax connector or Presto. Cassandra isn't built for ad-hoc queries so you need to use other tools to make it work.

On Thu, Jan 26, 2017 at 2:22 PM Yu, John <john...@sandc.com> wrote:

Hi All,

Hope I can get some help here. We’re using Cassandra for services, and recently we’re adding UI support.

With Cassandra, what are the options for ad hoc query/search similar to RDBMS? We love the features of Cassandra but it seems it’s a known “weakness” that it doesn’t come with strong support of indexing and ad hoc queries. There’re some recent development with SASI as part of secondary index. However I heard from a video where it says it shall not be extensively used.

Has anyone have much experience with SASI? How does it compare to Lucene plugin?
What is the direction of Apache Cassandra in the search area?

We’re also looking into Solr or ElasticSearch integration, but it seems it might take more efforts, and possibly involve data duplication.
For Solr, we don’t have DSE.

Sorry if this has been asked before, but I haven’t seen a more complete answer.

Thanks!
John