Disc size for cluster

2017-01-26 Thread Raphael Vogel


Hi

Just want to validate my estimation for a C* cluster which should have around 3 TB of usable storage.

Assuming a RF of 3 and SizeTiered Compaction Strategy.

Is it correct, that SizeTiered Compaction Strategy needs (in the worst case) 50% free disc space during compaction?

 

So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw storage?

 

Thanks and Regards

Raphael Vogel




Re: Disc size for cluster

2017-01-26 Thread Benjamin Roth
Hi!

This is basically right, but:
1. How do you know the 3TB storage will be 3TB on cassandra? This depends
on how the data is serialized and compressed, how often it changes, and
on your compaction settings.
2. 50% free space on STCS is only required if you do a full compaction of a
single CF that takes all the space. Normally you need as much free space as
the target SSTable of a compaction will take. If you split your data across
more CFs, it's unlikely you will really hit this value.

.. probably you should do some tests. But in the end it is always good to
have some headroom. I personally would scale out if free space is < 30% but
that always depends on your model.
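
The arithmetic above can be sketched roughly as follows. This is only an illustration, not a sizing tool: the function name and the on_disk_ratio parameter are hypothetical, and on_disk_ratio stands in for the serialization and compression effects from point 1, which have to be measured rather than guessed.

```python
def raw_capacity_tb(logical_tb, replication_factor, free_fraction, on_disk_ratio=1.0):
    """Estimate the total raw disk (in TB) needed across the cluster.

    logical_tb:         size of one copy of the data
    replication_factor: RF of the keyspace
    free_fraction:      share of disk kept free for compaction, 0 <= f < 1
    on_disk_ratio:      hypothetical factor for how much the data shrinks
                        or grows once stored (1.0 = no change)
    """
    if not 0 <= free_fraction < 1:
        raise ValueError("free_fraction must be in [0, 1)")
    stored_tb = logical_tb * on_disk_ratio * replication_factor
    # Only (1 - free_fraction) of the raw disk may hold data.
    return stored_tb / (1 - free_fraction)


if __name__ == "__main__":
    # Worst case from the question: 50% free space for STCS.
    print(raw_capacity_tb(3, 3, 0.5))
    # Scaling out once free space drops below 30%.
    print(round(raw_capacity_tb(3, 3, 0.3), 1))
```

With 50% kept free this reproduces the 18 TB from the question; with a 30% threshold the same data needs roughly 12.9 TB.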


2017-01-26 9:56 GMT+01:00 Raphael Vogel :

> Hi
> Just want to validate my estimation for a C* cluster which should have
> around 3 TB of usable storage.
> Assuming a RF of 3 and SizeTiered Compaction Strategy.
> Is it correct, that SizeTiered Compaction Strategy needs (in the worst
> case) 50% free disc space during compaction?
>
> So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw
> storage?
>
> Thanks and Regards
> Raphael Vogel
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: Disc size for cluster

2017-01-26 Thread Anuj Wadehra
Adding to what Benjamin said..
It is hard to estimate disk space if you are using STCS for a table where rows 
are updated frequently, leading to a lot of fragmentation. STCS may also lead to 
scenarios where tombstones are not evicted for a long time. You may go live and 
everything goes well for months. Then you gradually realize that large sstables 
are holding on to tombstones because they are not getting compacted. It is not easy 
to test disk space requirements with precision upfront unless you test your 
system with data patterns for some time.
Your life can be much easier if you take care of the following points with 
STCS:
1. If you can afford some extra IO, go for a slightly more aggressive STCS strategy 
using one or more of the following settings: min_threshold=2, 
bucket_high=2, unchecked_tombstone_compaction=true. Which of these to use 
depends on your use case. Study these settings.
2. Estimate the free disk required for compactions at any point in time. 
For example, suppose you have 5 tables with 3 TB of data in total and you estimate 
that the data distribution will be as follows: A: 800 GB, B: 700 GB, C: 600 GB, 
D: 500 GB, E: 400 GB.
If you have concurrent_compactors=3 and 90% of the data of your three largest tables 
is being compacted simultaneously, you will need 90/100*(800+700+600) GB = 1.9 TB 
of free disk space. So you won't need 6 TB of disk for 3 TB of data. Only 4.9 TB would do.
3. Take 10-15% buffer for future schema changes and calculation errors. Better 
safe than sorry :)
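
The estimate in point 2 can be written down as a small sketch. The function name and its parameters are made up for illustration, and the assumption that only the largest tables compact at once, one per compactor, is the simplification used in the example above, not a guarantee about how Cassandra schedules compactions.

```python
def compaction_headroom_gb(table_sizes_gb, concurrent_compactors, compacting_fraction):
    """Free space (GB) needed if the largest tables compact at the same time.

    Assumes one compaction per compactor, each rewriting
    `compacting_fraction` of one of the largest tables.
    """
    largest = sorted(table_sizes_gb.values(), reverse=True)[:concurrent_compactors]
    return compacting_fraction * sum(largest)


if __name__ == "__main__":
    tables = {"A": 800, "B": 700, "C": 600, "D": 500, "E": 400}
    headroom = compaction_headroom_gb(tables, concurrent_compactors=3,
                                      compacting_fraction=0.9)
    total = sum(tables.values()) + headroom
    print(f"headroom: {headroom:.0f} GB, data + headroom: {total:.0f} GB")
```

For the five tables above this gives about 1890 GB of headroom, so about 4890 GB in total before adding the buffer from point 3.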

Thanks
Anuj 
 
On Thu, 26 Jan, 2017 at 2:41 PM, Benjamin Roth wrote:

Hi!

This is basically right, but:
1. How do you know the 3TB storage will be 3TB on cassandra? This depends
how the data is serialized, compressed and how often it changes and it
depends on your compaction settings
2. 50% free space on STCS is only required if you do a full compaction of a
single CF that takes all the space. Normally you need as much free space as
the target SSTable of a compaction will take. If you split your data across
more CFs, its unlikely you really hit this value.

.. probably you should do some tests. But in the end it is always good to
have some headroom. I personally would scale out if free space is < 30% but
that always depends on your model.

2017-01-26 9:56 GMT+01:00 Raphael Vogel :

Hi
Just want to validate my estimation for a C* cluster which should have
around 3 TB of usable storage.
Assuming a RF of 3 and SizeTiered Compaction Strategy.
Is it correct, that SizeTiered Compaction Strategy needs (in the worst
case) 50% free disc space during compaction?

So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw
storage?

Thanks and Regards
Raphael Vogel





Re: Expensive to run nodetool status often?

2017-01-26 Thread Eric Evans
On Wed, Jan 25, 2017 at 11:20 AM, Xiaolei Li  wrote:
> Thanks for the advice!
>
> I do export a lot via JMX already. But I couldn't find the equivalent of the
> Status column (Up/Down + Normal/Leaving/Joining/Moving) from the status
> output. Does anyone know if those are available via JMX?

I've been working on this off and on for a while (adding things as I
have a need for them):

https://github.com/eevans/creole

The idea was to create a high-level, Cassandra-specific abstraction
for JMX.  Mostly it builds representations that are similar to what
nodetool provides, but outputs JSON, either on the command line where
it could be wrapped by a script, or via a REST interface.  There is no
exact equivalent to status just yet, but it would be pretty trivial to
add.  I'm happy to do that (give me a few days), or I'd gladly accept
a pull request.
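
For the wrapping-by-a-script case, a minimal sketch of decoding the two-letter Status column into JSON might look like the following. It is not creole's API: it simply parses text shaped like nodetool status output, the sample text and addresses are invented, and the column layout is assumed to start with the two-letter code followed by the address.

```python
import json
import re

STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}

# A node line starts with the two-letter code, then whitespace, then the address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_status(text):
    """Return one dict per node line found in nodetool-status-like text."""
    nodes = []
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            status, state, address = match.groups()
            nodes.append({"address": address,
                          "status": STATUS[status],
                          "state": STATE[state]})
    return nodes


# Invented sample, shaped like nodetool status output.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID  Rack
UN  10.0.0.1   1.2 GB    256     33.3%  id-1     rack1
DL  10.0.0.2   1.1 GB    256     33.3%  id-2     rack1
"""

if __name__ == "__main__":
    print(json.dumps(parse_status(SAMPLE), indent=2))
```

Header lines such as "Datacenter: dc1" are skipped because their second character is not one of N, L, J or M.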

-- 
Eric Evans
john.eric.ev...@gmail.com


Re: Expensive to run nodetool status often?

2017-01-26 Thread Jonathan Haddad
Very cool!

On Thu, Jan 26, 2017 at 8:53 AM Eric Evans 
wrote:

> On Wed, Jan 25, 2017 at 11:20 AM, Xiaolei Li 
> wrote:
> > Thanks for the advice!
> >
> > I do export a lot via JMX already. But I couldn't find the equivalent of the
> > Status column (Up/Down + Normal/Leaving/Joining/Moving) from the status
> > output. Does anyone know if those are available via JMX?
>
> I've been working on this off and on for a while (adding things as I
> have a need for them):
>
> https://github.com/eevans/creole
>
> The idea was to create a high-level, Cassandra-specific abstraction
> for JMX.  Mostly it builds representations that are similar to what
> nodetool provides, but outputs JSON, either on the command line where
> it could be wrapped by a script, or via a REST interface.  There is no
> exact equivalent to status just yet, but it would be pretty trivial to
> add.  I'm happy to do that (give me a few days), or I'd gladly accept
> a pull request.
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
>


Cassandra ad hoc search options

2017-01-26 Thread Yu, John
Hi All,

Hope I can get some help here. We're using Cassandra for services, and recently 
we've been adding UI support.
With Cassandra, what are the options for ad hoc query/search similar to an RDBMS? 
We love the features of Cassandra, but it seems to be a known "weakness" that it 
doesn't come with strong support for indexing and ad hoc queries. There has been 
some recent development with SASI as part of secondary indexes. However, I heard 
in a video that it should not be used extensively.

Does anyone have much experience with SASI? How does it compare to the Lucene plugin?
What is the direction of Apache Cassandra in the search area?

We're also looking into Solr or ElasticSearch integration, but it seems it 
might take more effort, and possibly involve data duplication.
For Solr, we don't have DSE.
Sorry if this has been asked before, but I haven't seen a more complete answer.

Thanks!
John

NOTICE OF CONFIDENTIALITY:
This message may contain information that is considered confidential and which 
may be prohibited from disclosure under applicable law or by contractual 
agreement. The information is intended solely for the use of the individual or 
entity named above. If you are not the intended recipient, you are hereby 
notified that any disclosure, copying, distribution or use of the information 
contained in or attached to this message is strictly prohibited. If you have 
received this email transmission in error, please notify the sender by replying 
to this email and then delete it from your system.


Re: Cassandra ad hoc search options

2017-01-26 Thread Jonathan Haddad
> With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS?

Your best options are Spark w/ the DataStax connector or Presto.  Cassandra
isn't built for ad-hoc queries so you need to use other tools to make it
work.

On Thu, Jan 26, 2017 at 2:22 PM Yu, John  wrote:

> Hi All,
>
>
>
> Hope I can get some help here. We’re using Cassandra for services, and
> recently we’re adding UI support.
>
> With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS? We love the features of Cassandra but it seems it’s a known
> “weakness” that it doesn’t come with strong support of indexing and ad hoc
> queries. There’re some recent development with SASI as part of secondary
> index. However I heard from a video where it says it shall not be
> extensively used.
>
>
>
> Has anyone have much experience with SASI? How does it compare to Lucene
> plugin?
>
> What is the direction of Apache Cassandra in the search area?
>
>
>
> We’re also looking into Solr or ElasticSearch integration, but it seems it
> might take more efforts, and possibly involve data duplication.
>
> For Solr, we don’t have DSE.
>
> Sorry if this has been asked before, but I haven’t seen a more complete
> answer.
>
>
>
> Thanks!
>
> John


Re: Expensive to run nodetool status often?

2017-01-26 Thread Xiaolei Li
Nice! Will take a look.

Best,
x.

On Thu, Jan 26, 2017 at 10:30 AM, Jonathan Haddad  wrote:

> Very cool!
>
> On Thu, Jan 26, 2017 at 8:53 AM Eric Evans 
> wrote:
>
>> On Wed, Jan 25, 2017 at 11:20 AM, Xiaolei Li 
>> wrote:
>> > Thanks for the advice!
>> >
>> > I do export a lot via JMX already. But I couldn't find the equivalent of the
>> > Status column (Up/Down + Normal/Leaving/Joining/Moving) from the status
>> > output. Does anyone know if those are available via JMX?
>>
>> I've been working on this off and on for a while (adding things as I
>> have a need for them):
>>
>> https://github.com/eevans/creole
>>
>> The idea was to create a high-level, Cassandra-specific abstraction
>> for JMX.  Mostly it builds representations that are similar to what
>> nodetool provides, but outputs JSON, either on the command line where
>> it could be wrapped by a script, or via a REST interface.  There is no
>> exact equivalent to status just yet, but it would be pretty trivial to
>> add.  I'm happy to do that (give me a few days), or I'd gladly accept
>> a pull request.
>>
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
>>
>


RE: [External] Re: Cassandra ad hoc search options

2017-01-26 Thread Yu, John
Thanks a lot. Mind sharing a couple of points where you feel it’s better than 
the alternatives?

Regards,
John

From: Jonathan Haddad [mailto:j...@jonhaddad.com]
Sent: Thursday, January 26, 2017 2:33 PM
To: user@cassandra.apache.org
Subject: [External] Re: Cassandra ad hoc search options

> With Cassandra, what are the options for ad hoc query/search similar to RDBMS?

Your best options are Spark w/ the DataStax connector or Presto.  Cassandra 
isn't built for ad-hoc queries so you need to use other tools to make it work.

On Thu, Jan 26, 2017 at 2:22 PM Yu, John <john...@sandc.com> wrote:
Hi All,

Hope I can get some help here. We’re using Cassandra for services, and recently 
we’re adding UI support.
With Cassandra, what are the options for ad hoc query/search similar to RDBMS? 
We love the features of Cassandra but it seems it’s a known “weakness” that it 
doesn’t come with strong support of indexing and ad hoc queries. There’re some 
recent development with SASI as part of secondary index. However I heard from a 
video where it says it shall not be extensively used.

Has anyone have much experience with SASI? How does it compare to Lucene plugin?
What is the direction of Apache Cassandra in the search area?

We’re also looking into Solr or ElasticSearch integration, but it seems it 
might take more efforts, and possibly involve data duplication.
For Solr, we don’t have DSE.
Sorry if this has been asked before, but I haven’t seen a more complete answer.

Thanks!
John
