Re: Unexplainably large reported partition sizes

2016-03-05 Thread Tom van den Berge
I don't think compression can be the cause of the difference, for two
reasons:

1) The partition size I calculated myself (3 MB) is the uncompressed size,
and so is the reported size (2.3 GB)

2) The difference is simply way too big to be explained by compression,
even if the calculated size had been the compressed size. The compression
ratio would then be 0.125% of the original, which is not realistic. In the
logs, I can see that the typical compression achieved for this table is
around 80% of the original.
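
A quick arithmetic sketch (Python, using only the numbers already in this
thread) makes the implied ratio explicit:

    # Implied compression ratio if my ~3 MB estimate were the compressed
    # form of the reported "Compacted partition maximum bytes".
    rows = 50_000                  # approximate rows in the partition
    row_size = 62                  # approximate bytes per row
    estimated = rows * row_size    # ~3.1 MB
    reported = 2_395_318_855       # from nodetool cfstats, ~2.4 GB

    print(f"estimated size: {estimated / 1e6:.1f} MB")
    print(f"reported max:   {reported / 1e9:.2f} GB")
    print(f"implied ratio:  {estimated / reported:.3%}")   # ~0.13%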

Tom

On Fri, Mar 4, 2016 at 9:48 PM, Robert Coli  wrote:

> On Fri, Mar 4, 2016 at 5:56 AM, Tom van den Berge 
> wrote:
>
>>  Compacting large partition
>> drillster/subscriberstats:rqtPewK-1chi0JSO595u-Q (1,470,058,292 bytes)
>>
>> This means that this single partition is about 1.4 GB large. This is much
>> larger than it can possibly be, for two reasons:
>>   1) the partition has appr. 50K rows, each roughly 62 bytes = ~3 MB
>>   2) the entire table consumes appr. 500MB of disk space on the node
>> containing the partition (including snapshots)
>>
>> Furthermore, nodetool cfstats tells me this:
>> Space used (live): 253,928,111
>> Space used (total): 253,928,111
>> Compacted partition maximum bytes: 2,395,318,855
>> The space used seems to match the actual size (excl. snapshots), but the
>> Compacted partition maximum bytes (2.3 GB) seems far higher than is
>> possible. Does anyone know how it is possible that Cassandra reports such
>> unlikely sizes?
>>
>
> Compression is enabled by default, and compaction reports the uncompressed
> size.
>
> =Rob
>
>



-- 
Tom van den Berge
Lead Software Engineer

Drillster

Middenburcht 136
3452 MT Vleuten
Netherlands +31 30 755 53 30
www.drillster.com


Re: Unexplainably large reported partition sizes

2016-03-05 Thread DuyHai Doan
Maybe tombstones? Do you issue a lot of DELETE statements? Or do you
re-insert into the same partition with different TTL values?
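
One hedged way to check the tombstone angle: the sstablemetadata tool that
ships with Cassandra prints an "Estimated droppable tombstones" line per
SSTable. A small sketch (the data directory path is an assumption; adjust
to your installation's layout):

    import glob
    import re
    import subprocess

    # Hypothetical data directory for the table from this thread.
    DATA_GLOB = "/var/lib/cassandra/data/drillster/subscriberstats-*/*-Data.db"

    for sstable in sorted(glob.glob(DATA_GLOB)):
        # sstablemetadata is bundled with Cassandra (tools/bin).
        out = subprocess.run(["sstablemetadata", sstable],
                             capture_output=True, text=True).stdout
        match = re.search(r"Estimated droppable tombstones:\s*([\d.]+)", out)
        if match:
            print(f"{sstable}: {float(match.group(1)):.1%} droppable")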

On Sat, Mar 5, 2016 at 7:16 PM, Tom van den Berge  wrote:



Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

2016-03-05 Thread Bhuvan Rawal
Thanks Sean and Nirmallaya.

@Jack, we are going with DSC right now and plan to use Spark, and later
Solr, over the analytics DC. The use case is to keep the OLAP and OLTP
workloads separated and not intertwine them, whether that is achieved by
creating a new DC or a new cluster altogether. From Nirmallaya's and Sean's
answers I understand that this is easily achievable by creating a separate
DC; the app client will need to be made DC-aware so that it never uses a
node in DC3 as a coordinator. The same goes for the Spark configuration: it
should read from the third DC. Correct me if I'm wrong.
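
A minimal sketch of that client setup with the DataStax Python driver (the
host addresses and DC names below are made up):

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Pin the OLTP application to its own DC; with local_dc set, the driver
    # never picks a node in the analytics DC (DC3) as a coordinator.
    cluster = Cluster(
        contact_points=["10.0.1.1", "10.0.1.2"],  # OLTP DC nodes (hypothetical)
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="DC1")),
    )
    session = cluster.connect()

    # The Spark side is the mirror image: point spark-cassandra-connector at
    # the analytics DC, e.g. spark.cassandra.connection.host=<DC3 node> and
    # spark.cassandra.connection.local_dc=DC3 in the Spark configuration
    # (verify the property name against your connector version).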

On Mar 4, 2016 7:55 PM, "Jack Krupansky"  wrote:
>
> DataStax Enterprise (DSE) should be fine for three or even four data
centers in the same cluster. Or are you talking about some custom Solr
implementation?
>
> -- Jack Krupansky
>
> On Fri, Mar 4, 2016 at 9:21 AM,  wrote:
>>
>> Sure. Just add a new DC. Alter your keyspaces with a new replication
factor for that DC. Run repairs on the new DC to get the data streamed.
Then make sure your clients only connect to the DC(s) that they need.
>>
>>
>>
>> Separation of workloads is one of the key powers of a Cassandra cluster.
>>
>>
>>
>> You may want to look at different configurations for the analytics
cluster – smaller replication factor, more memory per node, more disk per
node, perhaps fewer vnodes. Others may chime in with their experience.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> From: Bhuvan Rawal [mailto:bhu1ra...@gmail.com]
>> Sent: Friday, March 04, 2016 3:27 AM
>> To: user@cassandra.apache.org
>> Subject: How to create an additional cluster in Cassandra exclusively
for Analytics Purpose
>>
>>
>>
>> Hi,
>>
>>
>>
>> We would like to create an additional C* data center for batch
processing using Spark on CFS. We would like to limit this DC exclusively
to Spark operations, and have the application servers continue fetching
data from the OLTP DC.
>>
>>
>>
>> Is there any way to configure the same?
>>
>>
>>
>>
>>
>> Regards,
>>
>> Bhuvan
>>
>
>
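
As a hedged sketch of the keyspace change Sean describes above (the
keyspace name, DC names, and replication factors are hypothetical):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.1.1"]).connect()

    # Give the new analytics DC its own replication factor; the existing DC
    # keeps its current one.
    session.execute("""
        ALTER KEYSPACE myks WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'DC1': 3,
            'DC3': 2
        }
    """)
    # Then stream the existing data to the new DC's nodes, for example with
    # `nodetool rebuild -- DC1` on each new node, or with repairs as Sean
    # suggests.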


Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

2016-03-05 Thread Jack Krupansky
You haven't been clear about how you intend to add Solr. You can also use
Stratio or Stargate for basic Lucene search if you don't need full Solr
support and want to stick to open source rather than go with DSE Search
for Solr.
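
For what it's worth, Stratio's plugin registers as a CQL custom index; a
hypothetical sketch (table, column, and schema options are invented; check
the cassandra-lucene-index docs for the real option set):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.3.1"]).connect("myks")

    # Stratio's cassandra-lucene-index plugs in as a custom index class.
    # The schema JSON below is illustrative only.
    session.execute("""
        CREATE CUSTOM INDEX IF NOT EXISTS events_lucene_idx ON events ()
        USING 'com.stratio.cassandra.lucene.Index'
        WITH OPTIONS = {
            'refresh_seconds': '1',
            'schema': '{ "fields": { "body": { "type": "text" } } }'
        }
    """)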

-- Jack Krupansky

On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal  wrote:



Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

2016-03-05 Thread Bhuvan Rawal
Yes Jack, we are rolling out with Stratio right now; we will assess the
performance benefit it yields and can go for Elasticsearch/Solr later.

In your experience, how does Stratio perform vis-à-vis secondary indexes?

On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky 
wrote:
