RE: Questions on the count and multiple index behaviour in cassandra

Durity, Sean R via user Thu, 29 Sep 2022 08:27:39 -0700

Aggregate queries (like count(*) ) are fine *within* a reasonably sized 
partition (under 100 MB in size). However, Cassandra  is not the right tool if 
you want to do aggregate queries *across* partitions (unless you break up the 
work with something like Spark). Choosing the right partition key and values IS 
the goal of Cassandra data modeling. (Clustering keys are used for ordering 
data within a partition.)


Good:
Select count(*) from my_table where my_partition_key = ‘1’; --and the partition 
is 100 MB or less

Not good:
Select count(*) from my_table;

Are the counts the actual workload or just a measure of the completion of the 
load? You want to model the data to satisfy the queries for the workload. 
Queries should be very simple; getting the data model right is the hard work. 
Make sure that Cassandra fits the use case you have.


Sean R. Durity
DB Solutions
Staff Systems Engineer – Cassandra

From: Karthik K <mailidofkarthike...@gmail.com>
Sent: Wednesday, September 28, 2022 8:48 AM
To: user@cassandra.apache.org
Cc: rsesha...@altimetrik.com
Subject: [EXTERNAL] Re: Questions on the count and multiple index behaviour in 
cassandra

Hi Stéphane Alleaume, Thanks for your quick response. I have attached the Table 
stats by running the nodetool cfstats command to get to the size. If I am 
correct, the partition size must be 464 Mb. However, when I exported the data 
as csv the
ZjQcmQRYFpfptBannerStart
This Message Is From an Untrusted Sender
You have not previously corresponded with this sender.
    Report Suspicious  
<https://us-phishalarm-ewt.proofpoint.com/EWT/v1/M-nmYVHPHQ!GBuqOYqUdJskCVZpWpnorse_9wgMi29zj7R6P8-PPCNyOOWTjk-b2FIZI0VWXRdm0QtXilc7Rlklb2v4lumMt1R2Wj3x082TS8QFuvu5axHnPiPMC7WyoFT1rA7JbZmLqxU$>
   ‌
ZjQcmQRYFpfptBannerEnd
Hi Stéphane Alleaume,
Thanks for your quick response. I have attached the Table stats by running the 
nodetool cfstats command to get to the size.
If I am correct, the partition size must be 464 Mb. However, when I exported 
the data as csv the size was 1510 Mb.

1) If we segment this 464Mb data into more partitions, say, with each partition 
sizing <100Mb, Will the count(*) query work effectively?
2) What will be the approximate response in seconds if we run a select count(*) 
against 1 million records?
3) Though, Elasticsearch is a future option, we want to dig more with cassandra 
to achieve this. Do we have any work around using data modelling ?

Thanks & Regards,
Karthikeyan K

On Wed, Sep 28, 2022 at 5:31 PM Stéphane Alleaume 
<crystallo...@gmail.com<mailto:crystallo...@gmail.com>> wrote:
Hi

1) how much size in Mo is your partition ? Should be less than 100 Mo (but less 
in fact)

2) could you plug an Elasticsearch or Solr search in front  ?

Kind regards
Stephane





Le mer. 28 sept. 2022, 13:46, Karthik K 
<mailidofkarthike...@gmail.com<mailto:mailidofkarthike...@gmail.com>> a écrit :
Hi,

We have two doubts on cassandra 3.11 features:

1) Need to get counts of row from a cassandra table.
We have 3 node clusters with Apache Cassandra 3.11 version.

We loaded a table in cassandra with 9lakh records. We have around 91 columns in 
this table. Most of the records have text as datatype.
All these 9lakh records were part of a single partition key.

When we tried a select count(*) query with that partition key, the query was 
timing out.

However, we were able to retrieve counts through multiple calls by fetching only
1 lakh records in each call. The only disadvantage here is the time taken which
is around 1minute and 3 seconds.

Is there any other approach to get the row count faster in cassandra? Do we 
need to '
change the data modelling approach to achieve this? Suggestions are welcome


2) How to data model in cassandra to support usage of multiple filters.
 We may also need the count of rows for this multiple filter query.

Thanks & Regards,
Karthikeyan


INTERNAL USE

________________________________

The information in this Internet Email is confidential and may be legally 
privileged. It is intended solely for the addressee. Access to this Email by 
anyone else is unauthorized. If you are not the intended recipient, any 
disclosure, copying, distribution or any action taken or omitted to be taken in 
reliance on it, is prohibited and may be unlawful. When addressed to our 
clients any opinions or advice contained in this Email are subject to the terms 
and conditions expressed in any applicable governing The Home Depot terms of 
business or client engagement letter. The Home Depot disclaims all 
responsibility and liability for the accuracy and content of this attachment 
and for any damages or losses arising from any inaccuracies, errors, viruses, 
e.g., worms, trojan horses, etc., or other items of a destructive nature, which 
may be contained in this attachment and shall not be liable for direct, 
indirect, consequential or special damages in connection with this e-mail 
message or its attachment.

RE: Questions on the count and multiple index behaviour in cassandra

Reply via email to