Hi Asad,

That’s because of the way Spark works. Essentially, when you execute a Spark 
job, it pulls the full content of the datastore (Cassandra in your case) into 
its RDDs and works with it “in memory”. While Spark uses “data locality” to 
read data from the nodes that have the required data on their local disks, it’s 
still reading all data from the Cassandra tables. To do so it sends a ‘select * 
from Table ALLOW FILTERING’ query to Cassandra.
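
For example, with the DataStax spark-cassandra-connector the RDD is filled 
roughly like this (just a sketch; the keyspace/table names are illustrative and 
the exact generated CQL varies by connector version):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Assumes spark.cassandra.connection.host is already set in the Spark config.
val conf = new SparkConf().setAppName("full-table-scan-example")
val sc = new SparkContext(conf)

// The connector splits ks.my_table into token ranges and, for each range,
// issues a CQL query along the lines of:
//   SELECT * FROM ks.my_table
//   WHERE token(pk) > ? AND token(pk) <= ? ALLOW FILTERING
// so the entire table is streamed into the RDD partitions.
val rdd = sc.cassandraTable("ks", "my_table")
println(rdd.count())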

From Spark you don’t have much control over the initial query that fills the 
RDDs; sometimes you’ll read the whole table even if you only need one row.
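
For instance, a plain Spark-side filter does not change what is read from 
Cassandra (again a sketch with illustrative names, reusing the SparkContext 
from above):

// The closure below runs in Spark, not in Cassandra: the connector still
// scans the full table, and the unwanted rows are dropped only afterwards.
val errors = sc.cassandraTable("ks", "my_table")
  .filter(row => row.getString("status") == "ERROR")
println(errors.count())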

Regards,
Jacques-Henri Berthemet

From: "ZAIDI, ASAD A" <az1...@att.com>
Reply to: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday 25 July 2019 at 15:49
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Performance impact with ALLOW FILTERING clause.

Hello Folks,

I was going through the documentation and saw in many places that ALLOW 
FILTERING causes performance unpredictability. Our developers say the ALLOW 
FILTERING clause is implicitly added to a bunch of queries by the 
spark-Cassandra connector and they cannot control it; at the same time we see 
unpredictability in application performance, just as the documentation says.

I’m trying to understand why a connector would add a clause to a query when it 
can negatively impact database/application performance. Is it the data model 
that drives the connector’s decision to add ALLOW FILTERING to the query 
automatically, or are there other reasons this clause is added? I’m not a 
developer, but I want to know why developers don’t have any control over this.

I’d appreciate your guidance here.

Thanks
Asad
