If you're going to rewrite your data to be more performant for analytics anyway, you might as well go the distance and put it in an analytics-friendly format like Parquet. My 2 cents.
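If you go that route, a single Spark pass does the rewrite. A minimal sketch in Scala (the keyspace, table, and output path are made up, and it assumes the spark-cassandra-connector is on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cassandra-to-parquet")
      .config("spark.cassandra.connection.host", "127.0.0.1") // your contact point
      .getOrCreate()

    // Read the table through the connector's DataFrame source...
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events")) // hypothetical names
      .load()

    // ...and rewrite it once into a columnar, scan-friendly layout.
    df.write.mode("overwrite").parquet("hdfs:///warehouse/events_parquet")

Parquet's columnar layout is what makes the later analytics queries cheap: scans read only the columns they project instead of whole rows.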
On Thu, Jul 25, 2019 at 11:01 AM ZAIDI, ASAD A <az1...@att.com> wrote:

> Thank you all for your insights.
>
> When the spark-connector adds ALLOW FILTERING to a query, it makes the
> query just 'run', whether that is expensive on a large table or not so
> expensive on a table with fewer rows. In my particular case, nodes are
> reaching 2 TB per node in a 50-node cluster. When a bunch of such queries
> run, they impact server resources.
>
> Since ALLOW FILTERING is an expensive operation, I'm trying to find knobs
> which, if I turn them, mitigate the impact.
>
> What I think (correct me if I am wrong) is that it is the query design
> itself, not optimized for the table design, that in turn causes the
> connector to add ALLOW FILTERING implicitly. I'm not planning to add
> secondary indexes on the tables because they have their own overheads.
> Kindly share if there are other means we can use to influence the
> connector not to use ALLOW FILTERING.
>
> Thanks again.
> Asad
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
> *Sent:* Thursday, July 25, 2019 10:24 AM
> *To:* cassandra <user@cassandra.apache.org>
> *Subject:* Re: Performance impact with ALLOW FILTERING clause.
>
> "unpredictable" is such a loaded word. It's quite predictable, but it's
> often mispredicted by users.
>
> ALLOW FILTERING basically tells the database you're going to do a query
> that will require scanning a bunch of data to return some subset of it,
> and you're not able to provide a WHERE clause that's sufficiently
> fine-grained to avoid the scan. It's a loose equivalent of doing a full
> table scan in SQL databases - sometimes it's a valid use case, but it's
> expensive, you're ignoring all of the indexes, and you're going to do a
> lot more work.
>
> It's predictable, though - you're probably going to walk over some range
> of data. Spark is grabbing all of the data to load into RDDs, and it
> probably does that by slicing up the token range and doing a bunch of
> range scans.
>
> It's doing that so it can get ALL of the data and do the filtering /
> joining / searching in memory in Spark, rather than relying on Cassandra
> to do the scanning/searching on disk.
>
> On Thu, Jul 25, 2019 at 6:49 AM ZAIDI, ASAD A <az1...@att.com> wrote:
>
> Hello folks,
>
> I was going through the documentation and saw in many places that ALLOW
> FILTERING causes performance unpredictability. Our developers say the
> ALLOW FILTERING clause is implicitly added to a bunch of queries by the
> spark-cassandra-connector and they cannot control it; at the same time,
> we see unpredictability in application performance - just as the
> documentation says.
>
> I'm trying to understand why a connector would add a clause to a query
> when it can negatively impact database/application performance. Is it the
> data model that drives the connector's decision to add ALLOW FILTERING
> automatically, or are there other reasons this clause is added? I'm not a
> developer, but I want to know why developers don't have any control over
> this.
>
> I'll appreciate your guidance here.
>
> Thanks
> Asad
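P.S. On the question of steering the connector away from ALLOW FILTERING: the main lever is the query itself, since (as I understand it) the connector only falls back to ALLOW FILTERING when a pushed-down predicate can't be answered from the table's primary key. A sketch of the contrast with the connector's RDD API; the ks.events table, its sensor_id partition key, and the temperature column are all hypothetical:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("pushdown-example")
      .set("spark.cassandra.connection.host", "127.0.0.1") // your contact point
    val sc = new SparkContext(conf)

    // Predicate on the partition key: the connector can run this as a
    // normal CQL query against the owning replicas - no ALLOW FILTERING.
    val oneSensor = sc.cassandraTable("ks", "events")
      .where("sensor_id = ?", "sensor-42")

    // Predicate on a non-key column: pushing it down with .where would make
    // the connector append ALLOW FILTERING. Filtering on the Spark side
    // still reads the whole table (as token-range scans), but the predicate
    // is evaluated in Spark rather than by Cassandra.
    val hotReadings = sc.cassandraTable("ks", "events")
      .filter(row => row.getDouble("temperature") > 90.0)

Neither form makes a full scan cheap; if most queries need non-key predicates, that's usually a sign the table layout (or a Parquet copy, per the note above) should change instead.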