If you're thinking about rewriting your data to be more performant for
analytics, you might as well go the distance and put it in an
analytics-friendly format like Parquet. My 2 cents.
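
For example, a minimal sketch with the Spark Cassandra Connector (keyspace,
table, and output path are placeholders; assumes the connector is on the
classpath and spark.cassandra.connection.host is set):

    // spark: your SparkSession (e.g. in spark-shell)
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
      .load()

    // One-time columnar copy for analytics workloads.
    df.write.mode("overwrite").parquet("hdfs:///analytics/my_table")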

On Thu, Jul 25, 2019 at 11:01 AM ZAIDI, ASAD A <az1...@att.com> wrote:

> Thank you all for your insights.
>
>
>
> When the Spark connector adds ALLOW FILTERING to a query, it makes the
> query just ‘run’, regardless of whether it is expensive on a larger table
> or not so expensive on a table with fewer rows.
>
> In my particular case, nodes are reaching a 2 TB/node load in a 50-node
> cluster. When a bunch of such queries run, they impact server resources.
>
>
>
> Since ALLOW FILTERING is an expensive operation, I’m trying to find knobs
> which, if I turn them, mitigate the impact.
>
>
>
> What I think, correct me if I am wrong, is that it is the query design
> itself that is not optimized for the table design, and that in turn causes
> the connector to add ALLOW FILTERING implicitly. I’m not thinking of adding
> secondary indexes on the tables because they have their own overheads.
> Kindly share if there are other means we can use to influence the connector
> not to use ALLOW FILTERING, for example along the lines of the sketch below.
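>
> What I mean by query design, as a sketch (keyspace, table, and column
> names are placeholders; the option names are from the connector 2.x
> reference and should be verified against your version):
>
>     import spark.implicits._
>
>     val df = spark.read
>       .format("org.apache.spark.sql.cassandra")
>       .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
>       .load()
>
>     // Restricting on the partition key lets the connector issue a
>     // targeted query instead of a filtered scan.
>     val targeted = df.filter($"partition_key_col" === 42)
>
>     // Read-tuning knobs that bound how hard each scan hits the cluster:
>     spark.conf.set("spark.cassandra.input.split.sizeInMB", "64")
>     spark.conf.set("spark.cassandra.input.fetch.sizeInRows", "500")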
>
>
>
> Thanks again.
>
> Asad
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
> *Sent:* Thursday, July 25, 2019 10:24 AM
> *To:* cassandra <user@cassandra.apache.org>
> *Subject:* Re: Performance impact with ALLOW FILTERING clause.
>
>
>
> "unpredictable" is such a loaded word. It's quite predictable, but it's
> often mispredicted by users.
>
>
>
> "ALLOW FILTERING" basically tells the database you're going to do a query
> that will require scanning a bunch of data to return some subset of it, and
> you're not able to provide a WHERE clause that's sufficiently fine-grained
> to avoid the scan. It's a loose equivalent of doing a full table scan in
> SQL databases - sometimes it's a valid use case, but it's expensive, you're
> ignoring all of the indexes, and you're going to do a lot more work.
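>
> A tiny sketch of the difference (hypothetical table my_ks.events with
> partition key user_id; uses the connector's session helper):
>
>     import com.datastax.spark.connector.cql.CassandraConnector
>
>     CassandraConnector(sc.getConf).withSessionDo { session =>
>       // Partition-key lookup: Cassandra seeks straight to one partition.
>       session.execute("SELECT * FROM my_ks.events WHERE user_id = 42")
>
>       // Non-key predicate: only accepted with ALLOW FILTERING, because
>       // it forces a scan across every partition in the table.
>       session.execute(
>         "SELECT * FROM my_ks.events WHERE payload = 'x' ALLOW FILTERING")
>     }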
>
>
>
> It's predictable, though - you're probably going to walk over some range
> of data. Spark is grabbing all of the data to load into RDDs, and it
> probably does it by slicing up the range, doing a bunch of range scans.
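>
> As a sketch (the query shape is illustrative, not lifted from the
> connector source):
>
>     import com.datastax.spark.connector._
>
>     // One Spark partition per token slice; for each slice the connector
>     // runs roughly:
>     //   SELECT ... FROM my_ks.my_table
>     //   WHERE token(pk) > ? AND token(pk) <= ? ALLOW FILTERING
>     val rdd = sc.cassandraTable("my_ks", "my_table")
>     println(rdd.partitions.length)  // how many slices it will scan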
>
>
>
> It's doing that so it can get ALL of the data and do the filtering /
> joining / searching in-memory in spark, rather than relying on cassandra to
> do the scanning/searching on disk.
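>
> In DataFrame terms (sketch; df is a table read through the connector,
> column names are hypothetical):
>
>     import spark.implicits._
>
>     // Pushed down: Cassandra evaluates the partition-key predicate.
>     df.filter($"user_id" === 42)
>
>     // Not pushed down (plain non-indexed column): every row crosses the
>     // wire and Spark filters it in executor memory.
>     df.filter($"payload" === "x")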
>
>
>
> On Thu, Jul 25, 2019 at 6:49 AM ZAIDI, ASAD A <az1...@att.com> wrote:
>
> Hello Folks,
>
>
>
> I was going through the documentation and saw in many places that ALLOW
> FILTERING causes performance unpredictability. Our developers say the
> ALLOW FILTERING clause is implicitly added to a bunch of queries by the
> Spark-Cassandra connector and they cannot control it; at the same time, we
> see unpredictability in application performance, just as the documentation
> says.
>
>
>
> I’m trying to understand why a connector would add a clause to a query
> when it can negatively impact database/application performance. Is it the
> data model that drives the connector to make its decision and add ALLOW
> FILTERING to the query automatically, or are there other reasons this
> clause is added? I’m not a developer, but I want to know why developers
> don’t have any control over this.
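>
> (One way to see that decision per query, as a sketch with a placeholder
> column name: explain() prints the physical plan, and its PushedFilters
> list shows which predicates the connector handed to Cassandra; anything
> not listed there is filtered by Spark in memory.)
>
>     df.filter($"some_col" === "x").explain()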
>
>
>
> I’ll appreciate your guidance here.
>
>
>
> Thanks
>
> Asad
>
