So it is also terribly slow.
It does not work with materialized views (a quick hack for that is below) or
with UDTs; the latter needs more time to fix.
So I used it to retrieve the only built-in-type column, the key. To make
the task more time-consuming I extended the dataset a bit, to ~2.5M
records.
Hi Alex,
How do you generate your subrange set for running the queries?
It may happen that some of your ranges intersect data-ownership range
borders (check by running 'nodetool describering [keyspace_name]').
Range queries that cross those borders will be highly inefficient, and that
could explain your results.
Brian Hess has perhaps the best open source code example of the right way
to do this:
https://github.com/brianmhess/cassandra-loader/blob/master/src/main/java/com/datastax/loader/CqlDelimUnload.java
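
For anyone reading along, here is a minimal sketch of that idea, assuming the
DataStax Java driver 3.x (contact point, keyspace and table names below are
made up; user_id is the partition key from this thread). It takes the token
ranges straight from the driver's metadata, so every query stays inside one
ownership range instead of crossing range borders:

    import com.datastax.driver.core.*;

    public class TokenRangeScan {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // Driver token ranges are ]start, end], hence > start AND <= end.
                PreparedStatement ps = session.prepare(
                        "SELECT user_id FROM ks.users "
                      + "WHERE token(user_id) > ? AND token(user_id) <= ?");
                long count = 0;
                for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                    // unwrap() splits the range that wraps around the ring in two.
                    for (TokenRange r : range.unwrap()) {
                        for (Row row : session.execute(ps.bind()
                                .setToken(0, r.getStart())
                                .setToken(1, r.getEnd()))) {
                            count++;   // replace with real per-row work
                        }
                    }
                }
                System.out.println("rows scanned: " + count);
            }
        }
    }

(Sequential on purpose; a parallel scan is the same loop fanned out over a
thread pool.)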
On Thu, Aug 17, 2017 at 10:00 AM, Alex Kotelnikov <
alex.kotelni...@diginetica.com> wrote:
Yup, user_id is the primary key.
First of all, can you share how to "go to a node directly"?
Also, such an approach would retrieve all the data RF times; the coordinator
should have enough metadata to avoid that.
Shouldn't requesting through multiple coordinators provide a certain degree
of concurrency?
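
(For concreteness, one way to see what "going to a node directly" could look
like, again assuming the DataStax Java driver 3.x and a made-up keyspace name:
the cluster metadata exposes which hosts own each token range, so a scanner
can query every subrange exactly once via one of its replicas instead of
pulling the data RF times through arbitrary coordinators.)

    import com.datastax.driver.core.*;

    // Prints the replica set for every token range of a keyspace. A range
    // scanner can use this mapping to send each subrange query to (or near)
    // a node that actually owns that range.
    public class RangeOwnership {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
                Metadata meta = cluster.getMetadata();   // initializes the cluster
                for (TokenRange range : meta.getTokenRanges()) {
                    System.out.println(range + " -> " + meta.getReplicas("ks", range));
                }
            }
        }
    }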
On Thu, Aug 17, 2017 at 9:36 AM, Alex Kotelnikov <
alex.kotelni...@diginetica.com> wrote:
Dor,
I believe I tried it in many ways and the result is quite disappointing.
I've run my scans on 3 different clusters, one of which was running on VMs,
and I was able to scale it up and down (3-5-7 VMs, 8 to 24 cores) to see how
this affects the performance.
I also generated the flow from Spark
Hi Alex,
You probably didn't get the parallelism right. A serial scan has a
parallelism of one. If the parallelism isn't large enough, performance will
be slow.
If the parallelism is too large, Cassandra and the disks will thrash and
there will be too many context switches.
So you need to find your cluster's sweet spot.
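
To make that concrete, here is a rough sketch of the knob being described,
again assuming the DataStax Java driver 3.x and made-up keyspace/table names:
the per-token-range queries are fanned out over a fixed-size thread pool, and
PARALLELISM is the value to sweep while watching throughput to find that
sweet spot.

    import com.datastax.driver.core.*;
    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelScan {
        static final int PARALLELISM = 16;   // the knob to tune per cluster

        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                PreparedStatement ps = session.prepare(
                        "SELECT user_id FROM ks.users "
                      + "WHERE token(user_id) > ? AND token(user_id) <= ?");
                ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);
                List<Future<Long>> results = new ArrayList<>();
                for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                    for (TokenRange r : range.unwrap()) {
                        // One task per subrange; the pool size caps concurrency.
                        Callable<Long> task = () -> {
                            long n = 0;
                            for (Row row : session.execute(ps.bind()
                                    .setToken(0, r.getStart())
                                    .setToken(1, r.getEnd()))) {
                                n++;
                            }
                            return n;
                        };
                        results.add(pool.submit(task));
                    }
                }
                long total = 0;
                for (Future<Long> f : results) total += f.get();
                pool.shutdown();
                System.out.println("rows: " + total);
            }
        }
    }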
Apache Cassandra is not great in terms of performance at the moment for
batch analytics workloads that require a full table scan. I would look at
FiloDB for all the benefits and familiarity of Cassandra with better
streaming and analytics performance: https://github.com/filodb/FiloDB