Question on efficient loading from Cassandra

Roger Fischer (CW) Wed, 26 Jul 2017 17:04:09 -0700

Hello,

What is the best way to efficiently load data from a backing store such as 
Cassandra? I am looking for a solution that minimizes work in both Ignite and 
Cassandra.


As I understand:

The simplest way is to call loadCache() with a single select statement.
cache.loadCache(null,
    "select * from a_table where a_date_time >= '2017-07-25 10:00:00';");

Is it correct that:
1) Each Ignite node gets the same loadCache() request.
2) Each Ignite node sends the same query to Cassandra.
3) Each Ignite node gets all matched objects (rows) back from Cassandra.
4) Each Ignite node stores only the objects for which it has the primary 
partition, or a backup partition.

Unless I misunderstand, this simple approach has the following inefficiencies:
a) Cassandra executes the same query multiple times, once for each Ignite node.
b) The query results are transferred multiple times, once for each Ignite node.
c) Each Ignite node receives a lot of data that it does not need (rows for 
which it has neither the primary nor a backup partition).
d) Each Cassandra node has to query all partitions.
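
To put rough numbers on points a) to c), here is a small back-of-the-envelope sketch. It is plain arithmetic, not Ignite or Cassandra API; the class and method names are hypothetical, and it assumes every node receives every matched row, as described in steps 1) to 4) above.

```java
// Rough cost model for the single-query loadCache() approach described above.
// All names are hypothetical; this is arithmetic, not Ignite or Cassandra API.
final class LoadCostSketch {
    /** Rows shipped from Cassandra when every Ignite node runs the same query. */
    static long rowsTransferred(int igniteNodes, long matchedRows) {
        return (long) igniteNodes * matchedRows;
    }

    /** Rows that actually need to be stored in Ignite: one primary copy plus the backups. */
    static long rowsNeeded(long matchedRows, int backups) {
        return matchedRows * (1L + backups);
    }

    /** Fraction of the transferred rows that the receiving node then discards. */
    static double wastedFraction(int igniteNodes, long matchedRows, int backups) {
        long transferred = rowsTransferred(igniteNodes, matchedRows);
        return 1.0 - (double) rowsNeeded(matchedRows, backups) / transferred;
    }

    public static void main(String[] args) {
        // Assumed example: 8 Ignite nodes, 1,000,000 matching rows, 1 backup.
        System.out.println(rowsTransferred(8, 1_000_000L));   // 8000000
        System.out.println(rowsNeeded(1_000_000L, 1));        // 2000000
        System.out.println(wastedFraction(8, 1_000_000L, 1)); // 0.75
    }
}
```

Under these assumed numbers, three quarters of the transferred rows are thrown away, and the waste grows with the number of Ignite nodes.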

loadCache() supports multiple queries. This allows the query to be broken down, 
ideally (for this case) into one query per Cassandra partition.

cache.loadCache(null,
    "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00';",
    "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00';",
    ...);

This optimizes the Cassandra query, as each query is constrained to one 
Cassandra partition.
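
Rather than writing the queries out by hand, they could be generated. A minimal sketch follows, assuming the partition_key values are the integers 0 to N-1 (which is my assumption for illustration); the class and method names are hypothetical, and the loadCache() call is only shown as a comment because it needs a running Ignite cache.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds one CQL string per partition_key value.
// Table and column names are taken from the example queries in this message.
final class PartitionQueries {
    static List<String> build(int partitionKeyCount, String since) {
        List<String> queries = new ArrayList<>();
        for (int key = 0; key < partitionKeyCount; key++) {
            queries.add("select * from a_table where partition_key = " + key
                + " and a_date_time >= '" + since + "';");
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> queries = build(4, "2017-07-25 10:00:00");
        queries.forEach(System.out::println);
        // With an Ignite cache in scope, the queries would be passed as varargs:
        // cache.loadCache(null, queries.toArray());
    }
}
```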

But, I think, each node still needs to execute each query. Thus none of the 
other inefficiencies are eliminated.

I believe that, when multiple cores (worker threads) are available, the Ignite 
nodes will execute multiple queries in parallel. So, there is a reduction in 
elapsed time. Correct?

Now, is there any way to avoid Cassandra executing the same query multiple 
times, and the data being transferred multiple times?

One approach would be for each Ignite node to modify the query so that it only 
covers the data for which that node holds the primary or a backup partition. 
That eliminates some duplication, but may not result in efficient queries in 
Cassandra.

Another approach is for an Ignite node to forward objects for which it holds 
neither the primary nor a backup partition (similar to when an application 
does a put()). That would optimize the Cassandra query, but require additional 
communication between Ignite nodes.

What if Ignite and Cassandra partitions were aligned? Then queries could be 
created that only return data relevant to the node and only query a subset of 
Cassandra partitions. But this does not seem practical for a generalized 
system (I think).
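
To illustrate what I mean by aligned partitions, here is a minimal sketch. It assumes a made-up ownership rule (partition_key modulo the node count), which is NOT how Ignite's affinity function actually assigns partitions, and it ignores backups. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: each node builds queries just for the Cassandra partitions it "owns".
final class AlignedQueries {
    /** Illustrative owner assignment only; Ignite's real affinity function differs. */
    static int ownerOf(int partitionKey, int nodeCount) {
        return partitionKey % nodeCount;
    }

    /** Queries that the node with the given index would run; backups are ignored. */
    static List<String> forNode(int nodeIndex, int nodeCount,
                                int partitionKeyCount, String since) {
        List<String> queries = new ArrayList<>();
        for (int key = 0; key < partitionKeyCount; key++) {
            if (ownerOf(key, nodeCount) == nodeIndex) {
                queries.add("select * from a_table where partition_key = " + key
                    + " and a_date_time >= '" + since + "';");
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Assumed example: 3 Ignite nodes, 6 partition_key values; node 1 gets keys 1 and 4.
        forNode(1, 3, 6, "2017-07-25 10:00:00").forEach(System.out::println);
    }
}
```

With such a rule each Cassandra partition would be queried once and each row transferred once, but it ties the Ignite partitioning to the Cassandra schema, which is why I doubt it generalizes.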

Any other suggestions?

Thanks...

Roger

PS: The use case for this is to use Ignite as an SQL cache for a large data set 
in the Cassandra DB. The most recent data is pre-loaded (and updated) in 
Ignite. When older data is required, it is loaded first into Ignite, and then 
processed. It is this dynamic loading that should be quick (and efficient).
