IN versus multiple asynchronous queries

2014-10-04 Thread Robert Wille
I have a table of small documents (less than 1 KB each) that are often accessed 
together as a group. The group size is always less than 50. Which produces less 
load on the server: one query using an IN clause to get all 50 back together, 
or 50 concurrent queries? Which one is faster?

Thanks

Robert



Cassandra + Solr

2014-10-04 Thread Robert Wille
I am architecting a solution for moving a large number of documents out of our 
MySQL database to C*. We use Solr to index these documents. I’ve recently 
become aware of a few different packages that integrate C* and Solr. At first 
blush, this seems like the perfect fit, as it would eliminate a complicated and 
somewhat fragile system that manages the indexing of our documents. However, I 
have a significant concern that would certainly be a show-stopper, unless I’m 
mistaken about some assumptions I’ve made. If any of you can confirm that my 
concern is justified, or let me know where I’m wrong, I’d greatly appreciate it.

So here’s what I think will be an issue. The guiding principle in setting up 
our current Solr cluster (which was done before my time) is that the index has 
to fit in the heap. Each node in our current Solr cluster has 32 GB of RAM, 
runs with a 28 GB heap, and serves up less than 28 GB of index. When I think 
about combining Solr and C*, it would seem that I’d need to cough up 8 GB for 
C*, leaving 20 GB for Solr. The entire Solr index and the entire MySQL database 
are roughly the same size. If I assume that the data will consume roughly the 
same space on C* as it does on MySQL, then each C* node would be limited to 
roughly the same amount of data as the Solr index, or about 20 GB. With a 
replication factor of 3, that works out to one node for every 7 GB or so of 
unique data. That is obviously a problem. We have about 1 TB of data. It would 
take about 150 nodes.
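
The estimate above can be reproduced with a short calculation. This is only a sketch of the arithmetic in this message: it takes 1 TB as 1000 GB, and the function name nodes_needed is made up for illustration.

```python
import math

def nodes_needed(total_data_gb, replication_factor, per_node_capacity_gb):
    """Nodes required to hold every replica when each node stores at most per_node_capacity_gb."""
    replicated_gb = total_data_gb * replication_factor
    return math.ceil(replicated_gb / per_node_capacity_gb)

# Figures from the message: about 1 TB of data (taken as 1000 GB),
# replication factor 3, about 20 GB of data per node.
unique_gb_per_node = 20 / 3
print(round(unique_gb_per_node, 1))  # 6.7
print(nodes_needed(1000, 3, 20))     # 150
```

The node count scales inversely with the per-node capacity, so the whole estimate rests on the 20 GB limit.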

Is C*/Solr integration only feasible for applications for which a small subset 
of the data is indexed in Solr, or am I mistaken about the requirement that the 
Solr index be able to fit in heap?

Thanks in advance

Robert



Re: IN versus multiple asynchronous queries

2014-10-04 Thread DuyHai Doan
Definitely 50 concurrent queries, possibly in async mode.

If you're using the IN clause with 50 values, the coordinator will block,
waiting for 50 partitions to be fetched from different nodes (worst case =
50 nodes) before responding to the client. In addition to the very high
latency, you'll put stress on the coordinator's memory.
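
As a rough illustration of the concurrent approach, the sketch below starts one single-partition query per document id and only then waits for the results. It assumes a session object that exposes execute_async(statement, parameters) and returns a future with a result() method, as the DataStax Python driver's Session does. FakeSession, the documents table and its id and body columns are hypothetical, included only so the example runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_group(session, statement, doc_ids):
    """Issue one single-partition query per id, then wait for all of them."""
    # Start every request before waiting on any of them, so they run concurrently.
    futures = [session.execute_async(statement, (doc_id,)) for doc_id in doc_ids]
    # result() blocks until that request completes (or raises if it failed).
    return [future.result() for future in futures]

class FakeSession:
    """Stand-in for a driver session (hypothetical), backed by an in-memory dict."""

    def __init__(self, rows):
        self._rows = rows
        self._pool = ThreadPoolExecutor(max_workers=8)

    def execute_async(self, statement, parameters):
        (doc_id,) = parameters
        # Look the document up on a worker thread and return a Future.
        return self._pool.submit(self._rows.get, doc_id)

session = FakeSession({1: "doc one", 2: "doc two", 3: "doc three"})
docs = fetch_group(session, "SELECT body FROM documents WHERE id = ?", [1, 2, 3])
print(docs)  # ['doc one', 'doc two', 'doc three']
```

Results come back in the order the ids were given, because the futures are collected in that order regardless of which request finishes first.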





Re: Cassandra + Solr

2014-10-04 Thread Jack Krupansky
The "requirement" is only that the Lucene (Solr) index fit in the system memory 
that the OS uses for file caching. SSDs are another matter. If somebody or 
some information source is telling you that the index must fit "in heap", 
please identify the source so that we can correct them.


If the index doesn't fit in the OS file system cache then you risk becoming 
I/O bound as Lucene does I/O to access various portions of the index.


But it all depends on your access patterns. I mean, if you typically aren't 
accessing all of your fields, then not all of the index will need to flow 
through memory or the file system cache.


But as a general proposition, plan on having enough system memory to "cache" 
your Lucene/Solr index. SSDs... are a bit different, but I'd still want most 
of the index to fit in RAM.
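
A back-of-the-envelope check of this rule, using the node sizes from the original question, might look like the sketch below. The helper names are made up for illustration, and the result is an upper bound, since the JVM and the OS use some memory beyond the heap.

```python
def file_cache_headroom_gb(ram_gb, heap_gb, other_gb=0):
    """RAM left for the OS file cache after JVM heap and other reserved memory."""
    return ram_gb - heap_gb - other_gb

def index_fits_in_cache(index_gb, ram_gb, heap_gb, other_gb=0):
    """True if the whole index could sit in the OS file cache."""
    return index_gb <= file_cache_headroom_gb(ram_gb, heap_gb, other_gb)

# Node from the original question: 32 GB RAM, 28 GB heap, just under 28 GB of index.
print(file_cache_headroom_gb(32, 28))   # 4
print(index_fits_in_cache(28, 32, 28))  # False

# Hypothetical alternative (numbers assumed): an 8 GB heap leaves 24 GB for caching.
print(index_fits_in_cache(20, 32, 8))   # True
```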


-- Jack Krupansky
