So I need to read what I write before hitting send. That should have been:
"If A works for YOUR use case" and "Wide rows DON'T spread across nodes
well".
On 09/29/2011 02:34 PM, Jeremiah Jordan wrote:
If A works for our use case, it is a much better option. A given row
has to be read in full to return data from it. There used to be a
limitation that a row had to fit in memory; there is now code to page
through the data, so while that is no longer a hard limit, rows that
don't fit in memory are still very slow to use. Also wide rows spread
across nodes.
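To make the difference concrete, here is a rough sketch of the two
layouts using a Thrift client such as pycassa (the keyspace, column
family, and key names below are invented for illustration only):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])

    # Option A: one small row per entry, ~6 columns each.
    entries = pycassa.ColumnFamily(pool, 'Entries')
    entries.insert('entry:12345', {'title': 't', 'body': 'b', 'owner': 'o'})
    row = entries.get('entry:12345')   # whole ~2 KB row comes back in one read

    # Option B: one huge row with a column per entry (~1 million columns).
    wide = pycassa.ColumnFamily(pool, 'EntriesByBucket')
    wide.insert('bucket-1', {'entry:12345': 'payload'})
    # xget() pages through the columns instead of pulling the whole row into
    # memory at once, but as noted above a 100 GB row is still slow to read.
    for name, value in wide.xget('bucket-1', buffer_size=1024):
        pass   # handle each (column name, value) pair here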
You should also consider more nodes in your cluster. In our experience,
nodes perform better when they are only managing a few hundred GB each.
I'm pretty sure that 10TB+ of data (hundreds of rows * 100 GB) will not
perform very well on a 3-node cluster, especially if you plan to have
RF=3, making it 10TB+ per node.
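To put numbers on that, a quick back-of-envelope calculation using the
figures from this thread (the row count is an assumption; "hundreds" is
taken as 100 at the low end):

    # Rough sizing for option B with the numbers mentioned above.
    rows = 100            # "hundreds" of ~100 GB rows; 100 used as the low end
    row_size_gb = 100
    rf = 3                # planned replication factor
    nodes = 3

    raw_tb = rows * row_size_gb / 1024.0    # ~9.8 TB of raw data
    per_node_tb = raw_tb * rf / nodes       # with RF=3 on 3 nodes, every node
                                            # stores a full copy of the data
    print("raw: %.1f TB, per node: %.1f TB" % (raw_tb, per_node_tb))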
-Jeremiah
On 09/29/2011 12:20 PM, M Vieira wrote:
What would be the best approach?
A) millions of ~2 KB rows, where each row could have ~6 columns
B) hundreds of ~100 GB rows, where each row could have ~1 million columns
Considerations:
Most entries will be searched for (read+write) at least once a day
but no more than 3 times a day.
Cheap hardware across the cluster of 3 nodes, each with 16 GB of memory
(heap = 8 GB)
Any input would be appreciated.
M.